Open eric-czech opened 4 years ago

It would be useful for the read_bgen function to have arguments for custom metafile and sample paths. These should pass down to the corresponding bgen_reader.read_bgen arguments. The pressing need I have for this is that bgen_reader requires filesystem properties that aren't supported by gcsfuse.

I imagine I can get around this temporarily by having the metafiles written to a local directory instead, but I'm not sure how we should do this in a distributed environment with remote storage.
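A sketch of what that pass-through could look like, assuming bgen_reader's metafile_filepath and samples_filepath arguments (the wrapper below is hypothetical, not the current sgkit signature):

```python
import bgen_reader


def read_bgen(path, metafile_path=None, sample_path=None):
    # Hypothetical wrapper: forward custom metafile/sample locations to
    # bgen_reader rather than relying on its default of writing
    # "<path>.metafile" next to the BGEN file, which fails on read-only
    # mounts like gcsfuse.
    bgen = bgen_reader.read_bgen(
        path,
        metafile_filepath=metafile_path,
        samples_filepath=sample_path,
        verbose=False,
    )
    # ... build the dataset from bgen["variants"], bgen["samples"] and
    # bgen["genotype"] as the existing reader already does ...
    return bgen
```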
We can certainly do that, but I wonder how gcsfuse is going to perform for large files anyway. This may be a reason to keep the PyBGEN reader around...
Is the metafile only a useful optimization after the file has been read in full at least once? I'm wondering if we should push upstream to have its use be optional. Most applications I can think of would only involve a few full scans of the data and no random access.
No, I think its use is crucial. Metafile construction is fairly cheap (cheaper than reading the whole file) and is needed to find the virtual offsets that allow the file to be read in parallel.
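For reference, building the metafile up front is a one-off step in bgen-reader (a sketch assuming its create_metafile helper and a hypothetical path):

```python
from bgen_reader import create_metafile, read_bgen

bgen_path = "ukb_chr22.bgen"  # hypothetical path
metafile_path = bgen_path + ".metafile"

# Single pass over the BGEN file that records the virtual offsets,
# letting later reads seek straight to blocks of variants in parallel.
create_metafile(bgen_path, metafile_path, verbose=False)

# Subsequent reads reuse the metafile instead of rescanning the file.
bgen = read_bgen(bgen_path, metafile_filepath=metafile_path, verbose=False)
```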
Ah gotcha. Do you have any suggestions on what our strategy should be for doing this on a cluster? I'm stumped as to where to go next with UKB.
On one hand, it seems straightforward enough to simply copy certain files to local persistent disks, but they're only broken up by chromosome, so getting more than ~25 reads in parallel could get tricky: copies and reads would potentially need to be coordinated across chromosome boundaries.
Is there a BGEN file per chromosome? If so, one thing to try would be to copy the BGEN file to local persistent disk and run the conversion there (with multiple threads). This would give a baseline, and give an idea of how feasible it would be to convert each BGEN file on a separate cloud box (each with a hundred cores say).
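That baseline could be as simple as the following (hypothetical paths; assumes sgkit-bgen's read_bgen and a gsutil copy onto the persistent disk):

```python
import subprocess

import sgkit_bgen

gcs_path = "gs://my-bucket/ukb/ukb_chr22.bgen"    # hypothetical
local_path = "/mnt/disks/scratch/ukb_chr22.bgen"  # hypothetical

# Stage the BGEN file onto local persistent disk so the metafile pass
# and the parallel variant reads both hit fast local storage.
subprocess.run(["gsutil", "cp", gcs_path, local_path], check=True)

# Conversion runs against the local copy; dask parallelizes the chunked
# reads across however many cores the box has.
ds = sgkit_bgen.read_bgen(local_path)
ds.to_zarr("/mnt/disks/scratch/ukb_chr22.zarr")
```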
Another approach would be to use PyBGEN (resurrecting the old code I wrote) and run it on a cluster against GCS directly. PyBGEN uses bgenix index files (not metafiles), which need to be generated before the BGEN files can be read. The bgenix utility is in C++ and doesn't support cloud stores. I think Glow might have support for that, or we could write something ourselves. It would be good to know how PyBGEN and bgen-reader compare performance-wise before we do that, though.
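For comparison, PyBGEN's access pattern looks roughly like this (a sketch of the PyBGEN API with a hypothetical path; it expects the bgenix .bgi index next to the file):

```python
from pybgen import PyBGEN

# PyBGEN reads the bgenix index (ukb_chr22.bgen.bgi) rather than a
# bgen_reader metafile, so no separate metafile pass is needed.
with PyBGEN("ukb_chr22.bgen") as bgen:  # hypothetical path
    for variant, dosage in bgen.iter_variants():
        # `variant` carries name/chrom/pos/alleles; `dosage` is a numpy
        # array of per-sample dosages.
        print(variant.name, dosage.mean())
```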
> Is there a BGEN file per chromosome?

There is, as well as an accompanying bgen index file.
> Another approach would be to use PyBGEN (resurrecting the old code I wrote) and run it on a cluster against GCS directly

That's all here, right?
> It would be good to know how PyBGEN and bgen-reader compare performance-wise before we do that, though.
Makes sense. Given that the index files are there, presumably it would be pretty easy to figure out how much slower PyBGEN is for converting a smaller, local file.
> That's all here, right?
Correct, although it returns the old xarray representation.
> Given that the index files are there, presumably it would be pretty easy to figure out how much slower PyBGEN is for converting a smaller, local file.
Good idea, maybe try on chr22?
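A quick harness for that comparison might look like this (hypothetical local path; assumes PyBGEN's iter_variants and bgen-reader's delayed genotype list; note that bgen-reader's first call also pays the metafile-creation cost):

```python
import time

from bgen_reader import read_bgen
from pybgen import PyBGEN

path = "ukb_chr22.bgen"  # hypothetical local copy
n_variants = 1_000       # time a sample of variants from each reader

# PyBGEN: sequential scan driven by the bgenix .bgi index.
start = time.perf_counter()
with PyBGEN(path) as bgen:
    for i, (variant, dosage) in enumerate(bgen.iter_variants()):
        if i == n_variants:
            break
print(f"PyBGEN:      {time.perf_counter() - start:.1f}s")

# bgen-reader: per-variant delayed reads driven by the metafile.
start = time.perf_counter()
bgen = read_bgen(path, verbose=False)
for genotype in bgen["genotype"][:n_variants]:
    genotype.compute()
print(f"bgen-reader: {time.perf_counter() - start:.1f}s")
```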
Update: Copying the files locally has worked well enough and plays nicely with a typical snakemake-on-gke setup. Being able to specify the metafile and samples locations would still be helpful, though, since UKB does not use the default naming convention and it would provide some parity with sgkit-plink.
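For anyone replicating this, the copy-locally-then-convert step fits in one Snakemake rule; a sketch assuming the pre-v8 GS remote provider and sgkit-bgen (bucket and paths are hypothetical):

```python
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider

GS = GSRemoteProvider()

rule convert_chrom:
    input:
        # Snakemake stages the remote object to a local path before the
        # rule body runs, so all BGEN reads hit local disk.
        bgen=GS.remote("my-bucket/ukb/ukb_chr{chrom}.bgen"),
    output:
        zarr=directory("zarr/ukb_chr{chrom}.zarr"),
    run:
        import sgkit_bgen

        ds = sgkit_bgen.read_bgen(str(input.bgen))
        ds.to_zarr(str(output.zarr))
```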