Open eric-czech opened 4 years ago

It would be useful for the read_bgen function to have arguments for custom metafile and sample paths. These should pass down to the corresponding bgen_reader.read_bgen arguments. The pressing need I have for this is that bgen_reader requires filesystem properties that aren't supported by gcsfuse.

I imagine I can get around this temporarily by having the metafiles written to a local directory instead, but I'm not sure how we should do this in a distributed environment with remote storage.
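A sketch of what that pass-through could look like, assuming bgen_reader's metafile_filepath and samples_filepath arguments (the wrapper below is hypothetical, not the current sgkit signature):

```python
import bgen_reader


def read_bgen(path, metafile_path=None, sample_path=None):
    # Hypothetical wrapper: forward custom metafile/sample locations to
    # bgen_reader rather than relying on its default of writing
    # "<path>.metafile" next to the BGEN file, which fails on read-only
    # mounts like gcsfuse.
    bgen = bgen_reader.read_bgen(
        path,
        metafile_filepath=metafile_path,
        samples_filepath=sample_path,
        verbose=False,
    )
    # ... build the dataset from bgen["variants"], bgen["samples"] and
    # bgen["genotype"] as the existing reader already does ...
    return bgen
```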
We can certainly do that, but I wonder how gcsfuse is going to perform for large files anyway. This may be a reason to keep the PyBGEN reader around...
Is the metafile only a useful optimization after the file has been read in full at least once? I'm wondering if we should push upstream to have its use be optional. Most applications I can think of would only involve a few full scans of the data and no random access.
No, I think its use is crucial. Metafile construction is fairly cheap (cheaper than reading the whole file) and is needed to find the virtual offsets that allow the file to be read in parallel.
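For reference, building the metafile up front is a one-off step in bgen-reader (a sketch assuming its create_metafile helper and a hypothetical path):

```python
from bgen_reader import create_metafile, read_bgen

bgen_path = "ukb_chr22.bgen"  # hypothetical path
metafile_path = bgen_path + ".metafile"

# Single pass over the BGEN file that records the virtual offsets,
# letting later reads seek straight to blocks of variants in parallel.
create_metafile(bgen_path, metafile_path, verbose=False)

# Subsequent reads reuse the metafile instead of rescanning the file.
bgen = read_bgen(bgen_path, metafile_filepath=metafile_path, verbose=False)
```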
Ah gotcha. Do you have any suggestions on what our strategy should be for doing this on a cluster? I'm stumped as to where to go next with UKB.
On one hand, it seems straightforward enough to simply copy certain files to local persistent disks, but they're only broken up by chromosome, so getting more than ~25 reads in parallel could get tricky: copies and reads would potentially need to be coordinated across chromosome boundaries.
Is there a BGEN file per chromosome? If so, one thing to try would be to copy the BGEN file to local persistent disk and run the conversion there (with multiple threads). This would give a baseline, and give an idea of how feasible it would be to convert each BGEN file on a separate cloud box (each with a hundred cores say).
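That baseline could be as simple as the following (hypothetical paths; assumes sgkit-bgen's read_bgen and a gsutil copy onto the persistent disk):

```python
import subprocess

import sgkit_bgen

gcs_path = "gs://my-bucket/ukb/ukb_chr22.bgen"    # hypothetical
local_path = "/mnt/disks/scratch/ukb_chr22.bgen"  # hypothetical

# Stage the BGEN file onto local persistent disk so the metafile pass
# and the parallel variant reads both hit fast local storage.
subprocess.run(["gsutil", "cp", gcs_path, local_path], check=True)

# Conversion runs against the local copy; dask parallelizes the chunked
# reads across however many cores the box has.
ds = sgkit_bgen.read_bgen(local_path)
ds.to_zarr("/mnt/disks/scratch/ukb_chr22.zarr")
```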
Another approach would be to use PyBGEN (resurrecting the old code I wrote) and run it on a cluster against GCS directly. PyBGEN uses bgenix index files (not metafiles), which need to be generated before the BGEN files can be read. The bgenix utility is in C++ and doesn't support cloud stores. I think Glow might have support for that, or we could write something ourselves. It would be good to know how PyBGEN and bgen-reader compare performance-wise before we do that, though.
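For comparison, PyBGEN's access pattern looks roughly like this (a sketch of the PyBGEN API with a hypothetical path; it expects the bgenix .bgi index next to the file):

```python
from pybgen import PyBGEN

# PyBGEN reads the bgenix index (ukb_chr22.bgen.bgi) rather than a
# bgen_reader metafile, so no separate metafile pass is needed.
with PyBGEN("ukb_chr22.bgen") as bgen:  # hypothetical path
    for variant, dosage in bgen.iter_variants():
        # `variant` carries name/chrom/pos/alleles; `dosage` is a numpy
        # array of per-sample dosages.
        print(variant.name, dosage.mean())
```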
> Is there a BGEN file per chromosome?

There is, as well as an accompanying bgen index file.
> Another approach would be to use PyBGEN (resurrecting the old code I wrote) and run it on a cluster against GCS directly

That's all here, right?
> It would be good to know how PyBGEN and bgen-reader compare performance-wise before we do that, though.
Makes sense. Given that the index files are there, presumably it would be pretty easy to figure out how much slower PyBGEN is for converting a smaller, local file.
> That's all here, right?
Correct, although it returns the old xarray representation.
> Given that the index files are there, presumably it would be pretty easy to figure out how much slower PyBGEN is for converting a smaller, local file.
Good idea, maybe try on chr22?
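A quick harness for that comparison might look like this (hypothetical local path; assumes PyBGEN's iter_variants and bgen-reader's delayed genotype list; note that bgen-reader's first call also pays the metafile-creation cost):

```python
import time

from bgen_reader import read_bgen
from pybgen import PyBGEN

path = "ukb_chr22.bgen"  # hypothetical local copy
n_variants = 1_000       # time a sample of variants from each reader

# PyBGEN: sequential scan driven by the bgenix .bgi index.
start = time.perf_counter()
with PyBGEN(path) as bgen:
    for i, (variant, dosage) in enumerate(bgen.iter_variants()):
        if i == n_variants:
            break
print(f"PyBGEN:      {time.perf_counter() - start:.1f}s")

# bgen-reader: per-variant delayed reads driven by the metafile.
start = time.perf_counter()
bgen = read_bgen(path, verbose=False)
for genotype in bgen["genotype"][:n_variants]:
    genotype.compute()
print(f"bgen-reader: {time.perf_counter() - start:.1f}s")
```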
Update: Copying the files locally has worked well enough and plays nicely with a typical snakemake-on-gke setup. Being able to specify the metafile and samples locations would still be helpful, though, since UKB does not use the default naming convention and it would provide some parity with sgkit-plink.
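For anyone replicating this, the copy-locally-then-convert step fits in one Snakemake rule; a sketch assuming the pre-v8 GS remote provider and sgkit-bgen (bucket and paths are hypothetical):

```python
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider

GS = GSRemoteProvider()

rule convert_chrom:
    input:
        # Snakemake stages the remote object to a local path before the
        # rule body runs, so all BGEN reads hit local disk.
        bgen=GS.remote("my-bucket/ukb/ukb_chr{chrom}.bgen"),
    output:
        zarr=directory("zarr/ukb_chr{chrom}.zarr"),
    run:
        import sgkit_bgen

        ds = sgkit_bgen.read_bgen(str(input.bgen))
        ds.to_zarr(str(output.zarr))
```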