mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx

usage on google cloud storage and more general file handling #165

Closed hardingnj closed 2 years ago

hardingnj commented 4 years ago

Hi,

I would like to use pyfaidx on Google Cloud Storage, i.e. via gcsfs. I'm just working with uncompressed .fa files for now; I think compressed files may be a bit more complex.

The simplest fix would be to allow the user to specify a custom file opening function, in this case passing:

import gcsfs
import pyfaidx

g = gcsfs.GCSFileSystem()
fa_fsmap = g.get_mapper(fa_path)
g.open(fa_fsmap)

pyfaidx.Fasta(fa_path, fasta_opener=...)

This is a bit awkward though, as the function couldn't operate on the filepath argument; it needs the mapper function from gcsfs.

I wondered if it would be better to allow the user to pass in an open file handle directly, but I guess this makes locating the accompanying index file impossible unless a handle to the index is also provided.
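To make the "pass in both handles" idea concrete, here is a minimal sketch of random access given an already-open FASTA handle plus an already-open .fai handle. It is not pyfaidx's implementation: `read_fai` and `fetch` are hypothetical helper names, and the in-memory `io` objects stand in for whatever file-like objects a remote filesystem would return. It assumes the standard five-column .fai layout (name, length, offset, bases per line, bytes per line).

```python
import io


def read_fai(index_handle):
    """Parse a text-mode .fai handle into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    for line in index_handle:
        name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
        index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index


def fetch(fasta_handle, index, name, start, end):
    """Return bases [start, end) (0-based) of record `name` from a seekable binary handle."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)
    if start >= end:
        return ""

    def byte_pos(i):
        # Byte position of base i: skip whole lines (including newlines), then the remainder.
        return offset + (i // linebases) * linewidth + i % linebases

    fasta_handle.seek(byte_pos(start))
    raw = fasta_handle.read(byte_pos(end - 1) + 1 - byte_pos(start))
    # Drop line terminators that fall inside the requested range.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Stand-ins for handles that would come from a remote filesystem.
fasta_handle = io.BytesIO(b">chr1\nACGTAC\nGTTT\n>chr2\nGGCC\n")
index_handle = io.StringIO("chr1\t10\t6\t6\t7\nchr2\t4\t24\t4\t5\n")

index = read_fai(index_handle)
subsequence = fetch(fasta_handle, index, "chr1", 4, 8)
```

Because nothing here touches a path, the caller decides how both handles are opened, which is the point of this option; the cost is that the caller must always supply the index as well.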

Given that, would you consider delegating file opening to fsspec.open from the fsspec package (https://github.com/intake/filesystem_spec)? This would have the added advantage of supporting BGZF opening in a better way than checking the file extension.
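As an illustration of what "better than checking the file extension" could look like, one option is to sniff the first bytes of the opened handle. This is only a sketch under assumptions: `sniff_compression` is a hypothetical helper, not part of pyfaidx or fsspec, and it relies on the BGZF convention that each block is a gzip member with the FEXTRA flag set and a `BC` extra subfield of length 2.

```python
import io
import struct

GZIP_MAGIC = b"\x1f\x8b"


def sniff_compression(handle):
    """Classify a seekable binary handle as 'bgzf', 'gzip', or 'plain' from its header."""
    start = handle.tell()
    try:
        header = handle.read(12)
        if len(header) < 12 or header[:2] != GZIP_MAGIC:
            return "plain"
        if not header[3] & 0x04:  # FEXTRA flag not set, so no BGZF subfield is possible
            return "gzip"
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
        pos = 0
        # Walk the extra subfields looking for SI1='B' (66), SI2='C' (67), SLEN=2.
        while pos + 4 <= len(extra):
            si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
            if (si1, si2, slen) == (66, 67, 2):
                return "bgzf"
            pos += 4 + slen
        return "gzip"
    finally:
        # Leave the handle where we found it so the caller can read from the start.
        handle.seek(start)


# In-memory stand-in for a handle returned by a remote filesystem.
kind = sniff_compression(io.BytesIO(b">seq1\nACGT\n"))
```

The check works on any seekable binary file object, so it does not matter whether the handle came from local disk or a remote store.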

Happy to submit a PR with either solution.

mdshw5 commented 4 years ago

I didn't know about the fsspec package, but that looks like a good solution. Feel free to open a PR if you like.