Closed — hardingnj closed this 2 years ago
Thanks for working on this @hardingnj! I'll take a look and see why tests aren't passing. The use of BGZF is meant to allow block-based access to gzip compressed files (https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html). The Bio.bgzf module is in fact using the python gzip module, but with extra logic to allow construction of virtual offsets into these compressed blocks.
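The virtual-offset scheme mentioned above can be sketched in a few lines (a simplified illustration of the packing described in the BGZF spec and the linked post, not the `Bio.bgzf` implementation itself): the upper 48 bits of a 64-bit offset hold the compressed block's start position in the file, and the lower 16 bits hold the position within the decompressed block.

```python
def make_virtual_offset(block_start, within_block):
    """Pack (block start, within-block offset) into one 64-bit integer."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within-block offset must fit in 16 bits")
    return (block_start << 16) | within_block

def split_virtual_offset(virtual_offset):
    """Recover the (block start, within-block offset) pair."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF

voffset = make_virtual_offset(65536, 10)
assert split_virtual_offset(voffset) == (65536, 10)
```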
Hi @mdshw5,

I've added a bit more work. I still have test failures, but I think these are expected failures that might be handled OK in fsspec.

Also added an `__array__` method to `FastaRecord`, which I think is the preferred way of creating numpy arrays.
Thanks. The test that is failing is due to Python's universal newline support. pyfaidx needs to account for the number of line terminator characters to properly build its index (https://github.com/mdshw5/pyfaidx/commit/f314b595af6cba52828f8ed756bf5abffcca6629). The solution is to revert to opening the file in binary mode so we can count the line terminator characters.
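To illustrate why binary mode matters here (a standalone sketch, not pyfaidx's actual indexing code): the `.fai` index stores the line length in bytes including the terminator, and universal-newline text mode hides the difference between `\n` and `\r\n`.

```python
import io

data = b">chr1\r\nACGT\r\nACGT\r\n"  # FASTA with Windows line endings

# Binary mode preserves the terminator bytes, so we can count them.
raw_line = io.BytesIO(data).readlines()[1]       # b"ACGT\r\n"
bytes_per_line = len(raw_line)                   # 6 bytes on disk
bases_per_line = len(raw_line.rstrip(b"\r\n"))   # 4 bases

# Universal-newline text mode translates "\r\n" to "\n", so the same
# line appears one byte shorter and the stored byte offsets go wrong.
text_line = io.TextIOWrapper(io.BytesIO(data)).readlines()[1]  # "ACGT\n"
assert bytes_per_line == 6 and bases_per_line == 4 and len(text_line) == 5
```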
It seems like the `Bio.bgzf` module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.
> Thanks. The test that is failing is due to python's universal newline support. pyfaidx needs to account for the number of line terminator characters to properly build its index (f314b59). The solution is to revert to opening the file in binary mode so we can count the line terminator characters.

Ahah - I thought it was something to do with that... though I didn't understand why you would expect different line lengths depending on the presence of Windows line endings.
> It seems like the `Bio.bgzf` module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?
> It seems like the `Bio.bgzf` module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.
>
> I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?
It's necessary to be able to seek past all of the 64k blocks that don't contain the sequence you need, and also seek to the correct position within the block of interest. @peterjc did a fantastic job documenting his module, and the section on file handle offsets is more than worth your time to read. Thanks for your contributions here - I do think the fsspec module makes sense but I do want to make sure to incorporate BGZF support properly.
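As a rough illustration of that seeking behaviour (a toy stand-in using plain concatenated gzip members, not `Bio.bgzf` itself — real BGZF blocks carry extra header fields and a 64 KiB size limit):

```python
import gzip
import io
import zlib

# Write two independently compressed gzip members, like BGZF's blocks,
# recording the raw byte offset where each block starts.
raw = io.BytesIO()
starts = []
for payload in (b"block-one-data", b"block-two-data"):
    starts.append(raw.tell())
    raw.write(gzip.compress(payload))

def read_at(handle, block_start, within_block):
    """Seek straight past earlier blocks, inflate only the block of
    interest, then offset into its decompressed contents."""
    handle.seek(block_start)
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # one gzip member
    block = b""
    while not decomp.eof:
        block += decomp.decompress(handle.read(4096))
    return block[within_block:]

# Jump directly into the second block without decompressing the first.
assert read_at(raw, starts[1], 6) == b"two-data"
```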
I've also looked more closely at the test results and my BGZF tests are indeed failing without the `Bio.bgzf` module. I've got some work in #164 that will make the BGZF support more robust and improve performance as well.
Thanks @mdshw5. Of course - very happy to make the effort to integrate properly. I'll need to read a bit further into bgzf first though.
I came across this PR yesterday when exploring whether some of our APIs could be changed to support fsspec file-like objects. It would be desirable to allow users to pass their own `fsspec.OpenFile` if possible, and only create a new one in `__init__` if a string is provided. E.g.
```python
def __init__(self, filename, ...):
    if isinstance(filename, str):
        self.file = fsspec.open(filename)
    else:
        self.file = filename  # fsspec.OpenFile
```
The advantage of this choice is that users can configure any file object they'd like (auth, caching, etc.). It's important to note that `fsspec.OpenFile` objects don't actually hold a low-level file object but are containers. Low-level file objects are only created when the `with` context manager is used:
```python
with self.file as f:
    f  # now the low-level file object is created
```
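For anyone unfamiliar with this pattern, here is a minimal stand-in (a hypothetical `LazyOpenFile`, not fsspec's implementation) showing a container that only creates the OS-level handle inside `with`:

```python
import os
import tempfile

class LazyOpenFile:
    """Toy container: stores the path and mode, opens a real handle
    only on entering a `with` block, and closes it again on exit."""
    def __init__(self, path, mode="rb"):
        self.path, self.mode = path, mode
        self._f = None
    def __enter__(self):
        self._f = open(self.path, self.mode)
        return self._f
    def __exit__(self, *exc):
        self._f.close()
        self._f = None

with tempfile.NamedTemporaryFile(suffix=".fa", delete=False) as tmp:
    tmp.write(b">seq\nACGT\n")

opener = LazyOpenFile(tmp.name)   # no file handle held yet
with opener as f:                 # handle created here
    first = f.readline()
with opener as f:                 # the container is reusable
    assert f.readline() == first == b">seq\n"
os.unlink(tmp.name)
```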
This means it's OK to pass around these objects rather than an opener function. In addition, `Bio.bgzf.BgzfReader` can be configured explicitly to wrap a file object via `bgzf.BgzfReader(fileobj=fileobj)`, so usage could look something like:
```python
with self.file as f:
    if self._bgzf:
        f = bgzf.BgzfReader(fileobj=f)
    ...
```
I tried to implement something on my own yesterday, but ran into some issues with how to handle the index file.
@manzt Thanks for the detailed explanation and the example code. I have to admit that I did not understand fsspec but am now a bit more clear about why this is useful. If you want to contribute any code, even if it doesn't fully work I would be glad to collaborate.
Thanks @manzt and @mdshw5 , I really think this is worthwhile- happy to help develop the PR to address the above issues.
Speaking for the `Bio/bgzf.py` code in Biopython, I'm open to (hopefully backward compatible) changes, but as yet am unfamiliar with fsspec. It may be only changes here in pyfaidx are needed.
> It may be only changes here in pyfaidx are needed.

To a first approximation, I don't think there will be any required changes to `Bio/bgzf.py`, since `BgzfReader` can take a file-like object instead of a path.
Hi, as discussed in #165 here is a PR leveraging fsspec.
The code changes are straightforward. Though I am having trouble getting tests to pass. Any pointers would be appreciated.
The other thing that might be worth considering is the use of `Bio.bgzf`. I think that reading of gz files can be done directly, as bgz should be fully compatible. It might be a case of using `fsspec.open()` with the compression set as "auto". Although there might be a use of bgz I am not aware of?
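On the "bgz should be fully compatible" point: BGZF files are valid multi-member gzip streams, so plain gzip tooling can decompress them end-to-end; what it cannot do is the block-level random access. A quick stdlib sketch of the multi-member property (emulating blocks with ordinary gzip members):

```python
import gzip

# Two independently compressed members, concatenated - structurally
# what a BGZF file is (real BGZF blocks add extra header fields).
payload = gzip.compress(b">seq1\nACGT\n") + gzip.compress(b"TTTT\n")

# The stdlib reads straight across member boundaries.
assert gzip.decompress(payload) == b">seq1\nACGT\nTTTT\n"
```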