mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
449 stars 75 forks source link

Implement fsspec in place of open #168

Closed hardingnj closed 2 years ago

hardingnj commented 4 years ago

Hi, as discussed in #165 here is a PR leveraging fsspec.

The code changes are straightforward. Though I am having trouble getting tests to pass. Any pointers would be appreciated.

The other thing that might be worth considering is the use of Bio.bgzf. I think that reading of gz files can be done directly, as bgz should be fully compatible. It might be a case of using fsspec.open() with the compression set as "auto". Although there might be a use of bgz I am not aware of?

mdshw5 commented 4 years ago

Thanks for working on this @hardingnj! I'll take a look and see why tests aren't passing. The use of BGZF is meant to allow block-based access to gzip compressed files (https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html). The Bio.bgzf module is in fact using the python gzip module, but with extra logic to allow construction of virtual offsets into these compressed blocks.

hardingnj commented 4 years ago

Hi @mdshw5 ,

Have added a bit more work. I still have test failures, but these I think are expected failures that might be handled ok in fsspec.

Also added an __array__ method to FastaRecord, which I think is the preferred way of creating numpy arrays.

mdshw5 commented 4 years ago

Thanks. The test that is failing is due to python's universal newline support. pyfaidx needs to account for the number of line terminator characters to properly build its index (https://github.com/mdshw5/pyfaidx/commit/f314b595af6cba52828f8ed756bf5abffcca6629). The solution is to revert to open the file in binary mode so we can count the line terminator characters.

mdshw5 commented 4 years ago

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

hardingnj commented 4 years ago

Thanks. The test that is failing is due to python's universal newline support. pyfaidx needs to account for the number of line terminator characters to properly build its index (f314b59). The solution is to revert to open the file in binary mode so we can count the line terminator characters.

Ahah- I thought it was something to do with that... though I didn't understand why you expect different line lengths depending on the presence of windows line endings.

hardingnj commented 4 years ago

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?

mdshw5 commented 4 years ago

It seems like the Bio.bgzf module may be able to be registered using fsspec: https://github.com/intake/filesystem_spec/blob/master/fsspec/compression.py#L19. This will be necessary - and I'm actually confused about why my BGZF tests are currently passing without it.

I'm not sure why it's necessary though? We're not compressing the file, just constructing the index from blocks, unless I am misunderstanding something?

It's necessary to be able to seek past all of the 64k blocks that don't contain the sequence you need, and also seek to the correct position within the block of interest. @peterjc did a fantastic job documenting his module, and the section on file handle offsets is more than worth your time to read. Thanks for your contributions here - I do think the fsspec module makes sense but I do want to make sure to incorporate BGZF support properly.

I've also looked more closely at the test results and my BGZF tests are indeed failing without the Bio.bgzf module. I've got some work in #164 that will make the BGZF support more robust and improve performance as well.

hardingnj commented 4 years ago

Thanks @mdshw5 . Of course- very happy to make the effort to integrate properly. I'll need to read a bit further into bgzf first though

manzt commented 2 years ago

I came across this PR yesterday when exploring if some of our API's could be changed to support fsspec file-like objects. It would be desirable to allow a users to pass their own fsspec.OpenFile if possible and only create a new one in __init__ if a string is provided. E.g.

    def __init__(self, filename, ...):
       if isinstance(filename, str):
         self.file = fsspec.open(filename)
       else:
         self.file = filename # fsspec.OpenFile

The advantage of this choice is that users can configure any file object they'd like (auth, caching, etc). It's important to note that fsspec.OpenFile objects don't actually hold an low-level file object but are containers. Low-level file objects are only created when the with context manger is used:

with self.file as f:
  f # now low-level file object is created

This means it's OK to pass around these objects rather than an opener function. In addition, the Bio.bgzf.BgzfReader can be configured explicitly to wrap a file object via bgzf.BgzfReader(fileobj=fileobj) so useage could look something like:

with self.file as f:
  if self._bgzf:
      f = bgzf.BgzfReader(fileobj=f)
  ...

I tried to implement something on my own yesterday, but ran into some issues with how to handle the indexfile.

mdshw5 commented 2 years ago

@manzt Thanks for the detailed explanation and the example code. I have to admit that I did not understand fsspec but am now a bit more clear about why this is useful. If you want to contribute any code, even if it doesn't fully work I would be glad to collaborate.

hardingnj commented 2 years ago

Thanks @manzt and @mdshw5 , I really think this is worthwhile- happy to help develop the PR to address the above issues.

peterjc commented 2 years ago

Speaking for the Bio/bgzf.py code in Biopython, I'm open to (hopefully backward compatible) changes, but as yet am unfamiliar with fsspec. It may be only changes here in pyfaidx are needed.

manzt commented 2 years ago

It may be only changes here in pyfaidx are needed.

On first approximation, I don't think there will be any required changes to Bio/bgzf.py since BgzfReader can take a file-like object instead of a path.