initial version of spliced sequence retrieval

simonvh commented 6 years ago

Hi Matt, thanks for an extremely useful module. One feature that would be useful (at least to me) is the ability to retrieve spliced sequences from a FASTA file. Use-case would be to convert bed12 records to sequences, for instance. This PR is initial attempt at doing this. I would be happy to further discuss and adapt code/naming/style.

Some points:

Would an iterable of intervals be the right type of argument?
I personally prefer the option to get the reverse complement directly in one call. However, get_seq() doesn't have this. So at the moment this is not consistent.
Would it be worth adding this to the command line script?

mdshw5 commented 6 years ago

Thanks for the contribution. I like this idea very much, and it actually fits with some of my recent day-job work :).

Would an iterable of intervals be the right type of argument?

I think that an iterable of intervals makes sense. The spliced sequence should always be on the same contig, so I like your current interface.

I personally prefer the option to get the reverse complement directly in one call. However, get_seq() doesn't have this. So at the moment this is not consistent.

I'm okay with some inconsistency. We can add rc=False to get_seq() to keep things consistent - don't see anything wrong with that.

Would it be worth adding this to the command line script?

Certainly. I think we would need BED12 parsing to make this useful. I've also considered how to better integrate with gffutils. Currently there is a Feature.sequence method but that does not handle splicing of child introns in transcript feature types. I'm doing this GTF + FASTA -> transcriptome FASTA transformation quite a bit lately so moving the sequence splicing interface into pyfaidx would be helpful. Maybe @daler would have some thoughts about whether spliced sequence retrieval would be something appropriate for his module as well?

Let's get the tests passing and go from there!

simonvh commented 6 years ago

I see that there is one test failing relating to my work, I'll get fix that. However, this is not something that I did, right?

======================================================================

FAIL: Samtools BGZF index should be identical to pyfaidx BGZF index

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/mdshw5/pyfaidx/tests/test_Fasta_bgzip.py", line 48, in test_build_issue_126

    assert result_index == expect_index

AssertionError

I would be happy to add BED12 parsing. Not sure if there's actually a good (lightweight?) module that we could use there, otherwise I'll add it based on what I have for myself so far. It would indeed be great to have GFF/GTF support and if we could profit from @daler's work there.

mdshw5 commented 6 years ago

I think master was broken before your contribution. I was working on figuring our how samtools faidx works with BGZF compressed files, see #126. I abandoned the work and never came back :). I'll fix up master so the other tests pass.

mdshw5 commented 6 years ago

Alright, @simonvh. Thanks for the work on your end. After far too much messing around on my side, it look like master is passing tests as well. If you're okay with it I'll cut a new release with this PR merged. Just give a thumbs up if you're okay, otherwise I'll wait if you want to add anything else.

daler commented 6 years ago

@simonvh, @mdshw5 I'd love to have spliced sequence retrieval in gffutils. Like BED12 conversion (http://daler.github.io/gffutils/autodocs/gffutils.FeatureDB.bed12.html) it should actually be a method on FeatureDB rather than Feature. I'll add an issue over there to remind myself.

mdshw5 / pyfaidx

initial version of spliced sequence retrieval #127