Add support for wiggle coverage tracks.

sadams2013 commented 1 year ago

This adds support for wiggle coverage tracks (WIG files) by adding:

Entries for the '.wig' file extension.
A new reader class (WigReader) that parses and slices wig files.
Unit tests for the WigReader class.
Basic test data with two regions extracted from NIST data for NA12878 (genome in a bottle).

jrobinso commented 1 year ago

Thanks for the PR! A couple of quick comments. This is from code inspection, I didn't try anything, so I could be missing something.

Wig files are used for all sorts of data, not just coverage, so I suggest renaming "coverage.py" to "wig.py", or something generic.
igv.js supports a "track" line
The wig parser doesn't look complete. See http://rohsdb.cmb.usc.edu/goldenPath/help/wiggle.html
- It doesn't appear that "fixed step" is implemented. This option has a start and step size, in addition to an optional "span". It looks like a fixed step wig will be interpreted as a variable step
- the end position is not accounted for in "slice". in particular span is not accounted for. This will affect wig features whose start position is < region start but which overlap due to "span".

I don't think you can avoid parsing the wig lines to account for variable & fixed step, and the optional "span" variable. You might find this helpful (skip to line 397): https://github.com/igvteam/igv.js/blob/master/js/feature/decode/ucsc.js.

I think extending "FeatureReader" to handle wig files might be a better approach. To do this you would add a "parse" method in feature.py similar to the igv.js javascript referenced above. See the parse_bed method for an example The parse_wigmethod would return a Feature with chr, start, end, and line => line being the literal line of text parsed. One advantage to this, other than fitting in with the existing framework, is an interval tree is built for the features to support fast slicing. In the implementation here a linear search is done through the features for each chromosome for every region. With a large wig file, and many regions, this could get slow.

sadams2013 commented 1 year ago

Thank you for reviewing!

I agree - the implementation is lacking for several use cases. My approach was a bit myopic in looking to support my own (narrow) use case.

I will work through your suggestions and recommit.

sadams2013 commented 1 year ago

Notes for recent commit:

Renamed coverage to wig throughout
I agree that implementing a wig parser in the FeatureReader class would be more succinct. However, I found that handling span headers would require modification (i.e. for a wig file with multiple spans). Therefore, I retained and modified the WigReader class with the following:
- The parser addresses the full wig spec
- For each fixed or variable span, creates a FeatureTree with positions (properly accounting for start/end)
- Slices query all spans for matching regions with the FeatureTree query method
I added a new test case that accounts for wig files with fixed and variable span headers.
I had also missed the stream function that allows for reading from remote sources. That has been implemented within the wig parser.

The initial parse will still be slow for very large files - I'm not sure of a way around that; but this should be reasonable for slices. Perhaps a better implementation would be tabix indexed bedgraph files.

My use case is pretty narrow, for which this is perfect. No hard feelings if you'd prefer not to merge.

jrobinso commented 1 year ago

OK looks good overall. I'll try to get to it this week, if I don't ping this conversation, it will just mean something else distracted me.

Bedgraph is a good idea, that could be just simple since really its just a bed3+1 file. Also I should probably add bigwig support. But this is a good start.

jrobinso commented 1 year ago

@sadams2013 Apologies I got slammed by unexpected issues, although I don't know why I don't expect them, but anyway I didn't get to this. On the list for next week, please ping me if it's not merged by Friday.

jrobinso commented 1 year ago

This looks good. Merging.

jrobinso commented 1 year ago

I found 1 bug, there was a logic error that caused the entire wig file to be read on every call to "slice". Fixed now, see commit #99223383 We can never have a "region2" without "region"

sadams2013 commented 1 year ago

Great catch - thank you for fixing and merging!

jrobinso commented 1 year ago

Thank you for the PR. I learned some new Python. You might have noticed its not my first language, I have to google every 3rd line but am getting better. I did not know about the possibility of multiple return values.

jrobinso commented 1 year ago

This is now released to pypi as 1.7.0

sadams2013 commented 1 year ago

I found the code base to be very well written and easy to understand.

Thank you for merging and releasing!

igvteam / igv-reports

Add support for wiggle coverage tracks. #73