Closed sadams2013 closed 1 year ago
Thanks for the PR! A couple of quick comments. This is from code inspection, I didn't try anything, so I could be missing something.
Wig files are used for all sorts of data, not just coverage, so I suggest renaming "coverage.py" to "wig.py", or something generic.
igv.js supports a "track" line
The wig parser doesn't look complete. See http://rohsdb.cmb.usc.edu/goldenPath/help/wiggle.html
It doesn't appear that "fixed step" is implemented. This option has a start and step size, in addition to an optional "span". It looks like a fixed step wig will be interpreted as a variable step
the end position is not accounted for in "slice". in particular span is not accounted for. This will affect wig features whose start position is < region start but which overlap due to "span".
I don't think you can avoid parsing the wig lines to account for variable & fixed step, and the optional "span" variable. You might find this helpful (skip to line 397): https://github.com/igvteam/igv.js/blob/master/js/feature/decode/ucsc.js.
I think extending "FeatureReader" to handle wig files might be a better approach. To do this you would add a "parse" method in feature.py similar to the igv.js javascript referenced above. See the parse_bed
method for an example The parse_wig
method would return a Feature with chr, start, end, and line => line being the literal line of text parsed. One advantage to this, other than fitting in with the existing framework, is an interval tree is built for the features to support fast slicing. In the implementation here a linear search is done through the features for each chromosome for every region. With a large wig file, and many regions, this could get slow.
Thank you for reviewing!
I agree - the implementation is lacking for several use cases. My approach was a bit myopic in looking to support my own (narrow) use case.
I will work through your suggestions and recommit.
Notes for recent commit:
The initial parse will still be slow for very large files - I'm not sure of a way around that; but this should be reasonable for slices. Perhaps a better implementation would be tabix indexed bedgraph files.
My use case is pretty narrow, for which this is perfect. No hard feelings if you'd prefer not to merge.
OK looks good overall. I'll try to get to it this week, if I don't ping this conversation, it will just mean something else distracted me.
Bedgraph is a good idea, that could be just simple since really its just a bed3+1 file. Also I should probably add bigwig support. But this is a good start.
@sadams2013 Apologies I got slammed by unexpected issues, although I don't know why I don't expect them, but anyway I didn't get to this. On the list for next week, please ping me if it's not merged by Friday.
This looks good. Merging.
I found 1 bug, there was a logic error that caused the entire wig file to be read on every call to "slice". Fixed now, see commit #99223383 We can never have a "region2" without "region"
Great catch - thank you for fixing and merging!
Thank you for the PR. I learned some new Python. You might have noticed its not my first language, I have to google every 3rd line but am getting better. I did not know about the possibility of multiple return values.
This is now released to pypi as 1.7.0
I found the code base to be very well written and easy to understand.
Thank you for merging and releasing!
This adds support for wiggle coverage tracks (WIG files) by adding: