s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0

Length template option 3 #212

Closed joe-angell closed 4 years ago

joe-angell commented 4 years ago

This is closer to what I had in mind. The counter output is a bit lame, but it could be improved.

joe-angell commented 4 years ago

@dstreett what do you think of this approach?

joe-angell commented 4 years ago

@msettles @dstreett @samhunter This turned into a bigger refactor. I removed all the dynamic casts (except in the test code, where I'm not too concerned about them, even though most are unnecessary) and simplified the interface as much as I could. I also made writer_helper a class that uses a visitor pattern to avoid dynamic_cast.
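
For readers unfamiliar with the technique, here is a minimal sketch of how a visitor can replace dynamic_cast when writing different read types. Every name in it (`Read`, `ReadBase`, `SingleEnd`, `PairedEnd`, `ReadVisitor`, `WriterHelper`, `write_all`) and the tab-separated output format are assumptions made for illustration; this is not the code in this pull request.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified read record (not HTStream's actual class).
struct Read {
    std::string id;
    std::string seq;
};

class SingleEnd;
class PairedEnd;

// Visitor interface: one overload per concrete read type.
class ReadVisitor {
public:
    virtual ~ReadVisitor() = default;
    virtual void visit(const SingleEnd& se) = 0;
    virtual void visit(const PairedEnd& pe) = 0;
};

class ReadBase {
public:
    virtual ~ReadBase() = default;
    // Double dispatch: each subclass calls the overload for its own type.
    virtual void accept(ReadVisitor& v) const = 0;
};

class SingleEnd : public ReadBase {
public:
    explicit SingleEnd(Read r) : one_(std::move(r)) {}
    const Read& one() const { return one_; }
    void accept(ReadVisitor& v) const override { v.visit(*this); }
private:
    Read one_;
};

class PairedEnd : public ReadBase {
public:
    PairedEnd(Read r1, Read r2) : one_(std::move(r1)), two_(std::move(r2)) {}
    const Read& one() const { return one_; }
    const Read& two() const { return two_; }
    void accept(ReadVisitor& v) const override { v.visit(*this); }
private:
    Read one_;
    Read two_;
};

// A writer expressed as a visitor; it never needs dynamic_cast.
class WriterHelper : public ReadVisitor {
public:
    explicit WriterHelper(std::ostream& out) : out_(out) {}
    void visit(const SingleEnd& se) override {
        out_ << "SE\t" << se.one().id << "\t" << se.one().seq << "\n";
    }
    void visit(const PairedEnd& pe) override {
        out_ << "PE\t" << pe.one().id << "\t" << pe.one().seq << "\t"
             << pe.two().id << "\t" << pe.two().seq << "\n";
    }
    void write(const ReadBase& r) { r.accept(*this); }
private:
    std::ostream& out_;
};

// Writes a mixed collection of reads and returns the text produced.
std::string write_all(const std::vector<std::unique_ptr<ReadBase>>& reads) {
    std::ostringstream out;
    WriterHelper writer(out);
    for (const auto& r : reads) {
        assert(r != nullptr);
        writer.write(*r);
    }
    return out.str();
}
```

The trade-off in this sketch is that adding a new read type means adding one `visit` overload to the interface, so the compiler flags every writer that has not handled it yet, rather than a cast failing at run time.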

msettles commented 4 years ago

I've only had a quick look (I'm avoiding preparing for next week's workshop), but the changes seem pretty significant. If I'm reading this correctly, it generalizes the concept of a "read": a "Read" of size 1 is single-end, and a "Read" of size 2 must be paired-end. This would allow for more types of reads, à la PacBio, which can have more than 2 reads (here called subreads) per read, though they mean something a little different. An additional annotation on the read could specify whether the reads in a file are "paired" (ends of a fragment) or "subreads" (circular fragment sequencing).

joe-angell commented 4 years ago

I added a new interface to readbase that lets you iterate over the reads by way of a vector, so filtering all reads longer than 100 (or whatever) is easy to do. We still have Single and Paired end reads; that didn't change. Do you think these changes cause problems for implementing PacBio?
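
As a rough illustration of the vector-style access described above, the sketch below exposes the reads of a single-end or paired-end record through one accessor, so that a length check does not need to know which type it is looking at. The accessor name `get_reads`, the helper `has_read_longer_than`, and the class layout are hypothetical and are not taken from the actual readbase interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified read record.
struct Read {
    std::string id;
    std::string seq;
    std::size_t length() const { return seq.size(); }
};

class ReadBase {
public:
    virtual ~ReadBase() = default;
    // Assumed accessor: one entry for single-end, two for paired-end.
    virtual std::vector<const Read*> get_reads() const = 0;
};

class SingleEnd : public ReadBase {
public:
    explicit SingleEnd(Read r) : one_(std::move(r)) {}
    std::vector<const Read*> get_reads() const override { return {&one_}; }
private:
    Read one_;
};

class PairedEnd : public ReadBase {
public:
    PairedEnd(Read r1, Read r2) : one_(std::move(r1)), two_(std::move(r2)) {}
    std::vector<const Read*> get_reads() const override { return {&one_, &two_}; }
private:
    Read one_;
    Read two_;
};

// True if any read in the record is longer than max_len bases.
// The same loop serves both record types.
bool has_read_longer_than(const ReadBase& rb, std::size_t max_len) {
    for (const Read* r : rb.get_reads()) {
        assert(r != nullptr);
        if (r->length() > max_len) {
            return true;
        }
    }
    return false;
}
```

A filter built on this would call `has_read_longer_than(record, 100)` and keep or drop the record according to the application's policy.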

msettles commented 4 years ago

No. Instead, I think we could naturally extend the idea to something like Pacbio_subreads, so we would have single_end, paired end, and subreads, with each app applying the processing relevant to the type of read.
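
One way to picture this proposal is a record that carries a kind annotation next to its sequences, with processing that branches on the kind. The sketch below is only an illustration of the idea: `ReadKind`, `ReadSet`, `is_consistent`, and `reported_length` are invented names, and the rule chosen for each kind (total bases for paired ends, longest pass for subreads) is an assumption, not a decision made in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical annotation for what the sequences in a record represent.
enum class ReadKind { SingleEnd, PairedEnd, Subreads };

// Hypothetical record: a kind plus one or more sequences.
struct ReadSet {
    ReadKind kind;
    std::vector<std::string> seqs;
};

// Checks that the number of sequences matches the declared kind.
bool is_consistent(const ReadSet& rs) {
    switch (rs.kind) {
        case ReadKind::SingleEnd: return rs.seqs.size() == 1;
        case ReadKind::PairedEnd: return rs.seqs.size() == 2;
        case ReadKind::Subreads:  return !rs.seqs.empty();
    }
    return false;
}

// Example of kind-specific processing (illustrative rules only):
// single-end reports its length, paired-end reports the total bases of
// both ends, and subreads report the longest pass, since subreads are
// repeated passes over the same circular fragment.
std::size_t reported_length(const ReadSet& rs) {
    if (!is_consistent(rs)) {
        return 0;
    }
    switch (rs.kind) {
        case ReadKind::SingleEnd:
            return rs.seqs[0].size();
        case ReadKind::PairedEnd:
            return rs.seqs[0].size() + rs.seqs[1].size();
        case ReadKind::Subreads: {
            std::size_t longest = 0;
            for (const std::string& s : rs.seqs) {
                longest = std::max(longest, s.size());
            }
            return longest;
        }
    }
    return 0;
}
```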

joe-angell commented 4 years ago

That sounds reasonable. I need to read up on how they work, but I can help with that if you want.