rhpvorderman / sequali

Fast sequencing data quality metrics
GNU Affero General Public License v3.0

Make sequali paired-end aware #135

Closed rhpvorderman closed 2 months ago

rhpvorderman commented 3 months ago

I have gotten feedback from @marcelm and @Redmar-van-den-Berg that paired-end awareness for sequali would be a positive addition (see also issue #121). Just recently I got more than a thousand Illumina paired-end files to analyse for a researcher within our institute, and indeed, not having paired-end awareness is annoying.

In order to solve this problem, the following sub-problems need to be solved:

As usual, the technical parts are the least daunting, and I already have ideas for those. The report restructuring is going to be the challenging part. For the HTML report I am currently thinking of following fastp's example: publish a combined duplication section and simply output all the other modules once for each read. For the JSON report I have to keep MultiQC compatibility in mind, and I am still undecided on how best to do this. Feedback on either front is welcome.

Some notes on how I plan to attack the technical problems:

The paired-end-aware parser can be implemented with minimal modifications to the current parser. The current parser parses in blocks of 128KB. It checks this built-in 128KB limit and enlarges the block as needed so that it contains at least one record. This check can be changed to require at least X records. The plan is to have the R1 parser parse as normal, get the number of records it returned, and request that the R2 parser return the same number of records.
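To make the plan concrete, here is a minimal Python sketch of the idea. It is not sequali's actual parser. The function names (`split_records`, `fill`, `paired_blocks`) are made up for illustration, and it assumes plain four-line FASTQ records in files that end with a newline.

```python
BLOCK_SIZE = 128 * 1024  # the 128KB block size mentioned above


def split_records(buffer, limit=None):
    """Split off complete four-line records; return (records, leftover)."""
    records = []
    pos = 0
    while limit is None or len(records) < limit:
        end = pos
        for _ in range(4):
            end = buffer.find(b"\n", end) + 1
            if end == 0:  # fewer than four newlines left: incomplete record
                return records, buffer[pos:]
        records.append(buffer[pos:end])
        pos = end
    return records, buffer[pos:]


def fill(handle, buffer, min_records, block_size):
    """Enlarge the buffer block by block until it can hold min_records."""
    while buffer.count(b"\n") < 4 * min_records:
        chunk = handle.read(block_size)
        if not chunk:  # end of file
            break
        buffer += chunk
    return buffer


def paired_blocks(r1, r2, block_size=BLOCK_SIZE):
    """Yield (r1_records, r2_records) lists of equal length."""
    buf1 = buf2 = b""
    while True:
        # R1 parses as normal: one block, enlarged until it has >= 1 record.
        buf1 = fill(r1, buf1, 1, block_size)
        records1, buf1 = split_records(buf1)
        if not records1:
            if buf2 or r2.read(1):
                raise ValueError("R2 has more records than R1")
            return
        # R2 is asked for exactly as many records as R1 returned.
        buf2 = fill(r2, buf2, len(records1), block_size)
        records2, buf2 = split_records(buf2, limit=len(records1))
        if len(records2) != len(records1):
            raise ValueError("R2 has fewer records than R1")
        yield records1, records2
```

Because R2 is driven by the record count from R1, the two files stay in lockstep even when their read lengths differ, and a mismatch in record counts surfaces as an error rather than silently shifted pairs.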

Paired-end duplication testing is going to be relatively easy. Currently an algorithm is used that avoids the extremities of the sequence, as that works best for nanopore. For Illumina it is better to take the beginning of R1 and R2, as these represent both ends of the sequenced fragment and are the highest-fidelity parts of the reads.
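As a rough illustration of that idea (not the implementation in sequali), a pair can be fingerprinted by the first bases of R1 and R2. The prefix length of 16 and the names `pair_fingerprint` and `duplication_fraction` are assumptions made for this sketch only.

```python
from collections import Counter

PREFIX_LENGTH = 16  # hypothetical value, chosen only for this example


def pair_fingerprint(r1_seq, r2_seq, prefix_length=PREFIX_LENGTH):
    """Combine the start of R1 and the start of R2 into one key."""
    return r1_seq[:prefix_length] + "|" + r2_seq[:prefix_length]


def duplication_fraction(pairs):
    """Fraction of read pairs whose fingerprint was already seen."""
    counts = Counter(pair_fingerprint(r1, r2) for r1, r2 in pairs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total
```

Two pairs count as duplicates here only if both prefixes match exactly, so sequencing errors towards the ends of the reads do not affect the result.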

For the insert size metrics I plan to take the first and last 16 bp of the forward read and check for their presence in the reverse read. By hardcoding the parameters, this can be optimized pretty well, even when allowing a few substitution errors.
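One possible reading of this plan, written as an unoptimized Python sketch: reverse-complement R2, look for each 16 bp probe from R1 with a bounded number of substitutions, and derive the insert size from the match position. The function names, the limit of one mismatch, and the exact size arithmetic are assumptions for illustration and may differ from what sequali ends up doing.

```python
PROBE_LENGTH = 16
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_with_mismatches(probe, target, max_mismatches):
    """Return the first offset where probe matches target with at most
    max_mismatches substitutions, or -1 if there is no such offset."""
    for start in range(len(target) - len(probe) + 1):
        window = target[start:start + len(probe)]
        if sum(a != b for a, b in zip(probe, window)) <= max_mismatches:
            return start
    return -1


def estimate_insert_size(r1, r2, max_mismatches=1):
    """Estimate the insert size of a read pair, or None if no overlap is found."""
    if len(r1) < PROBE_LENGTH:
        return None
    r2_rc = reverse_complement(r2)  # same strand orientation as R1
    # The start of R1 is found in R2 when the insert is at most len(r2) long.
    pos = find_with_mismatches(r1[:PROBE_LENGTH], r2_rc, max_mismatches)
    if pos != -1:
        return len(r2) - pos
    # The end of R1 is found in R2 when the reads overlap only partially.
    pos = find_with_mismatches(r1[-PROBE_LENGTH:], r2_rc, max_mismatches)
    if pos != -1:
        return len(r1) - PROBE_LENGTH - pos + len(r2)
    return None
```

The first probe covers inserts shorter than the read (adapter read-through), the second covers inserts longer than the read where the mates still overlap. Pairs that do not overlap at all yield `None`.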

rhpvorderman commented 2 months ago

Paired-end data has been supported since sequali 0.8.0. The MultiQC integration has been merged; sequali should be supported from MultiQC 1.22 onwards.