replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

Classification strategy after initial read filtering #167

Closed hoelzer closed 2 years ago

hoelzer commented 2 years ago

Hi,

this just popped up based on a recent run:

Also, if we just minimap2 the NC reads (unfiltered) to SARS-CoV-2, we hardly get any hits.

Thus, it seems that some "junk" is just assigned to SARS-CoV-2 via kraken2.

It's also a bit confusing that the filtered read FASTQs are always used later in the pipeline, while the contamination percentages in the report are based on the unfiltered files.

Thus, proposal: move the kraken2 step after the length filter.

What do you think? I can also help w/ the implementation change if you are fine w/ such a PR.
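The proposed reordering can be sketched as follows. This is a minimal Python illustration, not poreCov code: `filter_by_length` and `classify` are hypothetical stand-ins for the ARTIC read-length filter and kraken2, and the length thresholds and labels are made up for the example.

```python
# Sketch: length-filter first, then classify, so the reported
# contamination counts reflect the reads the pipeline actually uses.
# All names/thresholds here are illustrative, not poreCov internals.

def filter_by_length(reads, min_len=400, max_len=700):
    """Stand-in for the ARTIC read-length filter (amplicon-sized reads)."""
    return [r for r in reads if min_len <= len(r["seq"]) <= max_len]

def classify(reads):
    """Stand-in for kraken2: count reads per (pre-assigned) label."""
    counts = {}
    for r in reads:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return counts

reads = [
    {"seq": "A" * 500, "label": "SARS-CoV-2"},
    {"seq": "A" * 120, "label": "SARS-CoV-2"},   # short "junk" read
    {"seq": "A" * 450, "label": "Homo sapiens"},
]

# Current order: counts are computed on the unfiltered reads,
# so the short junk read inflates the SARS-CoV-2 count.
print(classify(reads))                     # {'SARS-CoV-2': 2, 'Homo sapiens': 1}
# Proposed order: counts match what the rest of the pipeline sees.
print(classify(filter_by_length(reads)))   # {'SARS-CoV-2': 1, 'Homo sapiens': 1}
```

With the filter applied first, the report percentages and the downstream inputs are computed from the same read set.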

replikation commented 2 years ago

yeah makes sense to add this i think. feel free to do a PR :)

MarieLataretu commented 2 years ago

Hi there, I'm on it and I was just wondering: kraken2 just classifies the reads, right? The non-SARS-CoV-2 reads are not filtered out, and the reconstruction is done on the whole set of reads (fastq_input_ch)? (see https://github.com/replikation/poreCov/blob/master/poreCov.nf#L355-L368)

replikation commented 2 years ago

hi @MarieLataretu, read filtering is currently done within the artic workflow here

by this process

MarieLataretu commented 2 years ago

Yes, that's the read length filter. But what about the reads that are not classified as SARS-CoV-2? They are also used in the reconstruction, right?

So it's not: kraken2 -> take only SARS-CoV-2 reads -> reconstruction

Edit: I think above was a 'not' missing, sorry
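The two strategies being distinguished here can be sketched in Python. This is an illustration only: the function names and read labels are hypothetical, not actual poreCov code.

```python
# Sketch of the two strategies discussed in this thread;
# names are illustrative, not poreCov internals.

def classify_only(reads):
    """poreCov-style: kraken2 only reports counts; every read
    (SC2 or not) continues into the reconstruction."""
    counts = {}
    for r in reads:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return reads, counts  # the full read set is passed on

def classify_and_filter(reads, keep="SARS-CoV-2"):
    """The alternative: kraken2 -> take only SC2 reads -> reconstruction."""
    return [r for r in reads if r["label"] == keep]

reads = [
    {"id": "r1", "label": "SARS-CoV-2"},
    {"id": "r2", "label": "Homo sapiens"},
    {"id": "r3", "label": "SARS-CoV-2"},
]

kept, counts = classify_only(reads)
print(len(kept))                          # 3 - nothing is dropped
print(len(classify_and_filter(reads)))    # 2 - non-SC2 reads removed
```

poreCov follows the first pattern: the kraken2 counts only feed the report.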

MarieLataretu commented 2 years ago

Okay - just doublechecking :)

hoelzer commented 2 years ago

Yes, exactly. It's actually a bit different from other approaches (e.g. covpipe @Marie), where people just continue with the set of sequences that was classified as SC2. But what if the ref db gets outdated (looking at you, Delta...)?

Honestly, I also think it doesn't make much difference, especially for long reads. The reads are mapped to the (Wuhan) reference anyway, and a 400 nt human read will not map. Thus, we just added kraken2 on top of the general artic workflow to have these classification/contamination counts in the report, but we still throw all the reads into the artic workflow.

But good you double checked, and thx that you are looking into it already! ;)
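The "a 400 nt human read will not map" point can be illustrated with a quick k-mer sketch, under stated assumptions: a random 30 kb sequence stands in for the SC2 reference and a random 400 nt sequence for an unrelated (e.g. human) read; real minimap2 seeding is more sophisticated, but the shared-k-mer intuition is the same.

```python
import random

random.seed(0)  # deterministic toy data

def kmers(seq, k=31):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Toy "reference" (~30 kb random stand-in for the Wuhan reference)
ref = "".join(random.choice("ACGT") for _ in range(30000))

# A read drawn from the reference vs. an unrelated random 400 nt read.
sc2_read = ref[10000:10400]
human_read = "".join(random.choice("ACGT") for _ in range(400))

ref_kmers = kmers(ref)
print(len(kmers(sc2_read) & ref_kmers))    # many shared 31-mers -> maps
print(len(kmers(human_read) & ref_kmers))  # essentially none -> won't map
```

So even with all reads thrown into the artic workflow, unrelated reads drop out at the mapping step; kraken2 only adds the counts for the report.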


replikation commented 2 years ago

Also, kraken2 (as it needs more RAM) can fail in poreCov on low-end computers without compromising the whole pipeline; the user still gets a genome assembly + lineage.