replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

Classification strategy after initial read filtering #167

Closed hoelzer closed 2 years ago

hoelzer commented 2 years ago

Hi,

this just popped up based on a recent run:

Also, if we just minimap2 the NC reads (unfiltered) to SARS-CoV-2, we hardly get any hits.

Thus, it seems that some "junk" is just assigned to SARS-CoV-2 via kraken2.

It's also a bit confusing that the filtered read FASTQs are always used later in the pipeline, while the contamination percentages in the report are based on the unfiltered files.

Thus, proposal: move the kraken2 step after the length filter.

What do you think? I can also help w/ the implementation change if you are fine w/ such a PR.
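The proposed reordering can be sketched as follows. This is a minimal Python illustration, not poreCov code: `filter_by_length` and `classify` are hypothetical stand-ins for the ARTIC read-length filter and kraken2, and the length thresholds and labels are made up for the example.

```python
# Sketch: length-filter first, then classify, so the reported
# contamination counts reflect the reads the pipeline actually uses.
# All names/thresholds here are illustrative, not poreCov internals.

def filter_by_length(reads, min_len=400, max_len=700):
    """Stand-in for the ARTIC read-length filter (amplicon-sized reads)."""
    return [r for r in reads if min_len <= len(r["seq"]) <= max_len]

def classify(reads):
    """Stand-in for kraken2: count reads per (pre-assigned) label."""
    counts = {}
    for r in reads:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return counts

reads = [
    {"seq": "A" * 500, "label": "SARS-CoV-2"},
    {"seq": "A" * 120, "label": "SARS-CoV-2"},   # short "junk" read
    {"seq": "A" * 450, "label": "Homo sapiens"},
]

# Current order: counts are computed on the unfiltered reads,
# so the short junk read inflates the SARS-CoV-2 count.
print(classify(reads))                     # {'SARS-CoV-2': 2, 'Homo sapiens': 1}
# Proposed order: counts match what the rest of the pipeline sees.
print(classify(filter_by_length(reads)))   # {'SARS-CoV-2': 1, 'Homo sapiens': 1}
```

With the filter applied first, the report percentages and the downstream inputs are computed from the same read set.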

replikation commented 2 years ago

yeah makes sense to add this i think. feel free to do a PR :)

MarieLataretu commented 2 years ago

Hi there, I'm on it and I was just wondering: kraken2 just classifies the reads, right? The non-SARS-CoV-2 reads are not filtered out, and the reconstruction is done on the whole set of reads (fastq_input_ch)? (see https://github.com/replikation/poreCov/blob/master/poreCov.nf#L355-L368)

replikation commented 2 years ago

hi @MarieLataretu, read filtering is currently done within the artic workflow here

by this process

MarieLataretu commented 2 years ago

Yes, that's the read length filter. But what about the reads that are not classified as SARS-CoV-2? They are also used in the reconstruction, right?

So it's not: kraken2 -> take only SARS-CoV-2 reads -> reconstruction

Edit: I think above was a 'not' missing, sorry
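The two strategies being distinguished here can be sketched in Python. This is an illustration only: the function names and read labels are hypothetical, not actual poreCov code.

```python
# Sketch of the two strategies discussed in this thread;
# names are illustrative, not poreCov internals.

def classify_only(reads):
    """poreCov-style: kraken2 only reports counts; every read
    (SC2 or not) continues into the reconstruction."""
    counts = {}
    for r in reads:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return reads, counts  # the full read set is passed on

def classify_and_filter(reads, keep="SARS-CoV-2"):
    """The alternative: kraken2 -> take only SC2 reads -> reconstruction."""
    return [r for r in reads if r["label"] == keep]

reads = [
    {"id": "r1", "label": "SARS-CoV-2"},
    {"id": "r2", "label": "Homo sapiens"},
    {"id": "r3", "label": "SARS-CoV-2"},
]

kept, counts = classify_only(reads)
print(len(kept))                          # 3 - nothing is dropped
print(len(classify_and_filter(reads)))    # 2 - non-SC2 reads removed
```

poreCov follows the first pattern: the kraken2 counts only feed the report.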

MarieLataretu commented 2 years ago

Okay - just doublechecking :)

hoelzer commented 2 years ago

Yes, exactly. It's actually a bit different from other approaches (e.g. covpipe @Marie), where people just continue with the set of sequences that was classified as SC2. But what if the ref db gets outdated (looking at you, Delta...)?

Honestly, I also think it doesn't make much difference, especially for long reads. The reads are mapped to the (Wuhan) reference anyway, and a 400 nt human read will not map. Thus, we just added kraken2 on top of the general artic workflow to have these classification/contamination counts in the report, but we still throw all the reads into the artic workflow.

But good you double checked, and thx that you are looking into it already! ;)
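The "a 400 nt human read will not map" point can be illustrated with a quick k-mer sketch, under stated assumptions: a random 30 kb sequence stands in for the SC2 reference and a random 400 nt sequence for an unrelated (e.g. human) read; real minimap2 seeding is more sophisticated, but the shared-k-mer intuition is the same.

```python
import random

random.seed(0)  # deterministic toy data

def kmers(seq, k=31):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Toy "reference" (~30 kb random stand-in for the Wuhan reference)
ref = "".join(random.choice("ACGT") for _ in range(30000))

# A read drawn from the reference vs. an unrelated random 400 nt read.
sc2_read = ref[10000:10400]
human_read = "".join(random.choice("ACGT") for _ in range(400))

ref_kmers = kmers(ref)
print(len(kmers(sc2_read) & ref_kmers))    # many shared 31-mers -> maps
print(len(kmers(human_read) & ref_kmers))  # essentially none -> won't map
```

So even with all reads thrown into the artic workflow, unrelated reads drop out at the mapping step; kraken2 only adds the counts for the report.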


replikation commented 2 years ago

Also, kraken2 (as it needs more RAM) can fail in poreCov on low-end computers without compromising the whole pipeline; the user still gets a genome assembly + lineage.