replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0
39 stars 16 forks source link

Possible filtering issue #218

Closed tb-fyr closed 1 year ago

tb-fyr commented 2 years ago

I've been having results lately that make my negatives appear to be contaminated (i.e., yield lineages). However, when I run these same samples through basecalling on our MinION MK1C, they are all clear. Further, I discovered that the filtering is adding reads to the barcodes. I checked barcode01 which is one of our negatives, and when I run porecov with barcode01.fastq.gz and barcode01-filtered.fastq.gz, the first one fails (as expected) while the latter yields a result. I noticed in the NanoPlot html (attached) that the filtered read added 30K reads to it somehow.

Can anyone else confirm this with their fastq files? barcod01-filtered-nanoplot barcode1_nanoplot

Command used

paste here the command used that created the error, e.g.

nextflow run replikation/poreCov -r 1.1.0 -profile local,docker \ --update --primerV V1200 \ --fast5 /pathtoourfast5dir/ \ --guppy_model dna_r9.4.1_450bps_fast.cfg \ --rapid \ --update \ --krakendb /pathtosetkrakendb/ \ --databases /pathtosetdatabases/ \ --work-Dir /pathtosetworkdir/ \ --memory 64 \ --cores 24 \ --medaka_model r941_min_fast_g507 \ --output /pathtooutputdir/ \ --samples /pathtosamples.csv \ --output results



## More Context
Computer specs if you are running locally (not in the cloud or a cluster)

> OPERATING_SYSTEM = Ubuntu 20.04 
> THREADS =  48 threads (24 core CPU)
> RAM = 64 GB RAM
>NVIDIA 1080TI GPU 8GB
replikation commented 2 years ago

thanks for reporting. currently, it's difficult to reproduce this on my end without your data as an example. e.g. different demultiplexing (barcode both ends, mid strand detection...) and basecalling settings (q score cutoffs - used model) might change already the number of reads within a fastq file

tb-fyr commented 2 years ago

Here are our two negative barcodes from one experiment but two separate versions of porecov basecalling (I was getting an error trying to add any more files) barcode01_0-11-0porecov.fastq.gz barcode01_1-1-0porecov.fastq.gz I use default params for demultiplexing using Midnight Kit (mid strand detection at 50) and default basecalling settings with guppy model: dna_r9.4.1_450bps_fast.cfg I've ran these both multiple times and I get the same result. Under 0.11.0 porecov, water doesn't yield any results or show up in summary report. Under 1.1.0, water shows as having omicron. That is where I compared the nanoplot and noticed the added reads.

I have confirmed all these same experiments in EPI2ME and in all our march runs, if I run the data under 0.11.0, our results are clean; but if I run them under 1.1.0 then I get different results. Taking the exact same samples and basecalling them on our MinION MK1C, everything is clean no matter what version of porecov.

Also, here are two screenshots from EPI2ME output for the same experiments the above fastq's come from (again 0.11.0 vs 1.1.0). I noticed with the 1.1.0 result (the second png image), that the distribution of coverage and N distribution seems weird to me, as compared to the clean 0.11.0 version. 0-11-0 1 1 0_epi2me

Thanks for any help you can provide!

replikation commented 2 years ago

Hi @tb-fyr ,

thanks for digging into this issue. I compared the releases from a code perspective and there are 2 changes that should affect this between the versions.

min length default settings

barcode mid strand detection

Code Source Comparision: https://github.com/replikation/poreCov/compare/0.11.0...1.1.0

tb-fyr commented 2 years ago

Hello, Adding the min and max length did clear up our negatives as expected and confirmed in our other analyses with same data so it appears that worked. However, it did result in less samples being recognized it appears. Our pass/fail QC summary was out of 80 samples, rather than 96 on our original 0.11.0 run. Also, out of curiosity what was the reasoning for changing read min length to 100 for all schemes?

I hope the mid strand detection parameter wouldn't be assigning wrong barcodes, as this is recommended setting from Oxford nanopore, however, during my troubleshooting I did look at the barcode summary for our negatives once and was identifying a few barcodes that weren't the negative barcodes (out of like a million reads though). I've heard barcode hopping is possible with oxford tech but that shouldn't be enough reads to change the result, I would hope...

Also, just to confirm my understanding, what determines that a sample doesn't get added to the report summary? I've been assuming it just didn't pass all QC filters? For example, you have 48 samples but the report summary shows 35/40.

replikation commented 2 years ago

As the whole "covid assembly" is just based around mapping we experienced a few false-positive samples due to barcode hopping. We discussed this in our manuscript too (we called it "sample or barcode bleeding") mostly attributed a barcode pool with at least one sample being overloaded (thus lots of "free" DNA) and at least one sample completely underloaded with DNA (lots of free barcodes). As 20x times coverage on the small genome is enough, a few thousand reads already get you a signal. The min read length reduction was included as was increasing for some experiments the coverage quite noticeable and thus improve genome reconstruction and subsequently, way more strains being recognized or "QC-valid" based on completeness. It seemed to improve results more generally speaking as i got quite a few emails about this :)

In the latest porecov version a read screen method is included via --screen_reads this tries to determine the pangolineage of every read. (its by default of as it takes some hours per sample). But you could use that to investigate how "clean" the barcodes you are using are. btw since the tool was developed for accurate short reads expect some noise depending on the accuracy of the basecaller you are using. for sup basecalled reads we identify other sars-cov lineages with around 0 - 1%. a "clean" sample should be at least 95 % of a lineage. results look like this:

sample  variant_group   proportion  mean    std_error
SARSCoV2_filtered   A.23.1  0   0.0 0.0
SARSCoV2_filtered   AV.1    0.02107885561393124 0.02151797843491451 0.005105415907809195
SARSCoV2_filtered   Alpha_B.1.1.7   0.008255186842329024    0.008313999542467122    0.0026089935944567525
SARSCoV2_filtered   B.1 4.677849064695906e-10   8.876766030813688e-10   1.2800009054288062e-09
SARSCoV2_filtered   B.1.1   1.1621285428771163e-10  5.754328413083417e-10   9.877091767211538e-10
SARSCoV2_filtered   B.1.1.318   0   0.0 0.0
SARSCoV2_filtered   B.1.177 0.7024952547613489  0.7044517999333074  0.020328715359556118
SARSCoV2_filtered   B.1.617.3   0   0.0 0.0
SARSCoV2_filtered   B.1.623 0   0.0 0.0
SARSCoV2_filtered   Beta_B.1.351    0.2556522779374451  0.2536694420134167  0.021667675821820425
SARSCoV2_filtered   Delta_B.1.617.2 0.0004445407879727177   0.00048448068661852194  0.0004949322336048902
SARSCoV2_filtered   Epsilon_B.1.427_429 0   0.0 0.0
SARSCoV2_filtered   Eta_B.1.525 0   2.0539993885939775e-11  9.01935044212674e-11
SARSCoV2_filtered   Gamma_P.1   0.0018766985950762869   0.0016466757712217091   0.002038137642675043
SARSCoV2_filtered   Iota_B.1.526    0   3.659593917460922e-10   6.485911094918895e-10
SARSCoV2_filtered   Kappa_B.1.617.1 0   7.167376434181761e-10   1.0773584061104731e-09
SARSCoV2_filtered   Lambda_C.37 0   0.0 0.0
SARSCoV2_filtered   Mu_B.1.621  0.0008949069846917494   0.0008868070308837934   0.0008862490897717564
SARSCoV2_filtered   Omicron_BA.1    0   3.184555144123045e-05   0.00022817950467739884
SARSCoV2_filtered   Omicron_BA.2    0.0038189007877782737   0.003559979687520783    0.0011243997285608296
SARSCoV2_filtered   Omicron_BA.3    0   9.689481363893597e-05   0.0005993846545283763
SARSCoV2_filtered   Theta_P.3   0.001851423019144237    0.0021062543590370003   0.0019484222501466196
SARSCoV2_filtered   Zeta_P.2    0.0036319880617552086   0.0032338856626401786   0.0028578754111950678

e.g. this sample has B.1.177 70% and Beta_B.1.351 25% might help to checkout whats going on in a sample.