Closed tb-fyr closed 1 year ago
thanks for reporting. currently, it's difficult to reproduce this on my end without your data as an example. e.g. different demultiplexing (barcode both ends, mid strand detection...) and basecalling settings (q score cutoffs - used model) might change already the number of reads within a fastq file
Here are our two negative barcodes from one experiment but two separate versions of porecov basecalling (I was getting an error trying to add any more files) barcode01_0-11-0porecov.fastq.gz barcode01_1-1-0porecov.fastq.gz I use default params for demultiplexing using Midnight Kit (mid strand detection at 50) and default basecalling settings with guppy model: dna_r9.4.1_450bps_fast.cfg I've ran these both multiple times and I get the same result. Under 0.11.0 porecov, water doesn't yield any results or show up in summary report. Under 1.1.0, water shows as having omicron. That is where I compared the nanoplot and noticed the added reads.
I have confirmed all these same experiments in EPI2ME and in all our march runs, if I run the data under 0.11.0, our results are clean; but if I run them under 1.1.0 then I get different results. Taking the exact same samples and basecalling them on our MinION MK1C, everything is clean no matter what version of porecov.
Also, here are two screenshots from EPI2ME output for the same experiments the above fastq's come from (again 0.11.0 vs 1.1.0). I noticed with the 1.1.0 result (the second png image), that the distribution of coverage and N distribution seems weird to me, as compared to the clean 0.11.0 version.
Thanks for any help you can provide!
Hi @tb-fyr ,
thanks for digging into this issue. I compared the releases from a code perspective and there are 2 changes that should affect this between the versions.
--minLength 500 --maxLength 1500
--detect_mid_strand_barcodes --min_score_mid_barcodes 50
Code Source Comparision: https://github.com/replikation/poreCov/compare/0.11.0...1.1.0
Hello, Adding the min and max length did clear up our negatives as expected and confirmed in our other analyses with same data so it appears that worked. However, it did result in less samples being recognized it appears. Our pass/fail QC summary was out of 80 samples, rather than 96 on our original 0.11.0 run. Also, out of curiosity what was the reasoning for changing read min length to 100 for all schemes?
I hope the mid strand detection parameter wouldn't be assigning wrong barcodes, as this is recommended setting from Oxford nanopore, however, during my troubleshooting I did look at the barcode summary for our negatives once and was identifying a few barcodes that weren't the negative barcodes (out of like a million reads though). I've heard barcode hopping is possible with oxford tech but that shouldn't be enough reads to change the result, I would hope...
Also, just to confirm my understanding, what determines that a sample doesn't get added to the report summary? I've been assuming it just didn't pass all QC filters? For example, you have 48 samples but the report summary shows 35/40.
As the whole "covid assembly" is just based around mapping we experienced a few false-positive samples due to barcode hopping. We discussed this in our manuscript too (we called it "sample or barcode bleeding") mostly attributed a barcode pool with at least one sample being overloaded (thus lots of "free" DNA) and at least one sample completely underloaded with DNA (lots of free barcodes). As 20x times coverage on the small genome is enough, a few thousand reads already get you a signal. The min read length reduction was included as was increasing for some experiments the coverage quite noticeable and thus improve genome reconstruction and subsequently, way more strains being recognized or "QC-valid" based on completeness. It seemed to improve results more generally speaking as i got quite a few emails about this :)
In the latest porecov version a read screen method is included via --screen_reads
this tries to determine the pangolineage of every read. (its by default of as it takes some hours per sample). But you could use that to investigate how "clean" the barcodes you are using are. btw since the tool was developed for accurate short reads expect some noise depending on the accuracy of the basecaller you are using. for sup basecalled reads we identify other sars-cov lineages with around 0 - 1%. a "clean" sample should be at least 95 % of a lineage. results look like this:
sample variant_group proportion mean std_error
SARSCoV2_filtered A.23.1 0 0.0 0.0
SARSCoV2_filtered AV.1 0.02107885561393124 0.02151797843491451 0.005105415907809195
SARSCoV2_filtered Alpha_B.1.1.7 0.008255186842329024 0.008313999542467122 0.0026089935944567525
SARSCoV2_filtered B.1 4.677849064695906e-10 8.876766030813688e-10 1.2800009054288062e-09
SARSCoV2_filtered B.1.1 1.1621285428771163e-10 5.754328413083417e-10 9.877091767211538e-10
SARSCoV2_filtered B.1.1.318 0 0.0 0.0
SARSCoV2_filtered B.1.177 0.7024952547613489 0.7044517999333074 0.020328715359556118
SARSCoV2_filtered B.1.617.3 0 0.0 0.0
SARSCoV2_filtered B.1.623 0 0.0 0.0
SARSCoV2_filtered Beta_B.1.351 0.2556522779374451 0.2536694420134167 0.021667675821820425
SARSCoV2_filtered Delta_B.1.617.2 0.0004445407879727177 0.00048448068661852194 0.0004949322336048902
SARSCoV2_filtered Epsilon_B.1.427_429 0 0.0 0.0
SARSCoV2_filtered Eta_B.1.525 0 2.0539993885939775e-11 9.01935044212674e-11
SARSCoV2_filtered Gamma_P.1 0.0018766985950762869 0.0016466757712217091 0.002038137642675043
SARSCoV2_filtered Iota_B.1.526 0 3.659593917460922e-10 6.485911094918895e-10
SARSCoV2_filtered Kappa_B.1.617.1 0 7.167376434181761e-10 1.0773584061104731e-09
SARSCoV2_filtered Lambda_C.37 0 0.0 0.0
SARSCoV2_filtered Mu_B.1.621 0.0008949069846917494 0.0008868070308837934 0.0008862490897717564
SARSCoV2_filtered Omicron_BA.1 0 3.184555144123045e-05 0.00022817950467739884
SARSCoV2_filtered Omicron_BA.2 0.0038189007877782737 0.003559979687520783 0.0011243997285608296
SARSCoV2_filtered Omicron_BA.3 0 9.689481363893597e-05 0.0005993846545283763
SARSCoV2_filtered Theta_P.3 0.001851423019144237 0.0021062543590370003 0.0019484222501466196
SARSCoV2_filtered Zeta_P.2 0.0036319880617552086 0.0032338856626401786 0.0028578754111950678
e.g. this sample has B.1.177 70% and Beta_B.1.351 25% might help to checkout whats going on in a sample.
I've been having results lately that make my negatives appear to be contaminated (i.e., yield lineages). However, when I run these same samples through basecalling on our MinION MK1C, they are all clear. Further, I discovered that the filtering is adding reads to the barcodes. I checked barcode01 which is one of our negatives, and when I run porecov with barcode01.fastq.gz and barcode01-filtered.fastq.gz, the first one fails (as expected) while the latter yields a result. I noticed in the NanoPlot html (attached) that the filtered read added 30K reads to it somehow.
Can anyone else confirm this with their fastq files?
![barcode1_nanoplot](https://user-images.githubusercontent.com/101884232/159032762-48433ae7-3994-45aa-ac3d-40da3a19256f.png)
Command used
paste here the command used that created the error, e.g.
nextflow run replikation/poreCov -r 1.1.0 -profile local,docker \ --update --primerV V1200 \ --fast5 /pathtoourfast5dir/ \ --guppy_model dna_r9.4.1_450bps_fast.cfg \ --rapid \ --update \ --krakendb /pathtosetkrakendb/ \ --databases /pathtosetdatabases/ \ --work-Dir /pathtosetworkdir/ \ --memory 64 \ --cores 24 \ --medaka_model r941_min_fast_g507 \ --output /pathtooutputdir/ \ --samples /pathtosamples.csv \ --output results