replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0
39 stars 16 forks source link

Update medaka in artic #250

Closed hoelzer closed 1 year ago

hoelzer commented 1 year ago

Solves #247

WIP - tests need to be run.

hoelzer commented 1 year ago
nextflow run replikation/poreCov -r update-medaka-in-artic -profile slurm,singularity --krakendb GRCh38.p13_SC2_2022-03-01.tar.gz --cachedir /singularity/ --update --fastq_pass fastq_pass --samples samplesheet.csv --primerV V4.1 --medaka_model r941_min_hac_g507 -resume
[87/93c492] process > artic_ncov_wf:artic_medaka (17)                      [100%] 24 of 24 ✔

🤯

hoelzer commented 1 year ago

Works for me also w/ --medaka_model r1041_e82_400bps_sup_g615

Unfortunately, I can not test this branch w/ --primerV V5.3.2_400 - can I somehow merge that in here @replikation @DataSpott ?

Then we could compare here the results between

1) --primerV V5.3.2_400 w/ old ARTIC container and medaka old model 2) --primerV V5.3.2_400 w/ new ARTIC container and medaka model r1041_e82_400bps_sup_g615

hoelzer commented 1 year ago

I injected the matching primer scheme via

--primerV poreCov/data/external_primer_schemes/nCoV-2019/V5.3.2_400/nCoV-2019.primer.bed

and it looks really good.

I get, from a birds eye view, the same mutations as with the old ARTIC container but we're using now medaka 1.7.2 and have access to the new R1041 etc.. models.

So comparing R9 model and now R10 w/ new container via md5 checksums of the final genome FASTAS I only see differences for four samples in this run. For one, the new model repaired a frame shift - perfect. For the other three we will look via pairwise whole genome aln and then report here

hoelzer commented 1 year ago

Update: for the three sequences w/ different md5 checksum (V5 primers, R9 in old container vs R10 in new container), we see that the difference is only a single base.

Either the old container w/ R9 calls an N at a single position instead of the reference/alternative base compared to the new container w/ R10 or vice versa:

image

The below sub-image shows the mapped reads. Apparently, an A should be called instead of the reference G. The old container with R9 model does this, apparently correctly. In contrast, the new container with R10 model seems undecided and inserts an N at the end of the ARTIC workflow.

For the top sub-image it's another position in the genome and the other way around. Old container and R9 calls an N, while new container and R10 calls a base.

hoelzer commented 1 year ago

These are the only differences we discovered so far. Quite minor in my eyes. And bc/ we basecalled this run w/ R10.4.1* model, it makes also fully sense to me to use that model in Medaka. Even though there are slight differences that "look better" with an older model such as R9. However, basecalling w/ R10 model and then analyses with R9 model is also not really justifiable.

replikation commented 1 year ago

@DataSpott if everything is fine on your end we can merge

DataSpott commented 1 year ago

Tested it with one of our routine runs (starting from fastq-pass) and could only find in one sample a difference. this was only one more ambiguous base with the old container compared to the new container (1112 vs 1113 ambiguous bases). So on my end the run worked fine. Tested with commit "db05fa8".