Closed hoelzer closed 1 year ago
I'm checking right now... not 100% sure what's the final VCF from the pipeline, there are at least:
work/f8/8c124a99d5b5b3cc92011d1f3d1b33/IMSSC2-91-2022-00035.fail.vcf
work/f8/8c124a99d5b5b3cc92011d1f3d1b33/IMSSC2-91-2022-00035.merged.vcf.gz
work/f8/8c124a99d5b5b3cc92011d1f3d1b33/IMSSC2-91-2022-00035.pass.vcf
In the pass.vcf
I still see duplicated entries for the two different pools, such as
MN908947.3 10447 . G A 53.555 PASS DP=117.0;DPS=53.0,64.0;Pool=2 GT:GQ 1:54
MN908947.3 10447 . G A 43.141 PASS DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ 1:43
MN908947.3 10449 . C A 50.316 PASS DP=117.0;DPS=53.0,64.0;Pool=2 GT:GQ 1:50
MN908947.3 10449 . C A 40.963 PASS DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ 1:41
so I thought they might be solved in the merged.vcf.gz
but still here we have basically the same
MN908947.3 10447 . G A 53.555 PASS DP=117.0;DPS=53.0,64.0;Pool=2 GT:GQ 1:54
MN908947.3 10447 . G A 43.141 PASS DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ 1:43
MN908947.3 10449 . C A 50.316 PASS DP=117.0;DPS=53.0,64.0;Pool=2 GT:GQ 1:50
MN908947.3 10449 . C A 40.963 PASS DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ 1:41
So I'm right now not entirely sure if the pass
or merged
is the more up to date one we should output for downstream analysis.
Is the merged.vcf
ever used in the workflow? Or is all later steps then based on the SNP*passed.vcf
? If so, I vote for the passed.vcf ;)
not sure is it -> merge (both primer pools) then separating into pass and fail?
Ah, this could be also the case. I would say the SNP*passed.vcf
as you initially suggested makes sense. But I'm not 100% sure that's the one that is then used to reconstruct the consensus. @MarieLataretu did you look into this more closely when you were grinding through the guts of the ARTIC wf?
I stumbled across that when I looked at the nanopolish/medaka difference. But I don't remember the file content; I just look through the code again:
merged
(https://github.com/artic-network/fieldbioinformatics/blob/master/artic/minion.py#L227-L231) is separated into pass
and fail
(https://github.com/artic-network/fieldbioinformatics/blob/master/artic/minion.py#L254)
From the ARTIC SOP:
samplename.merged.vcf - all detected variants in VCF format samplename.pass.vcf - detected variants in VCF format passing quality filter samplename.fail.vcf - detected variants in VCF format failing quality filter
SNP
is the pass
vcf from ARTIC and renamed in the ARTIC wf of poreCov
thx @MarieLataretu
alright, then I'd say this is correct and could be used for additional analyses, visualizations, ...
this file?
Just curious how tools will handle the different information from the two pools.
And just remembered that we also use the pass VCF for CoVarPlot already:
https://github.com/replikation/poreCov/blob/master/workflows/process/artic.nf#L14
But I'm not 100% sure that's the one that is then used to reconstruct the consensus.
It's like that:
medaka consensus
, medaka variant
ah k! But then after step 3) one still has something like
MN908947.3 10447 . G A 53.555 PASS DP=117.0;DPS=53.0,64.0;Pool=2 GT:GQ 1:54
MN908947.3 10447 . G A 43.141 PASS DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ 1:43
in the pass.vcf
. So two entries for the same position. And I just wonder how the information is then combined into the consensus. E.g. lets say pool 1 says "yes" and pool 2 says "no" : )
A good question for bcftools consensus
:sweat_smile:
The strict
parameter from ARTIC would remove such cases from the vcf (I think): https://github.com/artic-network/fieldbioinformatics/blob/master/artic/minion.py#L234
:) Ah yeah, that might be. Should be also super rare, but possible (edge-case-martin on investigation).
But we shift, so for this issue it seems fine to publish the SNP*.passed.vcf
and to use this in subsequent processes and visuals. Then we will see if we run into issues w/ the two entries for same positions. Thanks for the discussion!
For many down-stream analyses, it would be neat to also have the VCF file available. I think this should be easy to publish the VCF file that is finally also used to reconstruct the consensus, for example also in the
2.Genomes
folder?