replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0
39 stars 16 forks source link

Report also VCF files #217

Closed hoelzer closed 1 year ago

hoelzer commented 2 years ago

For many down-stream analyses, it would be neat to also have the VCF file available. I think this should be easy to publish the VCF file that is finally also used to reconstruct the consensus, for example also in the 2.Genomes folder?

replikation commented 2 years ago

this file? https://github.com/replikation/poreCov/blob/e2f2e8f7ee4b7a803ea44f7cf38577cc0681a784/workflows/process/artic.nf#L13

hoelzer commented 2 years ago

I'm checking right now... not 100% sure what's the final VCF from the pipeline, there are at least:

work/f8/8c124a99d5b5b3cc92011d1f3d1b33/IMSSC2-91-2022-00035.fail.vcf
work/f8/8c124a99d5b5b3cc92011d1f3d1b33/IMSSC2-91-2022-00035.merged.vcf.gz
work/f8/8c124a99d5b5b3cc92011d1f3d1b33/IMSSC2-91-2022-00035.pass.vcf

In the pass.vcf I still see duplicated entries for the two different pools, such as

MN908947.3      10447   .       G       A       53.555  PASS    DP=117.0;DPS=53.0,64.0;Pool=2   GT:GQ   1:54
MN908947.3      10447   .       G       A       43.141  PASS    DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ   1:43
MN908947.3      10449   .       C       A       50.316  PASS    DP=117.0;DPS=53.0,64.0;Pool=2   GT:GQ   1:50
MN908947.3      10449   .       C       A       40.963  PASS    DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ   1:41

so I thought they might be solved in the merged.vcf.gz but still here we have basically the same

MN908947.3      10447   .       G       A       53.555  PASS    DP=117.0;DPS=53.0,64.0;Pool=2   GT:GQ   1:54
MN908947.3      10447   .       G       A       43.141  PASS    DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ   1:43
MN908947.3      10449   .       C       A       50.316  PASS    DP=117.0;DPS=53.0,64.0;Pool=2   GT:GQ   1:50
MN908947.3      10449   .       C       A       40.963  PASS    DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ   1:41

So I'm right now not entirely sure if the pass or merged is the more up to date one we should output for downstream analysis.

hoelzer commented 2 years ago

Is the merged.vcf ever used in the workflow? Or is all later steps then based on the SNP*passed.vcf? If so, I vote for the passed.vcf ;)

replikation commented 2 years ago

not sure is it -> merge (both primer pools) then separating into pass and fail?

hoelzer commented 2 years ago

Ah, this could be also the case. I would say the SNP*passed.vcf as you initially suggested makes sense. But I'm not 100% sure that's the one that is then used to reconstruct the consensus. @MarieLataretu did you look into this more closely when you were grinding through the guts of the ARTIC wf?

MarieLataretu commented 2 years ago

I stumbled across that when I looked at the nanopolish/medaka difference. But I don't remember the file content; I just look through the code again:

merged (https://github.com/artic-network/fieldbioinformatics/blob/master/artic/minion.py#L227-L231) is separated into pass and fail (https://github.com/artic-network/fieldbioinformatics/blob/master/artic/minion.py#L254)

From the ARTIC SOP:

samplename.merged.vcf - all detected variants in VCF format samplename.pass.vcf - detected variants in VCF format passing quality filter samplename.fail.vcf - detected variants in VCF format failing quality filter

SNP is the pass vcf from ARTIC and renamed in the ARTIC wf of poreCov

hoelzer commented 2 years ago

thx @MarieLataretu

alright, then I'd say this is correct and could be used for additional analyses, visualizations, ...

this file?

https://github.com/replikation/poreCov/blob/e2f2e8f7ee4b7a803ea44f7cf38577cc0681a784/workflows/process/artic.nf#L13

Just curious how tools will handle the different information from the two pools.

And just remembered that we also use the pass VCF for CoVarPlot already:

https://github.com/replikation/poreCov/blob/master/workflows/process/artic.nf#L14

MarieLataretu commented 2 years ago

But I'm not 100% sure that's the one that is then used to reconstruct the consensus.

It's like that:

  1. for each pool, do medaka consensus, medaka variant
  2. merge vcfs from all pools
  3. seperated merged into pass and fail
hoelzer commented 2 years ago

ah k! But then after step 3) one still has something like

MN908947.3      10447   .       G       A       53.555  PASS    DP=117.0;DPS=53.0,64.0;Pool=2   GT:GQ   1:54
MN908947.3      10447   .       G       A       43.141  PASS    DP=207.0;DPS=101.0,106.0;Pool=1 GT:GQ   1:43

in the pass.vcf. So two entries for the same position. And I just wonder how the information is then combined into the consensus. E.g. lets say pool 1 says "yes" and pool 2 says "no" : )

MarieLataretu commented 2 years ago

A good question for bcftools consensus :sweat_smile:

The strict parameter from ARTIC would remove such cases from the vcf (I think): https://github.com/artic-network/fieldbioinformatics/blob/master/artic/minion.py#L234

hoelzer commented 2 years ago

:) Ah yeah, that might be. Should be also super rare, but possible (edge-case-martin on investigation).

But we shift, so for this issue it seems fine to publish the SNP*.passed.vcf and to use this in subsequent processes and visuals. Then we will see if we run into issues w/ the two entries for same positions. Thanks for the discussion!