replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

IUPAC code #170

Closed hoelzer closed 1 year ago

hoelzer commented 2 years ago

I think we had this discussion before: right now, if we discover a mixed variant call (e.g. 50% reference nucleotide and 50% alternative nucleotide), poreCov (or, more precisely, the ARTIC workflow) writes an "N" into the consensus.

This is basically fine, but we could also check which mixed nucleotide combination was detected and then add the respective IUPAC character to the consensus:

E.g., let's say the reference has a G but we detect ~50% T. Then we could write a "K" (the IUPAC code for G or T) in the consensus, to at least indicate that the base is not a C or an A.
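To make the G/T to "K" example concrete, here is a minimal sketch of the lookup. The table holds the standard IUPAC nucleotide ambiguity codes; the helper name `iupac_code` is hypothetical and not part of poreCov or the ARTIC workflow.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC_CODES = {
    frozenset("A"): "A",
    frozenset("C"): "C",
    frozenset("G"): "G",
    frozenset("T"): "T",
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
    frozenset("CGT"): "B",
    frozenset("AGT"): "D",
    frozenset("ACT"): "H",
    frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(bases):
    """Return the IUPAC character for the observed bases (hypothetical helper)."""
    key = frozenset(base.upper() for base in bases)
    # Anything not in the table (empty input, gaps, unknown symbols) falls back to N.
    return IUPAC_CODES.get(key, "N")


print(iupac_code("GT"))  # reference G plus alternative T
```

A "K" at that position rules out A and C, which an "N" does not.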

I think the default ARTIC pipeline does not support that and just writes N? But maybe there was some update at some point.

Here is an example from an actual BAM provided by poreCov and where an "N" was written:

[image: screenshot of the BAM]

Not sure how easy it is to integrate that, because we would likely need some cutoffs for when to call a variant a mixed variant.
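As a rough sketch of what such cutoffs could look like: the function below takes per-base read counts at one position and returns the bases that pass a minimum depth and a minimum allele fraction. The function name and both thresholds (20 reads, 25%) are assumptions for illustration only, not values used by poreCov or ARTIC.

```python
def supported_alleles(counts, min_depth=20, min_fraction=0.25):
    """Return the bases supported at one position (hypothetical helper).

    counts: dict mapping base -> number of reads showing that base.
    min_depth and min_fraction are illustrative cutoffs, not ARTIC defaults.
    """
    depth = sum(counts.values())
    if depth == 0 or depth < min_depth:
        return []  # too little coverage: the position would stay masked as N
    # Keep every base whose share of the reads reaches the cutoff.
    return sorted(base for base, n in counts.items() if n / depth >= min_fraction)


print(supported_alleles({"G": 52, "T": 48}))  # two bases pass: a mixed call
print(supported_alleles({"G": 95, "T": 5}))   # only G passes: a plain call
print(supported_alleles({"G": 5, "T": 5}))    # depth below cutoff: empty list
```

One base returned means a normal call, two or more means a mixed call that could get an IUPAC character, and an empty list means the position stays "N".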

If we decide to add this, I could ask around who can implement it or at least check if this is feasible within the ARTIC wf.

replikation commented 2 years ago

I think there is a discussion about this in the ARTIC GitHub repository.

hoelzer commented 2 years ago

You mean here? https://github.com/artic-network/fieldbioinformatics/issues I can't find anything on this right away.

That's the correct repository, right? It also doesn't seem to have had many updates recently.

replikation commented 2 years ago

yes this one

replikation commented 2 years ago

The current approach of ARTIC is (I think) "conservative", so it masks such regions as N. I think it might be more useful to, e.g., extract some of the calls ARTIC makes into a separate process that provides more info on mixed variant calls without compromising the main workflow.

oliverdrechsel commented 2 years ago

Covpipe uses bcftools to produce one N-masked consensus and one consensus with IUPAC ambiguity codes. As ARTIC already provides the former, the latter could also be produced in poreCov, couldn't it? This requires a VCF with the filtered variants. I think this is provided in one of the many 'work' folders. BTW, could poreCov output the final VCF that was used to generate the consensus?

hoelzer commented 2 years ago

@oliverdrechsel yeah, I think the ARTIC workflow also uses bcftools internally. But it's not that easy, because we first need to figure out where the actual consensus generation happens (https://github.com/artic-network/fieldbioinformatics/tree/1.3.0-dev/artic) and then see if we can basically copy and paste the IUPAC consensus generation from covpipe.

But I also agree that having the final VCF file as an output would be helpful - and from this file, it should be relatively easy to get the IUPAC consensus as well.
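To illustrate how a final VCF could be turned into an IUPAC consensus, here is a minimal pure-Python sketch. It is not what covpipe or ARTIC do internally; bcftools offers this through the `--iupac-codes` option of `bcftools consensus`. The sketch assumes that mixed sites are encoded as heterozygous genotypes (`0/1`) in a single-sample VCF, which may not hold for the VCFs ARTIC produces, and it only handles SNVs. The function name `iupac_consensus` is hypothetical.

```python
# IUPAC codes for two-base mixtures.
IUPAC_HET = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def iupac_consensus(reference, vcf_lines):
    """Apply SNVs from single-sample VCF lines to a reference (hypothetical helper).

    Assumption: a mixed site is a heterozygous genotype (0/1),
    and a fully supported variant is 1/1 (or haploid 1).
    """
    seq = list(reference)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alt = int(fields[1]), fields[3].upper(), fields[4].upper()
        if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
            continue  # this sketch ignores indels and multi-allelic records
        keys = fields[8].split(":")
        values = fields[9].split(":")
        genotype = values[keys.index("GT")].replace("|", "/")
        alleles = set(genotype.split("/"))
        if alleles == {"1"}:
            seq[pos - 1] = alt  # VCF positions are 1-based
        elif alleles == {"0", "1"}:
            seq[pos - 1] = IUPAC_HET.get(frozenset((ref, alt)), "N")
    return "".join(seq)


vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "ref\t3\t.\tG\tT\t50\tPASS\t.\tGT\t0/1",
    "ref\t5\t.\tG\tA\t50\tPASS\t.\tGT\t1/1",
]
print(iupac_consensus("ACGTG", vcf))  # ACKTA
```

In the toy input, the mixed G/T site at position 3 becomes "K" and the homozygous variant at position 5 becomes "A".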

Also, on a side note: interestingly, Nanopolish calls the "T" here instead of the reference "G". I wonder whether the additional integration of the FAST5 signal data improves the variant calling and thus finally leads to calling the "T" instead of the reference allele, assuming this call is correct. At least from the screenshot, we can see that there are reads with the "T".