samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
672 stars 240 forks source link

bcftools consensus: mark deletions with character #1381

Closed kpj closed 3 years ago

kpj commented 3 years ago

When using bcftools consensus to create a consensus sequence from a VCF file which contains deletions, these deletions do not appear (as expected).

For some applications, it would be preferable to mark the deletions with a character (e.g., -) instead of completely deleting them. This was already discussed (https://github.com/samtools/bcftools/issues/1170), however the described solution makes use of samtools depth and masked regions which does not always work (e.g., deletions can occur at non-zero depths).

Is there a way of directly marking deletions obtained from the VCF file with - in the consensus sequence?

pd3 commented 3 years ago

There is no direct way, but it would not be too difficult to add a new option. Before any effort is spent on this, can you please provide a more detailed motivation for the request? It also makes me wonder, what about insertions, why is such visualization useful for deletions but not for insertions?

kpj commented 3 years ago

In our efforts to combat SARS-CoV-2 in Switzerland, we are reporting sequences to GISAID. As the deletions are highly relevant in the current situation, the visualisation of those is of more importance than insertions. This functionality would be used in our pipeline for the analysis of NGS data of short viral genomes.

jkbonfield commented 3 years ago

It's a bit scary to be working off branches and I'm not recommending it, but this may be possible in the candidate "samtools consensus" command. The aim of this was purely to give consensus from aligned data without needing or indeed using a reference. Ie it's exactly what the data claims, rather than using the data to edit an external ref.

It needs the htslib Methylation branch (not used directly, but the consensus code has that capability so it's a dependency): https://github.com/samtools/htslib/pull/1132

and my Samtools consensus branch: https://github.com/jkbonfield/samtools/tree/consensus

It uses * for gaps.

If this seems to do what you want, I may be able to push it forwards to get something merged in properly. For example the consensus sub-command without the base modification code present.

pd3 commented 3 years ago

I just added new --mark-del, --mark-ins, and --mark-snv options. Please try it out and let me know in case of problems.

kpj commented 3 years ago

@pd3: Thank you so much for the quick implementation! I tried the new options on one of our samples, and the results look as expected!

I noticed that the new implementation of the --mark-snv parameter is very close to another feature which would be useful for us. Would it be possible to also use it like the --mask parameter, i.e. with a bed file which specifies regions to convert to lower/upper case? An elegant solution might be to allow multiple masking parameters with different "actions". That way, the call bcftools consensus --mask low_coverage.bed N --mask special_regions.bed lc ... would replace regions of low coverage with N and convert interesting regions to lower case. Our use case is that we like to mark bases within certain coverage ranges in the output consensus file with lower case letters, while all other bases are in upper case. If preferred, I could open another issue.

pd3 commented 3 years ago

Best if you could open a new issue so that it can be discussed separately. Thank you

kpj commented 3 years ago

Certainly, makes perfect sense!

Regarding this issue, everything seems to work nicely for me.

pd3 commented 3 years ago

Note that I just fixed a small bug in the new code, please make sure you do git pull && make to update

taprs commented 6 months ago

Dear bcftools team,

I would like to clarify the behavior of * ALT alleles with these flags. Would they be marked with the --mark-del character as well? I am asking because I noticed that in presence of multiple ALT alleles with upstream deletions in some samples (like ALT=A,*) the asterisk makes it to the consensus sequence regardless of --mark-del value — is this the expected behavior?