samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
634 stars 241 forks source link

Difference between --atom-overlaps and --multi-overlaps #2142

Closed alexvasilikop closed 2 months ago

alexvasilikop commented 3 months ago

Could someone briefly explain the difference between these two options?

Thanks a lot

pd3 commented 3 months ago

It's a bit confusing, isn't it.

The option --multi-overlaps is tied to the --multiallelics option. This one splits multiallelic sites into biallelic sites, for example the row AA > A,CC will be split into two rows AA>A and AA>CC.

On the other hand, the option --atom-overlaps is ties to the --atomize option. This one breaks a complex allele into its atomic constituents, for example the row AA>CC will be split into two rows A>C and A>C.

In both cases the program has to decide what to put in when heterozygous genotypes are split. For the --atomize case, see the example in the manual page

    # Before atomization:
    100  CC  C,GG   1/2

    # After:
    #   bcftools norm --atomize --atom-overlaps .
    100  C   G      ./1
    100  CC  C      1/.
    101  C   G      ./1

    # After:
    #   bcftools norm --atomize --atom-overlaps '*'
    #   bcftools norm --atomize --atom-overlaps\*
    100  C   G,*    2/1
    100  CC  C,*    1/2
    101  C   G,*    2/1

Analogously for the --multiallelics case.

jkbonfield commented 3 months ago

As an aside, what does . mean in a single GT field? The specification impies everyt GT is element is numeric or every element is ., but doesn't allude to partial dots. If this has a meaning, then it's not explicitly listed in the specification. Logically extending it, I'd guess that ./1 is something the genotype is a mix of something unknown and ALT 1. How is that helpful though? It could be 0/1 or 1/1, yet we know the original data was 1/2 so 1/1 is an impossibility.

Also, why is atomization not using the phasing | notation? We know the two "C G" lines are phased as they started as CC,GG.

alexvasilikop commented 3 months ago

I assume that the dot then means something unknown but by definition different than 1 since otherwise it would not appear as a variant. From what I understand when splitting multiple alleles then default option is to use another alternative allele in the biallelic site and not the reference for comparison.

In any case: it seems that multi-allelics --multi-overlaps is the option I am looking for ( I am looking to perform both splitting (of any type of ALT variants when they exist) into single .ALT variants and subsequently normalization. Would this be the options below. It is not clear to me that if splitting is done then normalization would also be performed.

bcftools sort -O v "!{sample_id}.single_sample.vcf" | \
 bcftools norm -m-any --multi-overlaps 0 --fasta-ref "!{ref_name}" --threads "!{num_threads}" -O v - | \
 bcftools sort -O v -o "${OUTPUT}

Another question, what is the difference between "any" and "both" in the -m option? Is "any' splitting any type of variants whereas "both" only if they are both of the same type?

pd3 commented 2 months ago

Variants are normalized after splitting. (Note there were some improvements in the latest release 1.20 to handle simultaneous atomization and splitting of multiallelics.)

Regarding 'both' and 'any', it is described in the manual page http://samtools.github.io/bcftools/bcftools.html#norm

indels .. only indel records should be split or merged
snps .. only SNV records 
both .. SNVs and indels should be merged separately into two records
any .. this is for merging only, SNVs and indels should be merged into a single record