samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
633 stars 240 forks source link

MNV not being atomized by bcftools norm #2014

Closed gevro closed 5 months ago

gevro commented 8 months ago

Hi, I think I found a bug in 'bcftools norm -a'. Using bcftools v1.18

Original record via bcftools view:

chr1    7338028 .   AAT *,TAT   533.04  .   AC=1,1;AF=0.5,0.5;AN=2;DP=25;ExcessHet=3.0103;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=29.61;SOR=1.179 GT:AD:DP:GQ:PL  1/2:0,7,11:18:99:787,543,521,257,0,276

After doing bcftools norm -m -both:

chr1    7338028 .   AAT *   533.04  .   AC=1;AF=0.5;AN=2;DP=25;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.61;SOR=1.179 GT:AD:DP:GQ:PL  1/0:0,7:18:99:787,543,521
chr1    7338028 .   AAT TAT 533.04  .   AC=1;AF=0.5;AN=2;DP=25;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.61;SOR=1.179 GT:AD:DP:GQ:PL  0/1:0,11:18:99:787,257,276

After doing bcftools norm -a -m -both:

chr1    7338028 .   AAT *   533.04  .   AC=1;AF=0.5;AN=2;DP=25;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.61;SOR=1.179 GT:AD:DP:GQ:PL  1/0:0,7:18:99:787,543,521
chr1    7338028 .   AAT TAT 533.04  .   AC=1;AF=0.5;AN=2;DP=25;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.61;SOR=1.179 GT:AD:DP:GQ:PL  0/1:0,11:18:99:787,257,276

As you can see the AAT>TAT MNV should have become A>T. Or perhaps I'm not understanding something fundamental about MNV records.

Thanks

pd3 commented 8 months ago

It actually is not a MNV record, it is a SNV, only the first base is edited. What the program seems not to do is to split multiallelic sites, atomize and normalize in the same step. That could be improved. As a workaround for now, it is possible to stream through another round of normalization which will clean up cases like this.

gevro commented 8 months ago

Thanks. Yes running first bcftools norm -m -both and then after that a separate bcftools norm -a fixes it. There is some bug that when using norm -a -m -both, it doesn't atomize that variant.

gevro commented 8 months ago

Hi, After trying the serial bcftools norm -m -both -> then separate bcftools norm -a, I'm finding another weird edge case:

Original record:

chr1    429024  .   CGCGGAGG    C,G 27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/2:26:34:3,20,10:0.588235,0.294118:27,34,34,34,0,47

After bcftools norm -m -both

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34
chr1    429024  .   CGCGGAGG    G   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:34:3,10:0.294118:27,34,47

After bcftools norm -m -both | bcftools norm -a:

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34
chr1    429024  .   C   G,* 27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.
chr1    429024  .   CGCGGAGG    C,* 27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.

=> Note, I don't understand why 'norm -a' has atomized the two variants produced by 'norm -m -both'? They are two simple deletions. They should not have been atomized. I think this is the source of the bug. 'norm -a' is acting on variants it should not be acting on.

After bcftools norm -m -both | bcftools norm -m -both -a:

chr1    429024  .   C   G   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/0:26:34:3,10:0.294118:27,34,47
chr1    429024  .   C   *   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:26:34:3,.:.:27,.,.
chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:26:34:3,30:0.588235:27,34,34
chr1    429024  .   CGCGGAGG    *   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/0:26:34:3,.:.:27,.,.

Note that bcftools v1.14 seems to process these differently. After bcftools norm -m -both | bcftools norm -a:

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34
chr1    429024  .   CGCGGAGG    *   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:34:3,.:.:27,.,.
chr1    429024  .   CGCGGAGG    G   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:34:3,10:0.294118:27,34,47
chr1    429024  .   CGCGGAGG    *   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,.:.:27,.,.

In fact, I think the final output of bcftools v1.18 shown above is not correct and v1.14 is correct? Likely there is a bug somewhere.

gevro commented 6 months ago

Hello, I'm curious if this issue is going to be fixed at some point? Thanks!

pd3 commented 6 months ago

This is actually not a bug. The second command sees two variants, a simple deletion CGCGGAGG>C and a combined event CGCGGAGG>G. The latter is interpreted as a SNV C>G combined with the deletion CGCGGAGG>C, which is not incorrect.

This case would not happen if you provided the reference sequence via the -f option. Then the first command would produce

chr1 429023 xCGCGGAG    x   
chr1 429024 CGCGGAGG    C   

and the atomization would end with that.

gevro commented 6 months ago

Thanks - makes sense. However, there is still the issue where running the -m -both and -a commands in the same bcftools norm 'bcftools norm -a -m -both' doesn't produce the right output. See original post above. Is there a fix planned for that?

gevro commented 6 months ago

Just to illustrate the issue with bcftools 1.19 - trying to do -m -both and -a in the same command doesn't seem to work regardless of whether -f fasta is specified.

Original record:

chr1    429024  .   CGCGGAGG    C,G 27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/2:26:34:3,20,10:0.588235,0.294118:27,34,34,34,0,47

bcftools norm -m -both without -f fasta

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34
chr1    429024  .   CGCGGAGG    G   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:26:34:3,10:0.294118:27,34,47

bcftools norm -m -both with -f fasta

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34

bcftools norm -m -both -a (all in one command) without -f fasta

chr1    429024  .   C   G   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/0:26:34:3,10:0.294118:27,34,47
chr1    429024  .   C   *   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:26:34:3,.:.:27,.,.
chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:26:34:3,30:0.588235:27,34,34
chr1    429024  .   CGCGGAGG    *   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/0:26:34:3,.:.:27,.,.

bcftools norm -m -both -a (all in one command) with -f fasta

chr1    429024  .   C   G   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/0:26:34:3,10:0.294118:27,34,47
chr1    429024  .   C   *   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:26:34:3,.:.:27,.,.
chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:26:34:3,30:0.588235:27,34,34
chr1    429024  .   CGCGGAGG    *   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/0:26:34:3,.:.:27,.,.

bcftools norm -m -both without -f fasta | bcftools norm -a

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34
chr1    429024  .   C   G,* 27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.

bcftools norm -m -both with -f fasta | bcftools norm -a

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34

bcftools norm -m -both without -f fasta | bcftools norm -a with fasta

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34
chr1    429024  .   C   G,* 27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.
chr1    429024  .   CGCGGAGG    C,* 27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.

bcftools norm -m -both with -f fasta | bcftools norm -a with fasta

chr1    429024  .   CGCGGAGG    C   27.9    PASS    .   GT:GQ:DP:AD:VAF:PL  1/0:26:34:3,20:0.588235:27,34,34
gevro commented 6 months ago

Just to further help, I am including a more organized summary of all the possible norm commands, with or without -f fasta, for the above two examples. You can see that something weird happens when -m -both and -a are both used in the same command, as opposed to separate commands.

Regardless, it does seem that running norm -m -both -f fasta only (, i.e. without norm -a) produces the most sensible result for these two examples.

Example 1:

#Original records

chr1    7338026 AAAAT   A,ATATAAAT  403.02
chr1    7338028 AAT *,TAT   533.04

#norm -m -both
chr1    7338026 AAAAT   A   403.02
chr1    7338026 AAAAT   ATATAAAT    403.02
chr1    7338028 AAT *   533.04
chr1    7338028 AAT TAT 533.04

#norm -m -both -f fasta
chr1    7338026 AAAAT   A   403.02
chr1    7338026 A   ATAT    403.02
chr1    7338028 AAT *   533.04
chr1    7338028 A   T   533.04

#norm -m -both -a
chr1    7338026 A   ATAT    403.02
chr1    7338026 A   *   403.02
chr1    7338026 AAAAT   A   403.02
chr1    7338026 AAAAT   *   403.02
chr1    7338028 AAT *   533.04
chr1    7338028 AAT TAT 533.04

#norm -m -both -a -f fasta
chr1    7338026 A   ATAT    403.02
chr1    7338026 A   *   403.02
chr1    7338026 AAAAT   A   403.02
chr1    7338026 AAAAT   *   403.02
chr1    7338028 AAT *   533.04
chr1    7338028 A   T   533.04

#norm -m -both | norm -a
chr1    7338026 AAAAT   A   403.02
chr1    7338026 A   ATAT    403.02
chr1    7338028 AAT *   533.04
chr1    7338028 A   T   533.04

#norm -m -both -f fasta | norm -a
chr1    7338026 AAAAT   A   403.02
chr1    7338026 A   ATAT    403.02
chr1    7338028 AAT *   533.04
chr1    7338028 A   T   533.04

#norm -m -both | norm -a -f fasta
chr1    7338026 AAAAT   A   403.02
chr1    7338026 A   ATAT    403.02
chr1    7338028 AAT *   533.04
chr1    7338028 A   T   533.04

#norm -m -both -f fasta | norm -a -f fasta
chr1    7338026 AAAAT   A   403.02
chr1    7338026 A   ATAT    403.02
chr1    7338028 AAT *   533.04
chr1    7338028 A   T   533.04

SUMMARY of Example 1 Group 1 results - All these produce the same results:

norm -m -both | norm -a

norm -m -both -f fasta | norm -a

norm -m -both | norm -a -f fasta

norm -m -both -f fasta | norm -a -f fasta

norm -m -both -f fasta

=> Adding -f fasta to -m -both gives the same result as adding norm -a as a separate command with or without -f fasta

Group 2 results: Running -m -both -a (i.e. in the same command), regardless of -f fasta, splits this: chr1 7338026 AAAAT A,ATATAAAT 403.02

into 4 records: chr1 7338026 A ATAT 403.02 chr1 7338026 A 403.02 chr1 7338026 AAAAT A 403.02 chr1 7338026 AAAAT 403.02

In contrast to Group 1 results: chr1 7338026 AAAAT A 403.02 chr1 7338026 A ATAT 403.02

Example 2:

#Original record
chr1    429024  CGCGGAGG    C,G 27.9

#norm -m -both
chr1    429024  CGCGGAGG    C   27.9
chr1    429024  CGCGGAGG    G   27.9

#norm -m -both -f fasta
chr1    429022  GGCGCGGA    G   27.9
chr1    429024  CGCGGAGG    C   27.9

#norm -m -both -a
chr1    429024  C   G   27.9
chr1    429024  C   *   27.9
chr1    429024  CGCGGAGG    C   27.9
chr1    429024  CGCGGAGG    *   27.9

#norm -m -both -a -f fasta
chr1    429024  C   G   27.9
chr1    429024  C   *   27.9
chr1    429024  CGCGGAGG    C   27.9
chr1    429024  CGCGGAGG    *   27.9

#norm -m -both | norm -a
chr1    429024  CGCGGAGG    C   27.9
chr1    429024  C   G,* 27.9
chr1    429024  CGCGGAGG    C,* 27.9

#norm -m -both -f fasta | norm -a
chr1    429022  GGCGCGGA    G   27.9
chr1    429024  CGCGGAGG    C   27.9

#norm -m -both | norm -a -f fasta
chr1    429024  CGCGGAGG    C   27.9
chr1    429024  C   G,* 27.9
chr1    429024  CGCGGAGG    C,* 27.9

#norm -m -both -f fasta | norm -a -f fasta
chr1    429022  GGCGCGGA    G   27.9
chr1    429024  CGCGGAGG    C   27.9

SUMMARY of Example 2 Group 1 results - All these produce the same results:

norm -m -both

norm -m -both -f fasta

norm -m -both -f fasta | norm -a

norm -m -both -f fasta | norm -a -f fasta

chr1 429022 GGCGCGGA G 27.9 chr1 429024 CGCGGAGG C 27.9

Group 2 results - All these produce the same results

norm -m -both | norm -a

norm -m -both | norm -a -f fasta

chr1 429024 CGCGGAGG C 27.9 chr1 429024 C G, 27.9 chr1 429024 CGCGGAGG C, 27.9

Group 3 results - All these produce the same results

norm -m -both -a

norm -m -both -a -f fasta

chr1 429024 C G 27.9 chr1 429024 C 27.9 chr1 429024 CGCGGAGG C 27.9 chr1 429024 CGCGGAGG 27.9

pd3 commented 5 months ago

The result depends on the order of the -a, -m operations, which is expected. The atomization works on single lines, so it will work differently for a single multiallelic site vs for the same site split into multiple biallelic records. In your examples you should take into consideration also genotype fields, then the spanning * alleles would start making more sense.

I pushed a commit which changes the default order of the operations to "split first, then atomize" which makes the output look simpler and "nicer" for a typical user, however, not more correct. The original behavior can be enforced by explicit streaming of bcftools norm -a ... | bcftools norm -m.

Also, looking into this I found bugs in handling Number=A,R,G fields with incorrect number of values. This is now fixed, a missing value . will be used instead.