Closed gevro closed 5 months ago
It actually is not a MNV record, it is a SNV, only the first base is edited. What the program seems not to do is to split multiallelic sites, atomize and normalize in the same step. That could be improved. As a workaround for now, it is possible to stream through another round of normalization which will clean up cases like this.
Thanks. Yes running first bcftools norm -m -both and then after that a separate bcftools norm -a fixes it. There is some bug that when using norm -a -m -both, it doesn't atomize that variant.
Hi, After trying the serial bcftools norm -m -both -> then separate bcftools norm -a, I'm finding another weird edge case:
Original record:
chr1 429024 . CGCGGAGG C,G 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/2:26:34:3,20,10:0.588235,0.294118:27,34,34,34,0,47
After bcftools norm -m -both
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
chr1 429024 . CGCGGAGG G 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/1:26:34:3,10:0.294118:27,34,47
After bcftools norm -m -both | bcftools norm -a:
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
chr1 429024 . C G,* 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.
chr1 429024 . CGCGGAGG C,* 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.
=> Note, I don't understand why 'norm -a' has atomized the two variants produced by 'norm -m -both'? They are two simple deletions. They should not have been atomized. I think this is the source of the bug. 'norm -a' is acting on variants it should not be acting on.
After bcftools norm -m -both | bcftools norm -m -both -a:
chr1 429024 . C G 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/0:26:34:3,10:0.294118:27,34,47
chr1 429024 . C * 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/1:26:34:3,.:.:27,.,.
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/1:26:34:3,30:0.588235:27,34,34
chr1 429024 . CGCGGAGG * 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/0:26:34:3,.:.:27,.,.
Note that bcftools v1.14 seems to process these differently. After bcftools norm -m -both | bcftools norm -a:
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
chr1 429024 . CGCGGAGG * 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/1:26:34:3,.:.:27,.,.
chr1 429024 . CGCGGAGG G 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/1:26:34:3,10:0.294118:27,34,47
chr1 429024 . CGCGGAGG * 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,.:.:27,.,.
In fact, I think the final output of bcftools v1.18 shown above is not correct and v1.14 is correct? Likely there is a bug somewhere.
Hello, I'm curious if this issue is going to be fixed at some point? Thanks!
This is actually not a bug. The second command sees two variants, a simple deletion CGCGGAGG>C
and a combined event CGCGGAGG>G
. The latter is interpreted as a SNV C>G
combined with the deletion CGCGGAGG>C
, which is not incorrect.
This case would not happen if you provided the reference sequence via the -f
option. Then the first command would produce
chr1 429023 xCGCGGAG x
chr1 429024 CGCGGAGG C
and the atomization would end with that.
Thanks - makes sense. However, there is still the issue where running the -m -both and -a commands in the same bcftools norm 'bcftools norm -a -m -both' doesn't produce the right output. See original post above. Is there a fix planned for that?
Just to illustrate the issue with bcftools 1.19 - trying to do -m -both and -a in the same command doesn't seem to work regardless of whether -f fasta is specified.
Original record:
chr1 429024 . CGCGGAGG C,G 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/2:26:34:3,20,10:0.588235,0.294118:27,34,34,34,0,47
bcftools norm -m -both without -f fasta
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
chr1 429024 . CGCGGAGG G 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/1:26:34:3,10:0.294118:27,34,47
bcftools norm -m -both with -f fasta
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
bcftools norm -m -both -a (all in one command) without -f fasta
chr1 429024 . C G 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/0:26:34:3,10:0.294118:27,34,47
chr1 429024 . C * 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/1:26:34:3,.:.:27,.,.
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/1:26:34:3,30:0.588235:27,34,34
chr1 429024 . CGCGGAGG * 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/0:26:34:3,.:.:27,.,.
bcftools norm -m -both -a (all in one command) with -f fasta
chr1 429024 . C G 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/0:26:34:3,10:0.294118:27,34,47
chr1 429024 . C * 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/1:26:34:3,.:.:27,.,.
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/1:26:34:3,30:0.588235:27,34,34
chr1 429024 . CGCGGAGG * 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/0:26:34:3,.:.:27,.,.
bcftools norm -m -both without -f fasta | bcftools norm -a
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
chr1 429024 . C G,* 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.
bcftools norm -m -both with -f fasta | bcftools norm -a
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
bcftools norm -m -both without -f fasta | bcftools norm -a with fasta
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
chr1 429024 . C G,* 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.
chr1 429024 . CGCGGAGG C,* 27.9 PASS . GT:GQ:DP:AD:VAF:PL 0/2:26:34:3,10,.:0.294118,.:27,34,47,.,.,.
bcftools norm -m -both with -f fasta | bcftools norm -a with fasta
chr1 429024 . CGCGGAGG C 27.9 PASS . GT:GQ:DP:AD:VAF:PL 1/0:26:34:3,20:0.588235:27,34,34
Just to further help, I am including a more organized summary of all the possible norm commands, with or without -f fasta, for the above two examples. You can see that something weird happens when -m -both and -a are both used in the same command, as opposed to separate commands.
Regardless, it does seem that running norm -m -both -f fasta only (, i.e. without norm -a) produces the most sensible result for these two examples.
#Original records
chr1 7338026 AAAAT A,ATATAAAT 403.02
chr1 7338028 AAT *,TAT 533.04
#norm -m -both
chr1 7338026 AAAAT A 403.02
chr1 7338026 AAAAT ATATAAAT 403.02
chr1 7338028 AAT * 533.04
chr1 7338028 AAT TAT 533.04
#norm -m -both -f fasta
chr1 7338026 AAAAT A 403.02
chr1 7338026 A ATAT 403.02
chr1 7338028 AAT * 533.04
chr1 7338028 A T 533.04
#norm -m -both -a
chr1 7338026 A ATAT 403.02
chr1 7338026 A * 403.02
chr1 7338026 AAAAT A 403.02
chr1 7338026 AAAAT * 403.02
chr1 7338028 AAT * 533.04
chr1 7338028 AAT TAT 533.04
#norm -m -both -a -f fasta
chr1 7338026 A ATAT 403.02
chr1 7338026 A * 403.02
chr1 7338026 AAAAT A 403.02
chr1 7338026 AAAAT * 403.02
chr1 7338028 AAT * 533.04
chr1 7338028 A T 533.04
#norm -m -both | norm -a
chr1 7338026 AAAAT A 403.02
chr1 7338026 A ATAT 403.02
chr1 7338028 AAT * 533.04
chr1 7338028 A T 533.04
#norm -m -both -f fasta | norm -a
chr1 7338026 AAAAT A 403.02
chr1 7338026 A ATAT 403.02
chr1 7338028 AAT * 533.04
chr1 7338028 A T 533.04
#norm -m -both | norm -a -f fasta
chr1 7338026 AAAAT A 403.02
chr1 7338026 A ATAT 403.02
chr1 7338028 AAT * 533.04
chr1 7338028 A T 533.04
#norm -m -both -f fasta | norm -a -f fasta
chr1 7338026 AAAAT A 403.02
chr1 7338026 A ATAT 403.02
chr1 7338028 AAT * 533.04
chr1 7338028 A T 533.04
SUMMARY of Example 1 Group 1 results - All these produce the same results:
=> Adding -f fasta to -m -both gives the same result as adding norm -a as a separate command with or without -f fasta
Group 2 results: Running -m -both -a (i.e. in the same command), regardless of -f fasta, splits this: chr1 7338026 AAAAT A,ATATAAAT 403.02
into 4 records: chr1 7338026 A ATAT 403.02 chr1 7338026 A 403.02 chr1 7338026 AAAAT A 403.02 chr1 7338026 AAAAT 403.02
In contrast to Group 1 results: chr1 7338026 AAAAT A 403.02 chr1 7338026 A ATAT 403.02
#Original record
chr1 429024 CGCGGAGG C,G 27.9
#norm -m -both
chr1 429024 CGCGGAGG C 27.9
chr1 429024 CGCGGAGG G 27.9
#norm -m -both -f fasta
chr1 429022 GGCGCGGA G 27.9
chr1 429024 CGCGGAGG C 27.9
#norm -m -both -a
chr1 429024 C G 27.9
chr1 429024 C * 27.9
chr1 429024 CGCGGAGG C 27.9
chr1 429024 CGCGGAGG * 27.9
#norm -m -both -a -f fasta
chr1 429024 C G 27.9
chr1 429024 C * 27.9
chr1 429024 CGCGGAGG C 27.9
chr1 429024 CGCGGAGG * 27.9
#norm -m -both | norm -a
chr1 429024 CGCGGAGG C 27.9
chr1 429024 C G,* 27.9
chr1 429024 CGCGGAGG C,* 27.9
#norm -m -both -f fasta | norm -a
chr1 429022 GGCGCGGA G 27.9
chr1 429024 CGCGGAGG C 27.9
#norm -m -both | norm -a -f fasta
chr1 429024 CGCGGAGG C 27.9
chr1 429024 C G,* 27.9
chr1 429024 CGCGGAGG C,* 27.9
#norm -m -both -f fasta | norm -a -f fasta
chr1 429022 GGCGCGGA G 27.9
chr1 429024 CGCGGAGG C 27.9
SUMMARY of Example 2 Group 1 results - All these produce the same results:
chr1 429022 GGCGCGGA G 27.9 chr1 429024 CGCGGAGG C 27.9
Group 2 results - All these produce the same results
chr1 429024 CGCGGAGG C 27.9 chr1 429024 C G, 27.9 chr1 429024 CGCGGAGG C, 27.9
Group 3 results - All these produce the same results
chr1 429024 C G 27.9 chr1 429024 C 27.9 chr1 429024 CGCGGAGG C 27.9 chr1 429024 CGCGGAGG 27.9
The result depends on the order of the -a, -m operations, which is expected. The atomization works on single lines, so it will work differently for a single multiallelic site vs for the same site split into multiple biallelic records. In your examples you should take into consideration also genotype fields, then the spanning *
alleles would start making more sense.
I pushed a commit which changes the default order of the operations to "split first, then atomize" which makes the output look simpler and "nicer" for a typical user, however, not more correct. The original behavior can be enforced by explicit streaming of bcftools norm -a ... | bcftools norm -m
.
Also, looking into this I found bugs in handling Number=A,R,G fields with incorrect number of values. This is now fixed, a missing value .
will be used instead.
Hi, I think I found a bug in 'bcftools norm -a'. Using bcftools v1.18
Original record via bcftools view:
After doing bcftools norm -m -both:
After doing bcftools norm -a -m -both:
As you can see the AAT>TAT MNV should have become A>T. Or perhaps I'm not understanding something fundamental about MNV records.
Thanks