milkschen / chaintools

Utilities for the genomic chain format
MIT License
5 stars 2 forks source link

to_paf.py codes the CIGAR incorrectly #11

Open freeseek opened 11 months ago

freeseek commented 11 months ago

Generate a toy chain file between GRCh38 <-> T2T-CHM13v2.0:

$ echo -e "chain 2089 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 2482\n340 700 701\n159" > input.chain
$ cat input.chain
chain 2089 chr1 248956422 + 123985788 123986987 chr1 248387328 + 123418353 123419553 2482
340 700 701
159

Assuming I have GRCh38.fa and chm13v2.0.fasta, if I use chain2paf as follows I get:

$ chain2paf -i input.chain -f GRCh38.fa chm13v2.0.fasta
chr1    248387328   123418353   123419553   +   chr1    248956422   123985788   123986987   462 1200    255 cg:Z:5=1X18=1X36=1X9=1X12=1X15=1X38=1X18=2X1=1X47=1X24=1X6=1X44=1X4=1X46=1X1=701I700D2X2=1X4=1X1=1X8=1X22=1X18=2X7=1X2=1X3=1X2=1X6=2X17=1X12=1X13=1X2=2X11=1X8=

while if I use to_paf.py as follows I get:

$ to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta
chr1    248387328   123418353   123419553   +   chr1    248956422   123985788   123986987   9   709 255 cg:Z:5=1X18=1X36=1X9=1X12=1X15=1X38=1X18=2X1=1X47=1X24=1X6=1X44=1X4=1X46=1X1=700X1I2X2=1X4=1X1=1X8=1X22=1X18=2X7=1X2=1X3=1X2=1X6=2X17=1X12=1X13=1X2=2X11=1X8=   NM:i:738

The CIGAR outputs are actually different. If I want to highlight the CIGAR differences I get:

$ diff <(chain2paf -i input.chain -f GRCh38.fa chm13v2.0.fasta | cut -f13 | sed 's/\([MIDX=]\)/\1\n/g') <(to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta | cut -f13 | sed 's/\([MIDX=]\)/\1\n/g')
32,33c32,33
< 701I
< 700D
---
> 700X
> 1I

This is quite an issue because it causes chain information to be lost:

$ to_paf.py -c input.chain -t GRCh38.fa -q chm13v2.0.fasta > input.paf && paf2chain -i input.paf
chain   255 chr1    248956422   +   123985788   123986987   chr1    248387328   +   123418353   123419553   0
1040    0   1
159

which is different from the original chain file and it introduces a lot of matching bases which might not be matching at all.

Basically using to_paf.py all chain simultaneous gaps in both target and query are removed. This is actually what happened in the GRCh38 <-> T2T-CHM13v2.0 chains now distributed by UCSC and generated as explained here:

nextflow run main.nf --source GRCh38.fa --target chm13v2.0.fasta --outdir dir -profile local --aligner minimap2
python chaintools/src/split.py -c input.chain -o input-split.chain
python chaintools/src/to_paf.py -c input-split.chain -t GRCh38.fa -q chm13v2.0.fasta -o input-split.paf
awk '$1==$6' input-split.paf | rb break-paf --max-size 10000  | rb trim-paf -r | rb invert | rb trim-paf -r | rb invert > out.paf
paf2chain -i out.paf > out.chain
python chaintools/src/invert.py -c out.chain -o out_inverted.chain

This can be quickly verified as follows:

$ wget -O- http://hgdownload.cse.ucsc.edu/goldenPath/hs1/liftOver/hg38-chm13v2.over.chain.gz | zcat | awk 'NF==3 && $2>0 && $3>0' | wc -l
0

Because to_paf.py was used instead of chain2paf all the simultaneous chain gaps in both target and query were removed. I cannot fully understand what happens downstream as a result, but maybe the GRCh38 <-> T2T-CHM13v2.0 chain files should be regenerated as a result.

milkschen commented 11 months ago

Hi @freeseek ,

Thank you for reporting the issue with great details. This is highly appreciated. I will look into this.

milkschen commented 11 months ago

Hi,

chaintools would prefer 1M (whether match or mismatch) than a deletion plus an insertion (i.e., 1I+1D) at a locus. Thus, it is a design to not show a record like 340 700 701.

Therefore, for the grch38 <-> chm13v2 chains, it intended to avoid simultaneous chain gaps. We would be happy to update the chain files if there are clear examples showing improvements using another strategy.

However, I would agree that the default to_paf behavior does not fully translate the chain information. At least an improved doc would avoid the confusion you raised.

freeseek commented 11 months ago

Do you still have the original input.chain file generated by the nf-LO pipeline? I am worried that you might have a situation where a small inversion could have taken place, split.py would not have broken the chain at that locus and to_paf.py would have forced the bases within the inversion to match while another chain with the inverse orientation would have taken care of the small inversion instead. It would be some work to find such an example, but do you have a reason for why something like this wouldn't actually happen?

freeseek commented 11 months ago

Here is an example:

$ wget https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chain/v1_nflo/grch38-chm13v2.chain
$ cat grch38-chm13v2.chain | awk 'NR>=255691 && NR<=255723'
chain   255 chr17   83257441    +   21975454    21992058    chr17   84276897    +   21947772    21960209    0
425 5   0
202 15  0
278 5   0
127 0   73
115 0   120
125 5   0
5   0   5
460 5   0
87  10  0
3219    0   10
27  5   0
438 10  0
391 10  0
46  0   10
150 15  0
466 0   5
1352    65  0
220 0   10
439 0   5
1335    4245    0
59  0   5
66  5   0
34  15  0
128 0   5
230 10  0
679 0   5
370 10  0
67  10  0
107 0   5
253 0   10
68  0   10
191
$ cat grch38-chm13v2.chain | awk 'NR>=255691 && NR<=255723' > input.chain
$ chain2paf -i input.chain -f hg38.fa hs1.fa
chr17   84276897    21947772    21960209    +   chr17   83257441    21975454    21992058    11131   16604   255 cg:Z:425=5D56=1X145=15D278=5D127=73I20X17=1X77=120I12=1X112=5D5=5I32=1X427=5D38=1X48=10D1583=1X155=1X62=1X44=1X297=1X321=1X751=10I27=5D438=10D8=1X80=1X67=1X214=1X18=10D46=10I143=1X6=15D39=1X205=1X116=1X19=1X83=5I1352=65D37X183=10I439=5I23=1X349=553X1=30X1=5X1=2X1=8X1=14X1=7X1=36X1=2X1=12X1=7X1=25X1=3X2=10X1=13X1=10X1=14X1=7X1=15X1=2X1=12X1=1X1=7X1=35X1=3X1=8X1=10X1=3X1=30X1=34X1=6X1=2X1=3X4245D55=1X3=5I9=1X32=1X15=1X7=5D29=1X4=15D31=2X22=2X5=1X24=1X15=1X21=1X2=5I7=1X1=1X13=1X6=1X199=10D38=1X10=1X568=1X60=5I271=1X98=10D67=10D107=5I253=10I68=10I191=
$ chain2paf -i input.chain -f hg38.fa hs1.fa | cut -f13 | sed 's/\([MIDX=]\)/\1\n/g' | awk 'NR>=80 && NR<=149'
5I
23=
1X
349=
553X
1=
30X
1=
5X
1=
2X
1=
8X
1=
14X
1=
7X
1=
36X
1=
2X
1=
12X
1=
7X
1=
25X
1=
3X
2=
10X
1=
13X
1=
10X
1=
14X
1=
7X
1=
15X
1=
2X
1=
12X
1=
1X
1=
7X
1=
35X
1=
3X
1=
8X
1=
10X
1=
3X
1=
30X
1=
34X
1=
6X
1=
2X
1=
3X
4245D

My best guess is that to_paf.py in this particular case replaced the chain gap:

373 5231    986

with the chain gap:

1359    4245    0

And it caused a huge amount of base pairs to mismatch for sequence that is completely unrelated

Of course, this problem is very sporadic, but it is probably widespread in the chain file and it is specific to the chain files processed with to_paf.py and it is not present in the chain files generated by UCSC (e.g. hg38ToHs1.over.chain.gz)

milkschen commented 10 months ago

Thanks for the informative example!

I added an option in to_paf.py (#17) to incorporate your suggestion.

In the CHM13-GRCh{37,38} chain, we used the reduction technique for chain records with a break point (simultaneous indel) <= 1000 base pairs. A chain record with a larger break points were split and considered independently.

maximilianh commented 10 months ago

Does this mean that we should update our chm13/hg38 chain files at UCSC ?

On Mon, Oct 23, 2023 at 06:58 Nae-Chyun Chen @.***> wrote:

Thanks for the informative example!

I added an option in to_paf.py (#17 https://github.com/milkschen/chaintools/pull/17) to incorporate your suggestion.

In the CHM13-GRCh{37,38} chain, we used the reduction technique for chain records with a break point (simultaneous indel) <= 1000 base pairs. A chain record with a larger break points were split and considered independently.

— Reply to this email directly, view it on GitHub https://github.com/milkschen/chaintools/issues/11#issuecomment-1774437536, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACL4TLCAUCV2VOHKQXXUPLYAX2OPAVCNFSM6AAAAAA5KMKDQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZUGQZTONJTGY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

freeseek commented 10 months ago

I do not understand why it would have to be an option with the original behavior being the default behavior. It is quite clear that this is a bug. The previous example causes the GRCh38 region:

chr17   21984176    21985511

which includes a gap in GRCh38, to match the region:

$ echo -e "chr17\t21984176\t21985511" | ./liftOver /dev/stdin hg38-chm13v2.over.chain.gz /dev/stdout /dev/stderr
Reading liftover chains
Mapping coordinates
chr17   21956582    21957917

Using the T2T chain file SNP rs1165854003 with GRCh38 position chr17:21985490 gets mapped to T2T position chr17:21957896. Now this makes no sense at all as the two regions have no homology whatsoever:

$ samtools faidx hg38.fa chr17:21985480-21985500
>chr17:21985480-21985500
GGAATGGAATGGAATGGAATG
$ samtools faidx hs1.fa chr17:21957886-21957906
>chr17:21957886-21957906
atggaatggatttgaatcaac

The behavior of to_paf.py is incorrect. It is not a matter of preference. And my take is that UCSC should not redistribute the T2T chain files until they are fixed