marbl / MashMap

A fast approximate aligner for long DNA sequences
Other
267 stars 39 forks source link

Inconsistent alignments depending on whether reference file is gzipped or not #70

Open AaronRuben opened 1 week ago

AaronRuben commented 1 week ago

Hi,

I am using mashmap v3.1.3 and I noticed that the same alignment was reported multiple times in the output file. For example, here is the output for reference chromosome 21:

ptg000091l  861542  0   861542  +   chr21   45090682    152523  1014065 13  861542  20  id:f:0.990785   kc:f:0.283563
ptg000217l  527266  0   527266  +   chr21   45090682    178973  706239  12  527266  20  id:f:0.990089   kc:f:0.331461
ptg000217l  527266  0   527266  +   chr21   45090682    178973  706239  12  527266  20  id:f:0.990089   kc:f:0.331461
ptg000203l  686331  0   686331  +   chr21   45090682    179063  865394  15  686331  25  id:f:0.996609   kc:f:0.235299
ptg000268l  916773  0   916773  +   chr21   45090682    582998  1499771 3   916773  12  id:f:0.93172    kc:f:0.0635271
ptg000268l  916773  0   916773  +   chr21   45090682    582998  1499771 3   916773  12  id:f:0.93172    kc:f:0.0635271
ptg000268l  916773  0   916773  +   chr21   45090682    582998  1499771 3   916773  12  id:f:0.93172    kc:f:0.0635271
ptg000145l  2623179 0   2623179 +   chr21   45090682    1133648 3756827 12  2623179 19  id:f:0.98662    kc:f:0.255438
ptg000158l  1222257 0   1222257 +   chr21   45090682    1565548 2787805 3   1222257 12  id:f:0.936192   kc:f:0.214653
ptg000158l  1222257 0   1222257 +   chr21   45090682    1565548 2787805 3   1222257 12  id:f:0.936192   kc:f:0.214653
ptg000158l  1222257 0   1222257 +   chr21   45090682    1565548 2787805 3   1222257 12  id:f:0.936192   kc:f:0.214653
ptg000158l  1222257 0   1222257 +   chr21   45090682    1565548 2787805 3   1222257 12  id:f:0.936192   kc:f:0.214653
ptg000158l  1222257 0   1222257 +   chr21   45090682    1565548 2787805 3   1222257 12  id:f:0.936192   kc:f:0.214653
ptg000158l  1222257 0   1222257 +   chr21   45090682    1565548 2787805 3   1222257 12  id:f:0.936192   kc:f:0.214653
ptg000160l  848270  0   848270  +   chr21   45090682    2308763 3157033 8   848270  16  id:f:0.972836   kc:f:0.583853
ptg000160l  848270  0   848270  +   chr21   45090682    2308763 3157033 8   848270  16  id:f:0.972836   kc:f:0.583853
ptg000151l  555938  0   555938  +   chr21   45090682    2459556 3015494 15  555938  22  id:f:0.993434   kc:f:1.23635
ptg000172l  1014256 0   1014256 +   chr21   45090682    2577878 3592134 10  1014256 17  id:f:0.980634   kc:f:0.597447
ptg000215l  318592  0   318592  +   chr21   45090682    5433633 5752225 19  318592  29  id:f:0.998634   kc:f:0.130072
ptg000025l  2865536 0   2865536 +   chr21   45090682    5830375 8695911 10  2865536 17  id:f:0.978886   kc:f:0.474346
ptg000176l  584981  0   584981  -   chr21   45090682    8297841 8882822 9   584981  16  id:f:0.977014   kc:f:0.599697
ptg000176l  584981  0   584981  +   chr21   45090682    8297841 8882822 9   584981  16  id:f:0.977014   kc:f:0.599697
ptg000176l  584981  0   584981  +   chr21   45090682    8297841 8882822 9   584981  16  id:f:0.977014   kc:f:0.599697
ptg000176l  584981  0   584981  +   chr21   45090682    8297841 8882822 9   584981  16  id:f:0.977014   kc:f:0.599697
ptg000093l  3796125 0   3796125 +   chr21   45090682    8542762 12338887    19  3796125 29  id:f:0.998634   kc:f:0.41517
ptg000125l  2794119 0   2794119 +   chr21   45090682    9318318 12112437    18  2794119 25  id:f:0.997158   kc:f:0.524907
ptg000290l  812527  0   812527  +   chr21   45090682    10614047    11426574    4   812527  13  id:f:0.945935   kc:f:0.117817
ptg000293l  206332  0   206332  +   chr21   45090682    10988834    11195166    10  206332  17  id:f:0.978886   kc:f:0.0164015
ptg000292l  229265  0   229265  +   chr21   45090682    11199120    11428385    11  229265  17  id:f:0.982112   kc:f:0.0476858
ptg000033l  27686851    0   27686851    +   chr21   45090682    24117979    45090681    20  27686851    255 id:f:1  kc:f:0.872569
ptg000115l  3037875 0   3037875 +   chr21   45090682    38473000    41510875    18  3037875 25  id:f:0.997158   kc:f:0.865375
ptg000147l  1146249 0   1146249 +   chr21   45090682    42104328    43250577    19  1146249 29  id:f:0.998634   kc:f:1.36605
ptg000047l  2315813 0   2315813 +   chr21   45090682    43904919    45090681    19  2315813 29  id:f:0.998634   kc:f:1.30252

I found this odd so I re-ran it. This time I happened to use an uncompressed version of my reference sequence file and I didn't get duplicated alignments, but I got some new alignments and the positions of previously found alignments changed. Here is again the output for reference chr21:

ptg000091l  861542  0   861542  +   chr21   45090682    173803  1035345 29  861542  22  id:f:0.993222   kc:f:0.33354
ptg000203l  686331  0   686331  +   chr21   45090682    179063  865394  30  686331  22  id:f:0.994209   kc:f:0.315678
ptg000145l  2623179 0   2623179 +   chr21   45090682    1170940 3794119 24  2623179 19  id:f:0.98662    kc:f:0.182862
ptg000151l  555938  0   555938  +   chr21   45090682    2499197 3055135 24  555938  19  id:f:0.98662    kc:f:0.900919
ptg000172l  1014256 0   1014256 +   chr21   45090682    2682656 3696912 27  1014256 20  id:f:0.990289   kc:f:0.642754
ptg000215l  318592  0   318592  +   chr21   45090682    5420070 5738662 38  318592  29  id:f:0.998634   kc:f:0.145981
ptg000025l  2865536 0   2865536 +   chr21   45090682    7220950 10086486    21  2865536 17  id:f:0.981403   kc:f:0.455276
ptg000176l  584981  0   584981  -   chr21   45090682    8381941 8966922 16  584981  16  id:f:0.971897   kc:f:0.557647
ptg000176l  584981  0   584981  +   chr21   45090682    8381941 8966922 16  584981  16  id:f:0.971897   kc:f:0.557647
ptg000176l  584981  0   584981  +   chr21   45090682    8381941 8966922 16  584981  16  id:f:0.971897   kc:f:0.557647
ptg000176l  584981  0   584981  +   chr21   45090682    8381941 8966922 16  584981  16  id:f:0.971897   kc:f:0.557647
ptg000125l  2794119 0   2794119 +   chr21   45090682    9374562 12168681    33  2794119 23  id:f:0.995431   kc:f:0.500231
ptg000293l  206332  0   206332  +   chr21   45090682    11091705    11298037    21  206332  17  id:f:0.980549   kc:f:0.0194184
ptg000292l  229265  0   229265  +   chr21   45090682    11199120    11428385    25  229265  19  id:f:0.986286   kc:f:0.0644872
ptg000033l  27686851    0   27686851    +   chr21   45090682    24229362    45090681    40  27686851    255 id:f:1  kc:f:1.00259
ptg000115l  3037875 0   3037875 +   chr21   45090682    39989460    43027335    37  3037875 27  id:f:0.997911   kc:f:0.951376
ptg000147l  1146249 0   1146249 +   chr21   45090682    42183625    43329874    38  1146249 29  id:f:0.998634   kc:f:1.11024
ptg000047l  2315813 0   2315813 +   chr21   45090682    43904919    45090681    38  2315813 29  id:f:0.998634   kc:f:1.22872

I used these commands:

mashmap --perc_identity 95 --noSplit -r hs1.fa.gz -q hifiasm.bp.unified.fa --threads 32 -o test1.mashmap
gunzip hs1.fa.gz
mashmap --perc_identity 95 --noSplit -r hs1.fa -q hifiasm.bp.unified.fa --threads 32 -o test.mashmap

Any ideas what could trigger such a behavior?

Thanks, Aaron

bkille commented 1 week ago

Thanks for posting the issue, thats definitely not right... I'll take a look and see if I can replicate it. In the meantime, please feel free to post the stderr output of the two commands.

AaronRuben commented 1 week ago

Hi, Thanks for the quick response. I did some debugging on my end, and the duplication of exactly the same alignment was a bug on my end (I had the same contig was included multiple times in the query fasta). However, even on a clean fasta, I do get different alignments depending on whether the reference is gzipped or not. I have the stderr attached.

Thanks again, Aaron compressed.stderr.txt uncompressed.stderr.txt

bkille commented 1 week ago

Ahh! Ok that makes sense. As far as the different results based on the reference goes, it looks MashMap is finding twice as many unique k-mer seeds in the uncompressed reference as opposed to the compressed reference (19,972,584 vs 10,660,334).

  1. Can you confirm that the compressed and uncompressed reference files contain the same data? The following command should not output anything if they are identical:

    diff <(gzip -dc hs1.fa.gz) hs1.fa
  2. In the case that they are identical, can you confirm that the issue persists even if you delete any index files (hs1.fa.gz.gzi, hs1.fa.gz.fai, and hs1.fa.fai)? Perhaps the index file was created for a previous version of the file and has not been updated.

AaronRuben commented 6 days ago

Hi,

Re 1: Yes, I could verify that hs1.fa and hs1.fa.gz are identical

Re 2: I created copies of the reference files to and re-ran it mashmap but the issue persists. So I don't think lingering index files are the issue.

Unfortunately, I can't share my data, yet, but I think it might be an issue specific to my input sequence (it's a primary assembly from hifiasm). I tried to align the reference against itself using the uncompressed version as query and once the compressed and once the uncompressed version as reference, respectively. It still generates twice as many unique k-mer seeds in the case of using an uncompressed reference but it generates consistent alignments. I don't know if it's relevant but the k-mer complexities are different. I have attached stderr outputs and alignment outputs. uncompressed.stderr.txt uncompressed.mashmap.txt compressed.mashmap.txt compressed.stderr.txt

And the reference sequence is available from here: https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz

Thanks, Aaron

bkille commented 6 days ago

From looking at the output logs, the sketch size of the compressed version was 20, but the uncompressed was 40.

I was replicate the issue and actually, it looks like this has been around since MashMap2! Basically, in Section 5 of the original MashMap paper, they show that the sketch size which satisfies their constraints depends on the reference size. Since MashMap2, the raw file size has been used to determine the reference size.

The easiest hack would to just be to decompress the file twice (once to compute the reference length, then again to actually index it), but w/ a large file thats an extra 30 seconds for no good reason. Perhaps we'll just warn users that without a .fai/.gzi index, MashMap will have to compute the file size.

In the meantime, you can set the sketch size to 40 manually to ensure consistent results.

bkille commented 6 days ago

As a side note, with --noSplit, the sketch size is the same for each query sequence, i.e. even a query contig thats 10Mbp will only have 40 sketched k-mers. If you are using --noSplit, I'd recommend using a larger sketch size (maybe ~100) setting and a --segmentLength closer to the size of the smallest contig.

AaronRuben commented 1 day ago

Thank you!