Open AaronRuben opened 1 week ago
Thanks for posting the issue, that's definitely not right... I'll take a look and see if I can replicate it. In the meantime, please feel free to post the stderr output of the two commands.
Hi, thanks for the quick response. I did some debugging on my end, and the duplication of exactly the same alignment was a bug on my end (the same contig was included multiple times in the query fasta). However, even on a clean fasta, I do get different alignments depending on whether the reference is gzipped or not. I have attached the stderr output.
Thanks again, Aaron compressed.stderr.txt uncompressed.stderr.txt
Ahh! Ok, that makes sense. As for the different results depending on the reference, it looks like MashMap is finding roughly twice as many unique k-mer seeds in the uncompressed reference as in the compressed reference (19,972,584 vs 10,660,334).
Can you confirm that the compressed and uncompressed reference files contain the same data? The following command should not output anything if they are identical:
diff <(gzip -dc hs1.fa.gz) hs1.fa
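Where process substitution is not available, the same check can be sketched in Python. This is only a minimal sketch: the function name `same_content` and the chunk size are illustrative choices, not part of MashMap.

```python
import gzip


def same_content(gz_path, plain_path, chunk_size=1 << 20):
    """Return True if the decompressed gz_path is byte-identical to plain_path."""
    with gzip.open(gz_path, "rb") as gz, open(plain_path, "rb") as plain:
        while True:
            a = gz.read(chunk_size)
            b = plain.read(chunk_size)
            if a != b:
                # Either the content differs or one file ended early.
                return False
            if not a:
                # Both streams are exhausted at the same point.
                return True
```

Calling `same_content("hs1.fa.gz", "hs1.fa")` should return `True` in exactly the cases where the `diff` command prints nothing.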
In the case that they are identical, can you confirm that the issue persists even if you delete any index files (hs1.fa.gz.gzi, hs1.fa.gz.fai, and hs1.fa.fai)? Perhaps the index file was created for a previous version of the file and has not been updated.
Hi,
Re 1: Yes, I could verify that hs1.fa and hs1.fa.gz are identical
Re 2: I created copies of the reference files and re-ran mashmap, but the issue persists. So I don't think lingering index files are the issue.
Unfortunately, I can't share my data yet, but I think it might be an issue specific to my input sequence (it's a primary assembly from hifiasm). I tried aligning the reference against itself, using the uncompressed version as the query and, in turn, the compressed and the uncompressed version as the reference. The uncompressed reference still yields twice as many unique k-mer seeds, but the alignments are consistent between the two runs. I don't know if it's relevant, but the k-mer complexities are different. I have attached stderr outputs and alignment outputs. uncompressed.stderr.txt uncompressed.mashmap.txt compressed.mashmap.txt compressed.stderr.txt
And the reference sequence is available from here: https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz
Thanks, Aaron
From looking at the output logs, the sketch size of the compressed version was 20, but the uncompressed was 40.
I was able to replicate the issue, and it actually looks like this has been around since MashMap2! Basically, in Section 5 of the original MashMap paper, they show that the sketch size which satisfies their constraints depends on the reference size. Since MashMap2, the raw file size has been used to determine the reference size, so a gzipped reference looks smaller than it really is and ends up with a smaller sketch size.
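To illustrate why the raw file size is a poor proxy for a gzipped reference, one can compare the bytes on disk with the number of bases the file actually holds. This is a sketch only: `open_fasta`, `count_bases`, and `size_report` are hypothetical helpers, not MashMap code, and the mapping from reference size to sketch size is deliberately not reproduced here.

```python
import gzip
import os


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_bases(path):
    """Count sequence characters, skipping header lines and line endings."""
    total = 0
    with open_fasta(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def size_report(path):
    """Return (bytes on disk, bases in file) for one FASTA file."""
    return os.path.getsize(path), count_bases(path)
```

For two files with identical content, such as `hs1.fa` and `hs1.fa.gz`, the base counts should match while the byte counts differ, which is consistent with the different sketch sizes reported in the logs.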
The easiest hack would be to just decompress the file twice (once to compute the reference length, then again to actually index it), but with a large file that's an extra 30 seconds for no good reason. Perhaps we'll just warn users that without a .fai/.gzi index, MashMap will have to compute the file size.
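As a rough illustration of the index-based route, the total reference length can be read from a `.fai` file without decompressing anything. The sketch below assumes a samtools-style index sitting next to the FASTA under the name `<fasta>.fai` (as with `hs1.fa.gz.fai` above); `length_from_fai` is a hypothetical helper and not how MashMap itself is implemented.

```python
import os


def length_from_fai(fasta_path):
    """Return the total sequence length from <fasta_path>.fai, or None if absent.

    Each .fai line is tab-separated; the second column is the sequence length.
    """
    fai_path = fasta_path + ".fai"
    if not os.path.exists(fai_path):
        return None
    total = 0
    with open(fai_path) as fai:
        for line in fai:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                total += int(fields[1])
    return total
```

A stale index would silently give the wrong length here, so this approach is only as reliable as the index it reads.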
In the meantime, you can set the sketch size to 40 manually to ensure consistent results.
As a side note, with --noSplit, the sketch size is the same for each query sequence, i.e. even a query contig that's 10 Mbp will only have 40 sketched k-mers. If you are using --noSplit, I'd recommend using a larger sketch size (maybe ~100) and a --segmentLength closer to the size of the smallest contig.
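To pick a segment length near the smallest contig, the contig sizes in the query FASTA have to be known first. A minimal sketch for listing them follows; `contig_lengths` and `smallest_contig` are hypothetical helper names, and the parser assumes a plain or gzipped FASTA with `>` header lines.

```python
import gzip


def contig_lengths(fasta_path):
    """Return a list of (header, length) pairs, keeping duplicate headers."""
    opener = gzip.open if fasta_path.endswith(".gz") else open
    records = []
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Start a new record; the header text follows the '>'.
                records.append([line[1:], 0])
            elif records and line:
                records[-1][1] += len(line)
    return [(name, length) for name, length in records]


def smallest_contig(fasta_path):
    """Return the (header, length) pair of the shortest contig."""
    return min(contig_lengths(fasta_path), key=lambda rec: rec[1])
```

Because duplicate headers are preserved in the list, the same output can also reveal a contig that was accidentally included more than once in the query file.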
Thank you!
Hi,
I am using mashmap v3.1.3 and I noticed that the same alignment was reported multiple times in the output file. For example, here is the output for reference chromosome 21:
I found this odd so I re-ran it. This time I happened to use an uncompressed version of my reference sequence file and I didn't get duplicated alignments, but I got some new alignments and the positions of previously found alignments changed. Here is again the output for reference chr21:
I used these commands:
Any ideas what could trigger such a behavior?
Thanks, Aaron