waveygang / wfmash

base-accurate DNA sequence alignments using WFA and mashmap3
MIT License
174 stars 18 forks source link

too many alignments with v0.12.6 #227

Open subwaystation opened 7 months ago

subwaystation commented 7 months ago

Hi developers :)

running

wfmash \
    scerevisiae8.fasta.gz \
    scerevisiae8.fasta.gz \
     \
    --threads 4 \
     \
    -n 7 -s 5000 -p 90.0  -X  -l 25000 -k 19 -H 0.001   -2 30 > scerevisiae8.fasta.gz.paf    

It seems I am getting too many alignments?

out

I would have expected to only get the diagonal output in an all-vs-all setting, right? Or am I missing something?

Best, Simon

AndreaGuarracino commented 7 months ago

Replace -n 7 with -n 1 and you will be happy! The mapper has changed (and is changing) a lot, but not the documentation around the internet...

AndreaGuarracino commented 7 months ago

And also -Y # would help

subwaystation commented 7 months ago

LOL.

subwaystation commented 7 months ago

So PanSN-spec mandatory? Why -n 1? You make nf-core/pangenome unhappy.

subwaystation commented 7 months ago
wfmash scerevisiae8.fasta.gz scerevisiae8.fasta.gz --threads 16  -s 5000 -p 90.0  -X  -l 25000 -k 19 -H 0.001   -2 30 -Y'#' -n 1 > scerevisiae8.fasta.gz.ERIK.paf

./git/wfmash/scripts/paf2dotplot png large scerevisiae8.fasta.gz.ERIK.paf

out

I can't see any difference. What am I missing?

wfmash --version
v0.12.6-0-g7205bf7
ekg commented 7 months ago

Might be good to remove -X. It could be conflicting.

subwaystation commented 7 months ago

this didn't do the trick

subwaystation commented 7 months ago

The input sequences are all just SC888#ChrI and not SC888#1#ChrI. But that should not be the issue here.

Could you please send an example command line where I can just get the diagonal?

bkille commented 7 months ago

So PanSN-spec mandatory? Why -n 1? You make nf-core/pangenome unhappy.

Hi @subwaystation, we're updating wfmash w/ the improvements from MashMap3.

One of these improvements requires that filtering is applied to each pair of assemblies independently. Consider the extreme/contrived scenario where SGDref has 7 copies of a particular region, where each copy is 100% identical to a single region in genome SK1. In the old version of wfmash with -n 7, you might see that region in genome SK1 mapped to each of the 7 copies in genome SGDref, kicking out all of the mappings to the other genomes. To fix this, MashMap3 applies the -n parameter (as well as other filtering parameters) to each pair of assemblies independently.

In a similar fashion, MashMap3 also filters secondary mappings based on the best hit. Again, its important that it can do this independently across each target assembly. For example, if our query was a copy of SGDref, its "top hit" for every region would be 100%, but we shouldn't consider that the "top hit" for each target assembly.

bkille commented 7 months ago

The input sequences are all just SC888#ChrI and not SC888#1#ChrI. But that should not be the issue here.

Could you please send an example command line where I can just get the diagonal?

Also, the organization of the plotting script doesn't seem to be clustering the mappings very well. Here's what happens if I plot the alignments w/ pafplot. There aren't any contig names unfortunately, but at least you can see that things are lining up a bit more like expected.

scerevisiae8-all

>$ wfmash ~/Data/pangenomes/scerevisiae8.fa -m --threads 16  -s 5000 -p 90.0  -X  -l 25000 -k 19 -H 0.001   -2 30 -Y'#' -n 1 > scerevisiae8.fasta.gz.ERIK.paf
[wfmash] Warning: Detected single file all-vs-all mapping with no other options. Consider adding -L, --lower-triangular for efficiency.
[mashmap] Skipping self mappings for single file all-vs-all mapping.
[mashmap] MashMap v3.1.1
[mashmap] Reference = [/home/Users/blk6/Data/pangenomes/scerevisiae8.fa]
[mashmap] Query = [/home/Users/blk6/Data/pangenomes/scerevisiae8.fa]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 298
[mashmap] Segment length = 5000 (read split allowed)
[mashmap] Block length min = 25000
[mashmap] Chaining gap max = 20000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 90%
[mashmap] Skip self mappings
[mashmap] Skipping sequences containing the same prefix based on the delimiter "#"
[mashmap] Hypergeometric filter w/ delta = 0.3 and confidence 0.999
[mashmap] Mapping output file = /dev/stdout
[mashmap] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads  = 16
[mashmap::skch::Sketch::build] minmer windows picked from reference = 11171389
[mashmap::skch::Sketch::index] unique minmers = 997130
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 117167) ... (11144, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 3208 times during lookup.
[wfmash::map] time spent computing the reference index: 6.29692 sec
[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.37e+07 bp/s elapsed: 00:00:00:07 remain: 00:00:00:00
[mashmap::skch::Map::mapQuery] count of mapped reads = 136, reads qualified for mapping = 136, total input reads = 136, total input bp = 96255507
[wfmash::map] time spent mapping the query: 7.06e+00 sec
[wfmash::map] mapping results saved in: /dev/stdout

>$ pafplot scerevisiae8.fasta.gz.ERIK.paf --png scerevisiae8-all.png --size 2000
subwaystation commented 7 months ago

scerevisiae8 fasta gz paf

It seems I just used the wrong tool for plotting. Above the plot including base pair level alignments. From my observations so far, changing -n for this data set didn't change the mappings according to the pafplot.

subwaystation commented 7 months ago

Thanks @bkille

bkille commented 7 months ago

That's not too surprising that the results are the same. The -n flag only specifies the maximum number of matches for a segment. Unless a region has an alternative mapping of decent ANI and length, there will likely be only one mapping regardless of the maximum number of specified.

subwaystation commented 7 months ago

aah got it!

assuming there would be more mappings in the PAF, do you know how this would affect the pangenome graph construction with seqwish?

subwaystation commented 7 months ago

or would this be resolved with wfmash during the alignment step?

subwaystation commented 7 months ago

@ekg Would the lower triangle mode be sufficient as input for pggb?

baozg commented 7 months ago

Following question, @bkille

I use the sequence without "#" and follow panSN, why the running time is so different?

Name: Chr1, this alignment only have centromere / rDNA mapping in the final paf

wfmash -s 10000 -p 80 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
5101.41s user 14.62s system 1369% cpu 373.69s total 3126988Kb max memory
--
wfmash -s 10000 -p 90 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
1700.41s user 13.18s system 1140% cpu 150.23s total 3043676Kb max memory
--
wfmash -s 10000 -p 95 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
103.47s user 11.58s system 417% cpu 27.54s total 1458372Kb max memory

Name: Col-CC#1#Chr1, Rabacal-1#1#Chr1

wfmash -a -s 10000 -p 80 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
2987.04s user 13.63s system 772% cpu 388.25s total 3944900Kb max memory
--
wfmash -a -s 10000 -p 90 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
1897.45s user 13.90s system 1092% cpu 175.01s total 4115444Kb max memory
--
wfmash -a -s 10000 -p 95 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
625.62s user 12.68s system 901% cpu 70.81s total 3093592Kb max memory
subwaystation commented 7 months ago

@baozg As far as I figured just right now, you need to add --lower-triangle to your command line. You still get the all-vs-all relationship, but just from one side. Sufficient!

subwaystation commented 7 months ago

But such things are not documented... :P @bkille @AndreaGuarracino

ASLeonard commented 7 months ago

The changes are at least verbose in the actual pggb.sh script in the pggb repo. Looking through the blame, --lower-triangle was added 6 months ago in pangenome/pggb@1478789 and the same for the change in meaning for the mapping parameter -n pangenome/pggb@481ec23.

Caveat emptor when building from latest source rather than pinned releases I guess