Open subwaystation opened 7 months ago
Replace -n 7
with -n 1
and you will be happy! The mapper has changed (and is changing) a lot, but not the documentation around the internet...
And also -Y #
would help
LOL.
So PanSN-spec mandatory? Why -n 1
? You make nf-core/pangenome unhappy.
wfmash scerevisiae8.fasta.gz scerevisiae8.fasta.gz --threads 16 -s 5000 -p 90.0 -X -l 25000 -k 19 -H 0.001 -2 30 -Y'#' -n 1 > scerevisiae8.fasta.gz.ERIK.paf
./git/wfmash/scripts/paf2dotplot png large scerevisiae8.fasta.gz.ERIK.paf
I can't see any difference. What am I missing?
wfmash --version
v0.12.6-0-g7205bf7
Might be good to remove -X. It could be conflicting.
this didn't do the trick
The input sequences are all just SC888#ChrI
and not SC888#1#ChrI
. But that should not be the issue here.
Could you please send an example command line where I can just get the diagonal?
So PanSN-spec mandatory? Why
-n 1
? You make nf-core/pangenome unhappy.
Hi @subwaystation, we're updating wfmash w/ the improvements from MashMap3.
One of these improvements requires that filtering is applied to each pair of assemblies independently. Consider the extreme/contrived scenario where SGDref
has 7 copies of a particular region, where each copy is 100% identical to a single region in genome SK1
. In the old version of wfmash with -n 7
, you might see that region in genome SK1
mapped to each of the 7 copies in genome SGDref
, kicking out all of the mappings to the other genomes. To fix this, MashMap3 applies the -n
parameter (as well as other filtering parameters) to each pair of assemblies independently.
In a similar fashion, MashMap3 also filters secondary mappings based on the best hit. Again, its important that it can do this independently across each target assembly. For example, if our query was a copy of SGDref
, its "top hit" for every region would be 100%, but we shouldn't consider that the "top hit" for each target assembly.
The input sequences are all just
SC888#ChrI
and notSC888#1#ChrI
. But that should not be the issue here.Could you please send an example command line where I can just get the diagonal?
Also, the organization of the plotting script doesn't seem to be clustering the mappings very well. Here's what happens if I plot the alignments w/ pafplot. There aren't any contig names unfortunately, but at least you can see that things are lining up a bit more like expected.
>$ wfmash ~/Data/pangenomes/scerevisiae8.fa -m --threads 16 -s 5000 -p 90.0 -X -l 25000 -k 19 -H 0.001 -2 30 -Y'#' -n 1 > scerevisiae8.fasta.gz.ERIK.paf
[wfmash] Warning: Detected single file all-vs-all mapping with no other options. Consider adding -L, --lower-triangular for efficiency.
[mashmap] Skipping self mappings for single file all-vs-all mapping.
[mashmap] MashMap v3.1.1
[mashmap] Reference = [/home/Users/blk6/Data/pangenomes/scerevisiae8.fa]
[mashmap] Query = [/home/Users/blk6/Data/pangenomes/scerevisiae8.fa]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 298
[mashmap] Segment length = 5000 (read split allowed)
[mashmap] Block length min = 25000
[mashmap] Chaining gap max = 20000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 90%
[mashmap] Skip self mappings
[mashmap] Skipping sequences containing the same prefix based on the delimiter "#"
[mashmap] Hypergeometric filter w/ delta = 0.3 and confidence 0.999
[mashmap] Mapping output file = /dev/stdout
[mashmap] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads = 16
[mashmap::skch::Sketch::build] minmer windows picked from reference = 11171389
[mashmap::skch::Sketch::index] unique minmers = 997130
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 117167) ... (11144, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 3208 times during lookup.
[wfmash::map] time spent computing the reference index: 6.29692 sec
[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.37e+07 bp/s elapsed: 00:00:00:07 remain: 00:00:00:00
[mashmap::skch::Map::mapQuery] count of mapped reads = 136, reads qualified for mapping = 136, total input reads = 136, total input bp = 96255507
[wfmash::map] time spent mapping the query: 7.06e+00 sec
[wfmash::map] mapping results saved in: /dev/stdout
>$ pafplot scerevisiae8.fasta.gz.ERIK.paf --png scerevisiae8-all.png --size 2000
It seems I just used the wrong tool for plotting. Above the plot including base pair level alignments.
From my observations so far, changing -n
for this data set didn't change the mappings according to the pafplot.
Thanks @bkille
That's not too surprising that the results are the same. The -n
flag only specifies the maximum number of matches for a segment. Unless a region has an alternative mapping of decent ANI and length, there will likely be only one mapping regardless of the maximum number of specified.
aah got it!
assuming there would be more mappings in the PAF, do you know how this would affect the pangenome graph construction with seqwish?
or would this be resolved with wfmash during the alignment step?
@ekg Would the lower triangle mode be sufficient as input for pggb?
Following question, @bkille
I use the sequence without "#" and follow panSN, why the running time is so different?
Name: Chr1
, this alignment only have centromere / rDNA mapping in the final paf
wfmash -s 10000 -p 80 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
5101.41s user 14.62s system 1369% cpu 373.69s total 3126988Kb max memory
--
wfmash -s 10000 -p 90 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
1700.41s user 13.18s system 1140% cpu 150.23s total 3043676Kb max memory
--
wfmash -s 10000 -p 95 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
103.47s user 11.58s system 417% cpu 27.54s total 1458372Kb max memory
Name: Col-CC#1#Chr1
, Rabacal-1#1#Chr1
wfmash -a -s 10000 -p 80 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
2987.04s user 13.63s system 772% cpu 388.25s total 3944900Kb max memory
--
wfmash -a -s 10000 -p 90 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
1897.45s user 13.90s system 1092% cpu 175.01s total 4115444Kb max memory
--
wfmash -a -s 10000 -p 95 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
625.62s user 12.68s system 901% cpu 70.81s total 3093592Kb max memory
@baozg As far as I figured just right now, you need to add --lower-triangle
to your command line. You still get the all-vs-all relationship, but just from one side. Sufficient!
But such things are not documented... :P @bkille @AndreaGuarracino
The changes are at least verbose in the actual pggb.sh script in the pggb repo. Looking through the blame, --lower-triangle
was added 6 months ago in pangenome/pggb@1478789 and the same for the change in meaning for the mapping parameter -n
pangenome/pggb@481ec23.
Caveat emptor when building from latest source rather than pinned releases I guess
Hi developers :)
running
It seems I am getting too many alignments?
I would have expected to only get the diagonal output in an all-vs-all setting, right? Or am I missing something?
Best, Simon