Open baozg opened 2 years ago
Apart from the required 12 columns in PAF, syri also requires a column containig cg
tag and corresponding CIGAR string (with =/X
, and not M
; like the output from the --eqx
parameter from minimap2). I would guess that if wfmash_PAF has that, then it would work fine.
wfmash did have cg
tag but the result is very different. I will check the CIGAR.
minimap2 | wfmash |
---|---|
chr09 | ch09 |
83447412 | 69406105 |
76599034 | 6700000 |
81799482 | 6760000 |
+ | + |
chr9 | chr09 |
69406105 | 83447412 |
63131656 | 6809033 |
68327197 | 6862542 |
5194689 | 44010 |
5200998 | 60000 |
60 | 5 |
tp:A:P | gi:f:94.3287 |
mm:i:302 | bi:f:65.6004 |
gn:i:6007 | md:f:94.1134 |
go:i:215 | wt:i:1393 |
cg:Z:117673=12I7732=1I7809=12I56783=14I8890 | pt:i:101 |
aa:i:9498 | |
ap:i:798 | |
cg:Z:48=1X13=1X10=1X8=1X7=1X7=1X4=1X10=1X34=1X63=1X6=1X4=1X1 |
The cigar is in the correct format. What other dependencies are there? Mapping quality is not set the same way.
I assume you mean requirements for the PAF to be read correctly by syri. Syri only uses alignment coordinates and cigar (from which it then calculates alignment identity). So I would guess that this PAF should work fine. Mapping quality values are not used so it is OK if these values are very different.
I use Col-0 and Ler-0 as test, but minimap2 and wfmash paf have substtational difference.
# minimap2 2.24
# wfmash 0.8.2
# syri 1.6
minimap2 -x asm5 -c -t 5 --eqx TAIR10.fa.gz Ler0.fa.gz > minimap2.paf wfmash -t 32 -p 95 -s 1k TAIR10.fa.gz Ler0.fa.gz > wfmash.paf
syri -c minimap2.paf -q Ler0.fa.gz -r TAIR10.fa.gz -F P --prefix minimap2 syri -c wfmash.paf -q Ler0.fa.gz -r TAIR10.fa.gz -F P --prefix wfmash
2. Synteny plot from [paf2dotplot](https://github.com/waveygang/wfmash/blob/master/scripts/paf2dotplot)
<img width="643" alt="image" src="https://user-images.githubusercontent.com/20680150/168740301-8bfbbdae-8a3e-4b99-9dba-cbd6bf5da7da.png">
3. Summary of syri
- minimap2
<img width="445" alt="image" src="https://user-images.githubusercontent.com/20680150/168743789-4751f9d7-0d92-4ee0-a77e-e54080cb10e9.png">
- wfmash (1k)
<img width="447" alt="截屏2022-05-17 下午2 33 36" src="https://user-images.githubusercontent.com/20680150/168744589-c0d23b00-0eba-4106-b641-7c5390c7d5ef.png">
- wfmash (5k)
@baozg is the amount of sequence in alignments measured from the input PAFs this different?
Is there some filtering in syri that might be kicking out alignments?
From basic awk, the amout aligned region of wfmash
is 109Mb, and minimap2
is 108Mb.
@mnshgl0110 Maybe SyRI
have some filtering step to remove repeat region ?
By default, syri filters out alignments with identity <90% and length <100bp in both reference and query. You can turn this filtering off by using the -f
parameter.
There is 14MB difference in the length of not aligning regions, so possible that if there are multiple small or low identity alignments generated by wfmash, then they were filtered out.
Is the identity gap compressed or block/blast identity?
On Tue, May 17, 2022, 14:41 Manish Goel @.***> wrote:
By default, syri filters out alignments with identity <90% and length <100bp in both reference and query. You can turn this filtering off by using the -f parameter.
There is 14MB difference in the length of not aligning regions, so possible that if there are multiple small or low identity alignments generated by wfmash, then they were filtered out.
— Reply to this email directly, view it on GitHub https://github.com/schneebergerlab/syri/issues/144#issuecomment-1128820301, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEJYGPTPT2O53YLQNZLVKOHXZANCNFSM5VUJFQNQ . You are receiving this because you commented.Message ID: @.***>
Expanding... I would not expect alignments <100bp. The minimum possible length should be 5x the segment size so around 5kb or 25kb in these tests.
So to be filtering out so much of the alignment, then it would seem that the identity filter is applied. If that's computed as matches/total alignment block then I could see this being rather strict. For instance a 1kb indel in a 10kb alignment would be filtered out. But if gaps are "compressed" in this calculation (counted as a single character irrespective of length) then something else might be going on.
Hypotheses...
Excellent. I turn off the filtering step of SyRI and wfmash result have much more variants than minimap2. Maybe I can tune the parameters of wfmash for variant calling.
For instance a 1kb indel in a 10kb alignment would be filtered out.
syri would indeed filter out such alignments and based on @baozg's latest results it seems that was the case all along. I will add some warning message to notify when large alignments get filtered out.
@ekg Is it possible to configure wfmash to actually break at such divergent regions and not align these gaps? These large gaps are often result of TE translocation/duplication. Aligning these regions as indels limits selection of alignments corresponding to actual TE movement. As a result, more indels are found and the genetic information that they are translocations/duplications is lost.
You can break PAF on gaps greater than a given length using @mrvollger's rustybam. https://github.com/mrvollger/rustybam
It's interesting to see the other side of this tradeoff. For pangenome graph building we try to obtain the longest syntenic alignments. This can hide some information about translocations and TE movements. Although I do wonder if you wouldn't be able to better identify breakpoints around indels by obtaining a base level alignment across them. Or, by finding the sequence inside the indel and realigning that, looking for multiple alignments. Or, simply by aligning things to a much larger number of candidates. Right now these tests are for only the best mapping, but for looking at TE movement you might want to look at the first N (e.g. 50) and set short segment length (1k seems reasonable for larger TEs).
Thanks for sharing rustybam, it looks quite useful. However, though removing the gaps would be helpful, this would not find the alignment corresponding to translocations. For example, I simulated 5 kb and 50 kb translocations and am curious whether wfmash can find alignments corresponding to these by default or would require parameter optimisation. ref.txt seq_up_trans5000.fa.txt seq_up_trans50000.fa.txt I tried to test it myself, however, wfmash (installed using conda) would not run as it complains about missing GSL.
I agree realigning indel regions could result in better breakpoints, I am just not sure how practical/efficient it would be to perform such alignments within syri. Ideally, I would rather prefer alignments generated by full-fledged aligners to be more precise at alignment ends.
From what I understand, in pan-genomes adding large indels is beneficial as that leads to easy to handle bubbles where as translocations and duplications result in problematic loops. So selection of large indels (at the cost of missing translocation/duplication information) is more of a design choice. But I guess that there would be other reasons as well.
Sorry about the conda installation of wfmash. We are aware of that, I think that @AndreaGuarracino may have a solution. This kind of issue is why I hope that fully containerized, and formally complete methods for software building and installation---like guix---become standard! https://github.com/waveygang/wfmash#guix
When making pangenome graphs, I'm often trying to get large-scale 1:1 homology relationships between genomes to be embedded in them. This leads to filtering for long 1:1 mappings in the wfmash mapping step. This is configurable, however. And, I think the way it gets set up is very application dependent so it is difficult to set generic defaults that will work in all contexts. I do want to be able to detect translocations natively in the alignment, and am not aware of issues detecting them should the length of the translocation be sufficient to make a separate mapping.
wfmash alignments are primarily defined by four key parameters, -s
, -l
, -p
, and -n
.
-s
: segment length. This is the seed length for mapping, and is akin to the kmer length used by minimap2, but several orders of magnitude larger. We make mappings using an extension of mashmap2 (let's call it mashmap3) which includes a second layer of chaining and mapping filtering that is designed to find 1:1 mappings across large SVs. In general, setting -s
lower makes the mapping more sensitive to small homologies. In practice, this can be set as low as 500bp even for relatively large genomes. Nevertheless, to reduce noise while getting very sensitive alignments, we use -s 10k
by default.
-l
: block length min, is a friend of -s
which filters mappings shorter than a given block length, which defaults to 5x (default) the length of -s
. This may be too aggressive in some contexts (like yours) but in others it is essential to reduce alignment noise. We do not expect a collinear chain of 5 segment mappings to occur very often by chance.
-p
: percent identity minimum for segment mappings. This sets a lower threshold on the ANI between segments that we consider "mapped" for the purpose of building larger mappings. This affects a number of parameters in the model, but is designed to be intuitive to use. We set -p 95
which is in general good for genomes of the same species.
-n
: number of mappings. By default, this is 1. But, setting it higher allow us to get multimappings (e.g. between two genomes) or N:N mappings in the case of pangenomes.
Thanks for the test case. Here's how I worked out how to get an alignment for the translocation.
First, I saw that the defaults didn't work, and I get a mapping but not the translocation.
wfmash ref.fa.gz seq_up_trans50000.fa >x.paf
pafplot x.paf
To attempt to catch the translocation, I set -s 1k
.
wfmash ref.fa.gz seq_up_trans50000.fa -s 1k >x.s1k.paf
This is one of the configurations that @baozg showed above.
This also catches your second, smaller example.
wfmash ref.fa.gz seq_up_trans5000.fa -s 1k >y.s1k.paf
Note that -s 10k
(default) doesn't pick up the 50kb example. It is near the edge of what we expect to be able to detect with that segment length. Specifically, it's filtered out with the -l
threshold, which is 50k by default. If we set -l 0
, then we do pick up the translocation.
wfmash ref.fa.gz seq_up_trans50000.fa -s 5k -l 0 >x.s5kl0.paf
Especially for TE-rich genome, like maize. wfmash's speed (30mins ) is much faster than minimap2 (~ 1day) for two accessions comparsion. We don't need to mask TE first, then align it with minimap2.
Btw I made a mistake in my explanation. The segment length default is 10kb, which is pretty long to detect events like this, but would work for larger translocations. I edited the comment to reduce future confusion.
Hi @mnshgl0110 , In my summary file, structural annotations have values only, but in sequence annotations, all counts are showing 0 only. like attached below
Syntenic regions 7188 2535240608 2528221404 Inversions 283 2464865 2439444 Translocations 8547 69687524 66748895 Duplications (reference) 955 6718295 - Duplications (query) 6400 - 26763094 Not aligned (reference) 13165 82618600 - Not aligned (query) 21911 - 80606870
SNPs 0 0 0 Insertions 0 - 0 Deletions 0 0 - Copygains 0 - 0 Copylosses 0 0 - Highly diverged 0 0 0 Tandem repeats 0 0 0
the command that I used was: syri -c out.filtered.coords -d out.filtered.delta -r reference.fasta -q query.fasta -f --nc 50 --prefix new --all
Thank you
Hi, @mnshgl0110
I am using wfmash alignment to do variant calling from the assmebly. Although
SyRI
support the PAF file since the version 1.6, but the native PAF fromwfmash
produce very different result. The main different from two PAFs areMQ
and some other tags.What tag need for SyRI?
Here is the code I use: