schneebergerlab / syri

Synteny and Rearrangement Identifier
https://schneebergerlab.github.io/syri/
MIT License
323 stars 35 forks source link

Support of wfmash PAF as input #144

Open baozg opened 2 years ago

baozg commented 2 years ago

Hi, @mnshgl0110

I am using wfmash alignment to do variant calling from the assmebly. Although SyRI support the PAF file since the version 1.6, but the native PAF from wfmash produce very different result. The main different from two PAFs are MQ and some other tags.
What tag need for SyRI?

Here is the code I use:

minimap2 -x asm5 --eqx ref.fa query.fa > minimap2.paf
wfmash -t 32 -p 90 -s 10000 ref.fa query.fa > wfmash.paf

syri -c minimap2.paf -r re.fa -q query.fa -F P
syri -c wfmash.paf -r re.fa -q query.fa -F P
mnshgl0110 commented 2 years ago

Apart from the required 12 columns in PAF, syri also requires a column containig cg tag and corresponding CIGAR string (with =/X, and not M; like the output from the --eqx parameter from minimap2). I would guess that if wfmash_PAF has that, then it would work fine.

baozg commented 2 years ago

wfmash did have cg tag but the result is very different. I will check the CIGAR.

minimap2 wfmash
chr09 ch09
83447412 69406105
76599034 6700000
81799482 6760000
+ +
chr9 chr09
69406105 83447412
63131656 6809033
68327197 6862542
5194689 44010
5200998 60000
60 5
tp:A:P gi:f:94.3287
mm:i:302 bi:f:65.6004
gn:i:6007 md:f:94.1134
go:i:215 wt:i:1393
cg:Z:117673=12I7732=1I7809=12I56783=14I8890 pt:i:101
  aa:i:9498
  ap:i:798
  cg:Z:48=1X13=1X10=1X8=1X7=1X7=1X4=1X10=1X34=1X63=1X6=1X4=1X1
ekg commented 2 years ago

The cigar is in the correct format. What other dependencies are there? Mapping quality is not set the same way.

mnshgl0110 commented 2 years ago

I assume you mean requirements for the PAF to be read correctly by syri. Syri only uses alignment coordinates and cigar (from which it then calculates alignment identity). So I would guess that this PAF should work fine. Mapping quality values are not used so it is OK if these values are very different.

baozg commented 2 years ago

I use Col-0 and Ler-0 as test, but minimap2 and wfmash paf have substtational difference.

  1. Command
    
    # minimap2  2.24
    # wfmash 0.8.2
    # syri 1.6 

minimap2 -x asm5 -c -t 5 --eqx TAIR10.fa.gz Ler0.fa.gz > minimap2.paf wfmash -t 32 -p 95 -s 1k TAIR10.fa.gz Ler0.fa.gz > wfmash.paf

syri -c minimap2.paf -q Ler0.fa.gz -r TAIR10.fa.gz -F P --prefix minimap2 syri -c wfmash.paf -q Ler0.fa.gz -r TAIR10.fa.gz -F P --prefix wfmash



2. Synteny plot from [paf2dotplot](https://github.com/waveygang/wfmash/blob/master/scripts/paf2dotplot)
<img width="643" alt="image" src="https://user-images.githubusercontent.com/20680150/168740301-8bfbbdae-8a3e-4b99-9dba-cbd6bf5da7da.png">

3. Summary of syri
- minimap2

<img width="445" alt="image" src="https://user-images.githubusercontent.com/20680150/168743789-4751f9d7-0d92-4ee0-a77e-e54080cb10e9.png">

- wfmash (1k)
<img width="447" alt="截屏2022-05-17 下午2 33 36" src="https://user-images.githubusercontent.com/20680150/168744589-c0d23b00-0eba-4106-b641-7c5390c7d5ef.png">

- wfmash (5k)
ekg commented 2 years ago

@baozg is the amount of sequence in alignments measured from the input PAFs this different?

Is there some filtering in syri that might be kicking out alignments?

baozg commented 2 years ago

From basic awk, the amout aligned region of wfmash is 109Mb, and minimap2 is 108Mb. @mnshgl0110 Maybe SyRI have some filtering step to remove repeat region ?

mnshgl0110 commented 2 years ago

By default, syri filters out alignments with identity <90% and length <100bp in both reference and query. You can turn this filtering off by using the -f parameter.

There is 14MB difference in the length of not aligning regions, so possible that if there are multiple small or low identity alignments generated by wfmash, then they were filtered out.

ekg commented 2 years ago

Is the identity gap compressed or block/blast identity?

On Tue, May 17, 2022, 14:41 Manish Goel @.***> wrote:

By default, syri filters out alignments with identity <90% and length <100bp in both reference and query. You can turn this filtering off by using the -f parameter.

There is 14MB difference in the length of not aligning regions, so possible that if there are multiple small or low identity alignments generated by wfmash, then they were filtered out.

— Reply to this email directly, view it on GitHub https://github.com/schneebergerlab/syri/issues/144#issuecomment-1128820301, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEJYGPTPT2O53YLQNZLVKOHXZANCNFSM5VUJFQNQ . You are receiving this because you commented.Message ID: @.***>

ekg commented 2 years ago

Expanding... I would not expect alignments <100bp. The minimum possible length should be 5x the segment size so around 5kb or 25kb in these tests.

So to be filtering out so much of the alignment, then it would seem that the identity filter is applied. If that's computed as matches/total alignment block then I could see this being rather strict. For instance a 1kb indel in a 10kb alignment would be filtered out. But if gaps are "compressed" in this calculation (counted as a single character irrespective of length) then something else might be going on.

Hypotheses...

baozg commented 2 years ago

Excellent. I turn off the filtering step of SyRI and wfmash result have much more variants than minimap2. Maybe I can tune the parameters of wfmash for variant calling.

image
mnshgl0110 commented 2 years ago

For instance a 1kb indel in a 10kb alignment would be filtered out.

syri would indeed filter out such alignments and based on @baozg's latest results it seems that was the case all along. I will add some warning message to notify when large alignments get filtered out.

@ekg Is it possible to configure wfmash to actually break at such divergent regions and not align these gaps? These large gaps are often result of TE translocation/duplication. Aligning these regions as indels limits selection of alignments corresponding to actual TE movement. As a result, more indels are found and the genetic information that they are translocations/duplications is lost.

ekg commented 2 years ago

You can break PAF on gaps greater than a given length using @mrvollger's rustybam. https://github.com/mrvollger/rustybam

It's interesting to see the other side of this tradeoff. For pangenome graph building we try to obtain the longest syntenic alignments. This can hide some information about translocations and TE movements. Although I do wonder if you wouldn't be able to better identify breakpoints around indels by obtaining a base level alignment across them. Or, by finding the sequence inside the indel and realigning that, looking for multiple alignments. Or, simply by aligning things to a much larger number of candidates. Right now these tests are for only the best mapping, but for looking at TE movement you might want to look at the first N (e.g. 50) and set short segment length (1k seems reasonable for larger TEs).

mnshgl0110 commented 2 years ago

Thanks for sharing rustybam, it looks quite useful. However, though removing the gaps would be helpful, this would not find the alignment corresponding to translocations. For example, I simulated 5 kb and 50 kb translocations and am curious whether wfmash can find alignments corresponding to these by default or would require parameter optimisation. ref.txt seq_up_trans5000.fa.txt seq_up_trans50000.fa.txt I tried to test it myself, however, wfmash (installed using conda) would not run as it complains about missing GSL.

I agree realigning indel regions could result in better breakpoints, I am just not sure how practical/efficient it would be to perform such alignments within syri. Ideally, I would rather prefer alignments generated by full-fledged aligners to be more precise at alignment ends.

From what I understand, in pan-genomes adding large indels is beneficial as that leads to easy to handle bubbles where as translocations and duplications result in problematic loops. So selection of large indels (at the cost of missing translocation/duplication information) is more of a design choice. But I guess that there would be other reasons as well.

ekg commented 2 years ago

Sorry about the conda installation of wfmash. We are aware of that, I think that @AndreaGuarracino may have a solution. This kind of issue is why I hope that fully containerized, and formally complete methods for software building and installation---like guix---become standard! https://github.com/waveygang/wfmash#guix

When making pangenome graphs, I'm often trying to get large-scale 1:1 homology relationships between genomes to be embedded in them. This leads to filtering for long 1:1 mappings in the wfmash mapping step. This is configurable, however. And, I think the way it gets set up is very application dependent so it is difficult to set generic defaults that will work in all contexts. I do want to be able to detect translocations natively in the alignment, and am not aware of issues detecting them should the length of the translocation be sufficient to make a separate mapping.

wfmash alignments are primarily defined by four key parameters, -s, -l, -p, and -n.

-s: segment length. This is the seed length for mapping, and is akin to the kmer length used by minimap2, but several orders of magnitude larger. We make mappings using an extension of mashmap2 (let's call it mashmap3) which includes a second layer of chaining and mapping filtering that is designed to find 1:1 mappings across large SVs. In general, setting -s lower makes the mapping more sensitive to small homologies. In practice, this can be set as low as 500bp even for relatively large genomes. Nevertheless, to reduce noise while getting very sensitive alignments, we use -s 10k by default.

-l: block length min, is a friend of -s which filters mappings shorter than a given block length, which defaults to 5x (default) the length of -s. This may be too aggressive in some contexts (like yours) but in others it is essential to reduce alignment noise. We do not expect a collinear chain of 5 segment mappings to occur very often by chance.

-p: percent identity minimum for segment mappings. This sets a lower threshold on the ANI between segments that we consider "mapped" for the purpose of building larger mappings. This affects a number of parameters in the model, but is designed to be intuitive to use. We set -p 95 which is in general good for genomes of the same species.

-n: number of mappings. By default, this is 1. But, setting it higher allow us to get multimappings (e.g. between two genomes) or N:N mappings in the case of pangenomes.

Thanks for the test case. Here's how I worked out how to get an alignment for the translocation.

First, I saw that the defaults didn't work, and I get a mapping but not the translocation.

wfmash ref.fa.gz seq_up_trans50000.fa >x.paf
pafplot x.paf

image

To attempt to catch the translocation, I set -s 1k.

wfmash ref.fa.gz seq_up_trans50000.fa -s 1k >x.s1k.paf

image

This is one of the configurations that @baozg showed above.

This also catches your second, smaller example.

wfmash ref.fa.gz seq_up_trans5000.fa -s 1k >y.s1k.paf

image

Note that -s 10k (default) doesn't pick up the 50kb example. It is near the edge of what we expect to be able to detect with that segment length. Specifically, it's filtered out with the -l threshold, which is 50k by default. If we set -l 0, then we do pick up the translocation.

wfmash ref.fa.gz seq_up_trans50000.fa -s 5k -l 0 >x.s5kl0.paf

image

baozg commented 2 years ago

Especially for TE-rich genome, like maize. wfmash's speed (30mins ) is much faster than minimap2 (~ 1day) for two accessions comparsion. We don't need to mask TE first, then align it with minimap2.

ekg commented 2 years ago

Btw I made a mistake in my explanation. The segment length default is 10kb, which is pretty long to detect events like this, but would work for larger translocations. I edited the comment to reduce future confusion.

ASBioinfo commented 2 years ago

Hi @mnshgl0110 , In my summary file, structural annotations have values only, but in sequence annotations, all counts are showing 0 only. like attached below

Structural annotations

Variation_type Count Length_ref Length_qry

Syntenic regions 7188 2535240608 2528221404 Inversions 283 2464865 2439444 Translocations 8547 69687524 66748895 Duplications (reference) 955 6718295 - Duplications (query) 6400 - 26763094 Not aligned (reference) 13165 82618600 - Not aligned (query) 21911 - 80606870

Sequence annotations

Variation_type Count Length_ref Length_qry

SNPs 0 0 0 Insertions 0 - 0 Deletions 0 0 - Copygains 0 - 0 Copylosses 0 0 - Highly diverged 0 0 0 Tandem repeats 0 0 0

the command that I used was: syri -c out.filtered.coords -d out.filtered.delta -r reference.fasta -q query.fasta -f --nc 50 --prefix new --all

Thank you