Closed ohdongha closed 1 year ago
I don't know what Mash distances are, as I like to think in terms of 'subs. per neutral site'. Anyway, human-zebrafish has a distance of >2 subs. per neutral site (Fig. 1 in https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkt557), and whole-genome alignment is of limited use over such distances, as only coding exons and a few thousand highly conserved CNEs will align.
Our default pipeline, which we also use for much more closely related species, already has very sensitive parameters (linearGap=loose, as you pointed out). One could reduce L to 2200, and HoxD55 is better suited for larger distances.
It shouldn't be difficult for Bogdan to add a parameter for the matrix file, but maybe it is faster if you try to add that yourself (he can then git pull your code).
I realize that due to the backward compatibility of LASTZ with BLASTZ and the way the BLASTZ_X parameters are parsed, there is no need to change the script.
If I want to add Q=HoxD55.q then I can simply add a line BLASTZ_Q=HoxD55.q to the DEF file. I guess any single letter BLASTZ parameter can be transferred similarly.
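To illustrate the idea, here is a minimal Python sketch of how single-letter BLASTZ_X lines in a DEF file could be turned into LASTZ-style X=value options. This is not the pipeline's actual parsing code (which lives in the Perl script); the function name `lastz_options_from_def` and the example DEF fragment are hypothetical, and the sketch assumes one `BLASTZ_X=value` assignment per line.

```python
import re


def lastz_options_from_def(def_text):
    """Collect BLASTZ_<letter>=value lines and return 'X=value' options.

    Hypothetical helper for illustration only; it is not part of
    make_lastz_chains.
    """
    options = []
    for line in def_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Only single-letter BLASTZ parameters are picked up.
        match = re.fullmatch(r"BLASTZ_([A-Z])=(\S+)", line)
        if match:
            letter, value = match.groups()
            options.append(f"{letter}={value}")
    return options


# Hypothetical DEF fragment; the values are the ones mentioned in this thread.
example_def = """\
# alignment parameters
BLASTZ_K=2400
BLASTZ_L=3000
BLASTZ_Q=HoxD55.q
SEQ1_CHUNK=175000000
"""

print(" ".join(lastz_options_from_def(example_def)))
```

Under these assumptions, the new BLASTZ_Q line is picked up in the same way as K and L, while other DEF variables are ignored.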
...
Now I am looking into whether I also need to modify the axtChain -minScore=N option (default=1000).
To better reveal the (visual) synteny between distant species pairs, it may be beneficial to allow chaining more loosely. Will decreasing the axtChain -minScore value (to, say, 500 or 800) result in longer chains? @MichaelHiller Have you played with this parameter?
Thanks! Dong-Ha
Glad to hear that the matrix can be easily added via the DEF file.
I wouldn't lower the min chain score. In fact, random alignments often result in chains that score higher. Most chains < 1000 (likely < 5000) are random. Real alignments will have scores of 100000 and more.
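For anyone who wants to see what a score threshold does to an existing chain file, here is a rough Python sketch that keeps only chains whose header score reaches a cutoff. It relies on the UCSC chain format, where each chain starts with a header line `chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`. The function name `filter_chains` and the toy chains are made up for illustration, and this is a post hoc filter, not what axtChain -minScore does internally.

```python
def filter_chains(lines, min_score):
    """Yield the lines of every chain whose header score is >= min_score.

    Hypothetical helper; 'lines' is any iterable of chain-file lines.
    """
    keep = False
    for line in lines:
        if line.startswith("chain "):
            # Header fields: chain score tName tSize tStrand tStart tEnd ...
            keep = float(line.split()[1]) >= min_score
        if keep:
            yield line


# Toy input (made-up coordinates): one low-scoring and one high-scoring chain.
example = [
    "chain 800 chr1 1000 + 0 100 chrA 1000 + 0 100 1\n",
    "100\n",
    "\n",
    "chain 120000 chr1 1000 + 200 700 chrA 1000 + 300 800 2\n",
    "500\n",
    "\n",
]

kept = list(filter_chains(example, 5000))
print("".join(kept), end="")
```

With a cutoff of 5000, only the second toy chain survives; comparing the output at 1000 and 5000 on a real file gives a quick sense of how many low-scoring chains are involved.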
I see. This is good to know. Then, will increasing axtChain -minScore to 5000 result in cleaner chains? I have seen a human-zebrafish alignment in which using 5000 resulted in (after post-processing by both UCSC methods and ours) higher coverage by "reciprocally best" alignments. (I wondered why, but now it begins to make sense :) )
Thanks! Dong-Ha
Probably, but such chains are highly unlikely to be relevant for TOGA (unless you have a very fragmented assembly and TOGA tries to assemble the gene from orthologous fragments).
We are now using both BLASTZ_Q=HoxD55.q and axtChain -minScore 5000 (by changing the setting to $chainMinScore = "5000" in doLastzChains/doLastzChain.pl) for more distant species pairs.
BLASTZ_Q=HoxD55.q detects more alignments covering CDS (and also intergenic sequences) but also tends to produce more fragmented alignments after post-processing. chainMinScore=5000 alleviates the fragmentation of alignments while only slightly reducing the alignment coverage on CDS, etc. So these two worked together well (for now) ;)
Thanks again for your help, and I will close this ticket.
Hi, Bogdan and Michael @kirilenkobm @MichaelHiller,
Thanks again for developing this pipeline and also for responding to our requests. We have successfully used the pipeline for species pairs with MASH distances ranging from 0.1 to 0.27, with the default parameters.
In the previous ticket #10, the issue was runtime and RAM usage for close species (e.g., MASH distance <0.1, such as human vs. primates). How about more distant species pairs, e.g., human vs. zebrafish with MASH distance 0.29?
The default make_lastz_chains parameters are already quite sensitive. But I'd like to know if there is a way to tweak it even more for distant species, aiming to retrieve as much synteny information for orthologs as possible.
UCSC has these example parameter sets:
make_lastz_chains can accept different K, L, H, and Y options, and as Michael pointed out in an email with us, K and L might be the most important. I will try with K=2200 and L=6000 (make_lastz_chains default: K=2400 and L=3000).
Could you also add options to control the following:
(make_lastz_chains already uses -linearGap=loose as default).
Plus, I would appreciate any other suggestions for distant species, with the aim of retrieving as much synteny information for orthologs as possible.
Thanks! Dong-Ha