Alignment options for distant species,

ohdongha commented 1 year ago

Hi, Bogdan and Michael @kirilenkobm @MichaelHiller,

Thanks again for developing this pipeline and also for responding to our requests. We have successfully used the pipeline for species pairs with MASH distances ranging from 0.1 to 0.27, with the default parameters.

In the previous ticket #10, the issue was runtime and RAM usage for close species (e.g., MASH distance <0.1, such as human vs. primates). How about more distant species pairs, e.g., human vs. zebrafish with MASH distance 0.29?

The default make_lastz_chains parameters are already quite sensitive. But I'd like to know if there is a way to tweak it even more for distant species, aiming to retrieve as much synteny information for orthologs as possible.

UCSC has these example parameter sets:

chainNear="-minScore=5000 -linearGap=medium"
chainMedium="-minScore=3000 -linearGap=medium"
chainFar="-minScore=5000 -linearGap=loose"
lastzNear="C=0 E=150 H=0 K=4500 L=3000 M=254 O=600 Q=/scratch/data/blastz/human_chimp.v2.q T=2 Y=15000"
lastzMedium="C=0 E=30 H=0 K=3000 L=3000 M=50 O=400 T=1 Y=9400"
lastzFar="C=0 E=30 H=2000 K=2200 L=6000 M=50 O=400 Q=/scratch/data/blastz/HoxD55.q T=2 Y=3400"

make_lastz_chains can accept different K, L, H, and Y options, and as Michael pointed out in an email with us, K and L might be the most important. I will try with K=2200 and L=6000 ( make_lastz_chains default: K=2400 and L=3000)

Could you also add options to control the following:

the "-minScore" for the chaining step (I think make_lastz_chains already uses -linearGap=loose as default).
the Q parameter, i.e., the matrix file, so that I can test if using HoxD55.q would make a difference.

Plus, I would appreciate any other suggestions for distant species, with the aim of retrieving as much synteny information for orthologs as possible.

Thanks! Dong-Ha

MichaelHiller commented 1 year ago

I don't know what Mash distances are, as I like to think in terms of 'subs. per neutral site'. Anyway, human-zebrafish has a distance of >2 subs. per neutral site (Fig 1 in https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkt557) and whole genome alignment is kind of not so useful over such distances as only coding exons and a few thousand highly conserved CNEs will align.

Our default pipeline that we also use for much more closely-related species has already very sensitive parameters (lineageGap = loose, as you pointed out). One could reduce L to 2200 and HoxD55 is better suited for larger distances.

It shouldn't be difficult for Bogdan to add a parameter for the matrix file, but maybe it is faster if you try to add that yourself (he can then git pull your code).

ohdongha commented 1 year ago

It shouldn't be difficult for Bogdan to add a parameter for the matrix file, but maybe it is faster if you try to add that yourself (he can then git pull your code).

I realize that due to the backward compatibility of LASTZ with BLASTZ and the way the BLASTZ_X parameters are parsed, there is no need to change the script.

If I want to add Q=HoxD55.q then I can simply add a line BLASTZ_Q=HoxD55.q to the DEF file. I guess any single letter BLASTZ parameter can be transferred similarly.

...

Now I am looking into if I need to also modify the axtChain -minScore=N option (default=1000).

To better reveal the (visual) synteny between distant species pairs, it may be beneficial to allow chaining more loosely. Will decreasing axtChain -minScore value (to, say, 500 or 800) result in longer chains? @MichaelHiller Have you played with this parameter?

Thanks! Dong-Ha

MichaelHiller commented 1 year ago

Glad to hear that the matrix can be easily added via the DEF file.

I wouldn't lower the min chain score. In fact, random alignments often result in chain that score higher. Most chains < 1000 (likely < 5000) are random. Real alignments will have scores of 100000 and more.

ohdongha commented 1 year ago

I wouldn't lower the min chain score. In fact, random alignments often result in chain that score higher. Most chains < 1000 (likely < 5000) are random. Real alignments will have scores of 100000 and more.

I see. This is good to know. Then, will increasing axtChain -minScore to 5000 result in cleaner chains? I have seen an alignment (human-zebrafish) using 5000 resulted in (after post-processing by both UCSC methods and ours) a higher coverage by "reciprocally best" alignments. (I wondered why but now it begins to make sense :) )

Thanks! Dong-Ha

MichaelHiller commented 1 year ago

Probably, but such chains are highly unlikely to be relevant for TOGA (unless you have a very fragmented assembly and TOGA tries to assemble the gene from orthologous fragments).

ohdongha commented 1 year ago

We are now using both BLASTZ_Q=HoxD55.q and axtChain -minScore 5000 (by modifying to $chainMinScore = "5000" in doLastzChains/doLastzChain.pl) for more distant species pairs.

BLASTZ_Q=HoxD55.q detects more alignments covering CDS (and also intergenic sequences) but also tends to produce more fragmented alignments after post-processing.
chainMinScore=5000 alleviates the fragmentation of alignments while only slightly reducing the alignment coverage on CDS, etc.

So these two worked together well (for now) ;)

Thanks again for your help, and I will close this ticket.

hillerlab / make_lastz_chains

Alignment options for distant species, #12