mcfrith / last-genome-alignments

how to merge many maf? #3

Closed SCQUchenyang closed 5 years ago

SCQUchenyang commented 5 years ago

Hi, Sir, I am so thankful for your work on LAST, and I have a question about using it. To save time, I aligned my 10 chromosomes to a reference in parallel, and I got 10 MAF files. What should I do to merge these MAF files? Does "cat" work? Best wishes!

mcfrith commented 5 years ago

Hi, many thanks for your interest in LAST. Yes, I think "cat" is just fine here. Have a nice day, Martin

AlisaGU commented 2 years ago

Hi, did the commands run successfully when you just cat these MAF files? I am worried about multiple alignments to the same region of the reference, and about the order of the alignment blocks.

mcfrith commented 2 years ago

So the recipe uses last-split twice. Doing cat after the 1st last-split, and before the 2nd one, should be completely fine.

AlisaGU commented 2 years ago

Hi, do the 1st and 2nd last-split refer to the following example, taken from the cookbook?

lastdb -P8 -uMAM8 myDB genome1.fa

last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train

lastal -P8 -D1e9 -m100 -p my.train myDB genome2.fa | last-split -fMAF+ > many-to-one.maf

last-split -r many-to-one.maf | last-postmask > out.maf

By the way, I have a big genome with 46,139,523,234 bases and 20131 contigs. Here is the summary of the longest contigs.

ptg000004l      169819904
ptg000441l      158822330
ptg000279l      109104046
ptg000035l      107045360
ptg000669l      100503328
ptg000533l      90735505
ptg000066l      87495606
ptg000800l      85918877
ptg000855l      82319672
ptg000667l      80863498

Is LAST suitable for this very big genome if I split it by chromosome and cat the results?

mcfrith commented 2 years ago

Yes, that's the 1st and 2nd last-split.

Wow, big genome! Is that "genome1" or "genome2"? How big is the other genome?

AlisaGU commented 2 years ago

This big genome is the query. The target is a normal-size genome.

mcfrith commented 2 years ago

I see.

I'm pretty sure LAST can be suitable, but you might want to tune the parameters for higher speed and not-quite-so-high sensitivity. So I would omit -m100 and -uMAM8. For higher speed, maybe replace -uMAM8 with -uRY4 or -uRY8. (The fastest one is -uRY32.) I would probably add -C2 to the lastal options.
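As a rough sketch of that suggestion (one possible combination, not a tested recommendation), the first three cookbook steps might look like this with -uMAM8 swapped for -uRY4, -m100 dropped, and -C2 added. The file names are the cookbook placeholders, and the two variable names and two function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: a faster, less sensitive variant of the cookbook recipe
# quoted earlier in this thread. -uRY4 is just one of the suggested
# choices; -uRY8 and -uRY32 are the other options named above.

LASTDB_OPTS="-P8 -uRY4"        # cookbook used: -P8 -uMAM8
LASTAL_OPTS="-P8 -D1e9 -C2"    # cookbook used: -P8 -D1e9 -m100

# Steps 1 and 2: build the database from genome1, then train on genome2.
build_and_train() {
    lastdb $LASTDB_OPTS myDB genome1.fa
    last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train
}

# Step 3: align genome2 and run the 1st last-split.
align() {
    lastal $LASTAL_OPTS -p my.train myDB genome2.fa | last-split -fMAF+ > many-to-one.maf
}

# Example usage (not run here):
#   build_and_train && align
```

Keeping the options in variables makes it easy to try -uRY8 or -uRY32 instead without touching the commands themselves.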

For step 3 (lastal ... | last-split), it's fine to run the query chromosomes separately and then cat the results of step 3. Whether or not you do that makes no difference to the results.
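A minimal sketch of that workflow, assuming the query has already been split into separate FASTA files. The chunk file names and the three function names are hypothetical, and the lastal options are copied unchanged from the cookbook example quoted in this thread.

```shell
#!/bin/sh
# Sketch only: myDB and my.train come from steps 1 and 2 of the cookbook
# example; chunk1.fa, chunk2.fa, ... are hypothetical per-chromosome files.

# Step 3 for one query chunk: lastal piped into the 1st last-split.
align_chunk() {
    # $1 = query FASTA chunk, $2 = output MAF for that chunk
    lastal -P8 -D1e9 -m100 -p my.train myDB "$1" | last-split -fMAF+ > "$2"
}

# Merge the per-chunk outputs of step 3 with plain cat.
merge_chunks() {
    # $1 = merged output file, remaining arguments = per-chunk MAF files
    merged=$1
    shift
    cat "$@" > "$merged"
}

# Step 4 on the merged file: 2nd last-split, then last-postmask.
finish() {
    # $1 = merged many-to-one MAF, $2 = final output MAF
    last-split -r "$1" | last-postmask > "$2"
}

# Example usage (not run here):
#   for f in chunk1.fa chunk2.fa; do align_chunk "$f" "${f%.fa}.maf"; done
#   merge_chunks many-to-one.maf chunk1.maf chunk2.maf
#   finish many-to-one.maf out.maf
```

The point of the layout is that the merge happens between the two last-split runs, so the 2nd last-split sees all query chunks at once.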

(By the way, --sample-number=5000 was used for highly-diverged genomes, e.g. mammal versus reptile. It may not be necessary if your genomes are less diverged. last-train by default uses a random sample of 500 2kb fragments from genome2.fa. I was worried that might not be enough for genomes with only a small fraction of alignable regions.)

AlisaGU commented 2 years ago

Hi, I have run LAST successfully😀 and will tune the parameters according to your advice. Now I am curious whether to do chain and net steps like the UCSC methods. Do you have any tips?

mcfrith commented 2 years ago

I believe chain-and-net is an alternative to last-split, and last-split is better (but I am biased). Here is a comparison: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0670-9

AlisaGU commented 2 years ago

🥳🥳🥳Thanks, LAST is indeed very fast.