make_Lastz on Cactus-447-mammalian-genome dataset

hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz

MIT License


make_Lastz on Cactus-447-mammalian-genome dataset #69

Open KabitaBaral1 opened 3 weeks ago

KabitaBaral1 commented 3 weeks ago

Hi, I have a question about running LASTZ as was done for the TOGA paper. I have the Cactus 447-mammal genome dataset. I converted it from HAL to FASTA and removed the ancestral sequences, so I now have two FASTA files: one with just the human genome sequence and another with the remaining 446 mammalian genomes combined. Can I run make_lastz_chains with that combined file as the query FASTA? Thank you.

MichaelHiller commented 3 weeks ago

Good question. I think there is no point in extracting the genomic FASTA sequences from the Cactus alignment and then aligning them to human again to get chains. If you want to do that, you can also just start with the full genomes of these species.

But I guess the best option would be to extract pairwise alignments (in chain format) from the Cactus alignment. This should hopefully be possible, but how to do it is a question that should please be directed to Benedict Paten and the Cactus developers.

KabitaBaral1 commented 2 weeks ago

Hi Michael,

Thank you for getting back to me. I have a couple of follow-up questions. I am trying to run LASTZ and then TOGA to get the coordinates of protein-coding regions for all 447 mammalian genomes in the Cactus dataset. I thought that, as in your TOGA paper, the approach would be to run LASTZ and then TOGA on the dataset. Is there a better way to do this, or an alternative?

You wrote: "If you want to do that, you can also start with the full genomes of these species." Could you please elaborate on this?

Thank you

MichaelHiller commented 2 weeks ago

Hi,

The coordinates of all orthologs that TOGA found are in the BED or GTF files we provided. If this is what you need, you don't have to run anything.

If you have new genomes, then the easiest approach is to align them to a reference using our lastz/chain pipeline and then run TOGA.
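For readers who hold several query genomes in one combined FASTA, as in the question above: the alignments here are pairwise (one query genome against one reference), so the file would likely need to be split per genome first. Below is a minimal sketch of such a split. It is not part of make_lastz_chains; the function name `split_fasta_by_genome` is hypothetical, and it assumes every header has the form `>genome.sequence` (for example `>mm10.chr1`), which may not match how a given FASTA was exported.

```python
from pathlib import Path


def split_fasta_by_genome(fasta_path, out_dir, sep="."):
    """Split a combined FASTA into one file per genome.

    ASSUMPTION (not from the thread): each header is '>genome<sep>sequence',
    e.g. '>mm10.chr1'. Returns a dict mapping genome name -> output path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    current_genome = None
    out = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    fields = line[1:].split()
                    header = fields[0] if fields else ""
                    genome, found, seq_name = header.partition(sep)
                    if not found or not genome or not seq_name:
                        raise ValueError(
                            f"header is not '<genome>{sep}<sequence>': {header!r}"
                        )
                    if genome != current_genome:
                        # Keep a single file open at a time, so hundreds of
                        # genomes do not exhaust the open-file limit.
                        if out is not None:
                            out.close()
                        mode = "a" if genome in paths else "w"
                        paths.setdefault(genome, out_dir / f"{genome}.fa")
                        out = open(paths[genome], mode)
                        current_genome = genome
                    # Drop the genome prefix so sequence names stay short.
                    out.write(f">{seq_name}\n")
                elif out is not None:
                    # Sequence line: copy through to the current genome file.
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Calling `split_fasta_by_genome("queries.fa", "per_genome")` (both paths hypothetical) would give one `.fa` file per genome, and each file could then be the query in its own pipeline run against the reference. The exact pipeline invocation is deliberately left out here; the repository README is the place to check it.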

Hope this helps