glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

Add genome to alignment #77

Open juansearch opened 7 years ago

juansearch commented 7 years ago

I have 5 mammalian genomes that I am trying to align using progressiveCactus. When I try to run all 5, it takes months on my limited computational resources. However, I managed to run 2 of the genomes and got results in 2 days, and I am now running 3 genomes, and so on. Is there a way to add the 3rd genome to the existing alignment in a way that saves computational time? Assume the tree is ((A,B),C) and I have already aligned genomes "A" and "B". I want to add "C" to the alignment without having to realign A and B. If I understand correctly how progressiveCactus works, this should be doable. Thanks, Juan

joelarmstrong commented 7 years ago

Hmm, that's interesting. The runtime of running (A, B) then adding C should be about the same as the runtime for running ((A,B),C). Internally we do that exact type of progressive addition.

That said, if you need to add a genome C to an alignment and you have (A, B)anc1; (call this alignment 1), where anc1 is whatever the ancestor is named in the first alignment, you can compute the alignment (anc1, C)anc2; (call this alignment 2). Make sure the two alignments use the same name for genome anc1, but have no other name collisions.
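The naming rule above can be checked before stitching. This is a minimal sketch, assuming you have already listed the genome names in each alignment yourself (for example by reading them off each tree). The function name and its inputs are hypothetical and not part of progressiveCactus:

```python
def check_stitchable(alignment1_genomes, alignment2_genomes, shared="anc1"):
    """Return a list of naming problems; an empty list means the rule holds.

    The rule: both alignments contain the shared ancestor under the same
    name, and no other genome name appears in both.
    """
    names1 = set(alignment1_genomes)
    names2 = set(alignment2_genomes)
    problems = []
    if shared not in names1:
        problems.append(f"{shared} missing from alignment 1")
    if shared not in names2:
        problems.append(f"{shared} missing from alignment 2")
    # Any overlap other than the shared ancestor is a collision.
    collisions = sorted((names1 & names2) - {shared})
    if collisions:
        problems.append("name collisions: " + ", ".join(collisions))
    return problems
```

For the example in this thread, check_stitchable(["A", "B", "anc1"], ["anc1", "C", "anc2"]) returns an empty list.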

You will then need the hal file for alignment 2 (alignment2.hal) and the working directory for alignment 1 (work_alignment1). Make a backup of both in case something goes wrong. You can then run cactus2hal.py --append work_alignment1/progressiveAlignment/progressiveAlignment_project.xml alignment2.hal. After that, alignment2.hal should contain A, B, and C aligned together. That is almost exactly how the progressive steps work internally.
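The backup and append steps above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the only external command is the cactus2hal.py invocation quoted above, and the command is built as an argument list rather than executed so you can inspect it first:

```python
import shlex
import shutil
from pathlib import Path


def backup(path, suffix=".bak"):
    """Copy a file or directory next to itself and return the backup path.

    Raises if the backup directory already exists (shutil.copytree default).
    """
    src = Path(path)
    dst = src.with_name(src.name + suffix)
    if src.is_dir():
        shutil.copytree(src, dst)
    else:
        shutil.copy2(src, dst)
    return dst


def append_command(work_dir, hal_file):
    """Build the cactus2hal.py --append command from this thread as argv."""
    project_xml = (
        Path(work_dir) / "progressiveAlignment" / "progressiveAlignment_project.xml"
    )
    return ["cactus2hal.py", "--append", project_xml.as_posix(), str(hal_file)]


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell line."""
    return shlex.join(argv)
```

A typical use would be to call backup("work_alignment1") and backup("alignment2.hal"), print as_shell(append_command("work_alignment1", "alignment2.hal")) to confirm the paths, and then run the printed line by hand.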

rlim19 commented 6 years ago

I am curious about the process. Could you enlighten me, please?

As I understand it, the first alignment, i.e., (A, B) -> anc1, is in the hal format produced by runProgressiveCactus.sh.

So for the subsequent run, i.e., (anc1, C) -> anc2, does anc1 need to be converted from hal into fasta first?

Thanks in advance. I look forward to hearing from you.

Cheers

joelarmstrong commented 6 years ago

Yes, exactly. The reconstructed ancestor needs to be extracted from the hal file into a fasta file. You can run hal2fasta first.hal Anc1 > anc1.fa, assuming the first alignment is in first.hal with root genome Anc1. Then you can use that as a leaf in a subsequent alignment, and stitch the two alignments together afterward.
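The extraction step can be wrapped in a small helper. This is a sketch under assumptions: it presumes hal2fasta is on your PATH and writes the sequence to standard output, as in the command above. The function name and the run parameter (there only so the call can be swapped out for testing) are hypothetical:

```python
import subprocess


def extract_ancestor(hal_path, genome, fasta_path, run=subprocess.run):
    """Equivalent of `hal2fasta <hal_path> <genome> > <fasta_path>`.

    `run` defaults to subprocess.run; pass a stand-in to test without HAL tools.
    """
    with open(fasta_path, "w") as out:
        # check=True raises CalledProcessError if hal2fasta exits non-zero.
        run(["hal2fasta", hal_path, genome], stdout=out, check=True)
    return fasta_path
```

With the names used in this thread, that would be extract_ancestor("first.hal", "Anc1", "anc1.fa"), after which anc1.fa can be given as a leaf in the second alignment.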

malcook commented 6 years ago

I'm delighted to find this thread here and wish for further enlightenment.

My aim is to produce a genome-wide conservation analysis for myNewGenome, recently sequenced and assembled in-house. We have also performed differential ChIP-seq of H3K27ac in myNewGenome as a marker of putative functional enhancers, and seek to prioritize our "peaks" by their relative conservation.

So, I thought a shortcut might be to extend a pre-existing MSA such as multiz30way (http://hgdownload.cse.ucsc.edu/goldenPath/mm9/multiz30way/) by adding in myNewGenome as the new reference and recomputing genome-wide phastCons scores for myNewGenome, which I could slice and dice as I chose to score the peaks.

Multiz30way is available as a MAF and could be converted to hal using maf2hal.

But of course I didn't build it, and it was not even aligned using Progressive Cactus.

So is this approach impossible?

If not, are there any known examples of it being fruitful (i.e., leading to a high-quality MSA and publication)?

Finally, are topics such as this discussed elsewhere (e.g., a Google Groups forum or newsgroup or something)?

Thanks for Progressive Cactus.

glennhickey commented 6 years ago

I'd be very wary of aligning into the MAF-imported hal. With any luck @joelarmstrong has a cactus alignment with these species kicking around that he can share.


malcook commented 6 years ago

Hi,

Thanks for the warning. Any details on why are welcome.

But should I expect it to be technically possible at all?

In fact I don't really want multiz30way. The danRer7 conservation alignment is probably going to suit my needs best. So, @joelarmstrong, if you happen to have reproduced that already in progressiveCactus, it would be much appreciated.

In any case I’m sure I can re-align my own if need be... but having a leg up would be grand.

On a related subject I read in http://gensoft.pasteur.fr/docs/hal/v2.1/hal_README.pdf about “Optional support of PhyloP evolutionary constraint annotation” that you are “working on prototype support for running PhyloP on HAL files”. Is this work in progress still or should I expect to find it solid?

Thanks!

~malcolm_cook@stowers.org

joelarmstrong commented 6 years ago

Hi Malcolm,

Adding another genome into the maf2hal HAL isn't really possible, because Cactus relies on reconstructed ancestors as a sort of consensus sequence for progressive alignment. The "ancestor" that maf2hal creates isn't a consensus sequence, so using it in this sort of way would lead to pretty poor results if it works at all.

I don't have one with those exact genomes. I have a fish alignment with these genomes:

if that'd be helpful?

The phyloP stuff works and is pretty stable. We've also got a pipeline somewhere to run phastCons.

malcook commented 6 years ago

@joelarmstrong - understood - makes perfect sense. Thanks.

I expect I will produce a home-brew alignment of fewer fish, adding in stickleback + human + mouse, but I may take you up on your offer later.

However I am interested in learning what constitutes your phastCons pipeline if you'd be able to dig that out.

Thanks again

I note there is an underutilized Google Groups forum for Cactus at https://groups.google.com/forum/#!forum/cactususers and have moved a few new questions over there.

https://groups.google.com/d/msg/cactususers/4A0Uh2FKlvM/dBhMORBTAgAJ

https://groups.google.com/d/msg/cactususers/4Lz0-IGybLQ/E_DYSipTAgAJ

malcook commented 6 years ago

Glenn,

On second consideration, having the fish alignment would be a grand contribution to our effort. I would gladly retrieve it by means of your choice, or you can put it on ftp://ftp.stowers.org/incoming/ . If you have an accompanying .mod, that would be great, as would any recommendations regarding settings for phyloP or phastCons analysis. But having just the .hal would be great.

By the way, our additional fish is killifish. If you happen to know that UCSC is working on it in preview, I’d be much obliged to know too.

Thanks again for your offer!

~malcolm_cook@stowers.org


malcook commented 6 years ago

Hi @joelarmstrong,

I addressed my prior follow-up to your kind reply not to you but mistakenly to Glenn. My apologies.

If the offer still stands to review your Progressive Cactus fish alignment, to which I might progressively align an additional teleost genome, I would be obliged.

Our additional genome is Turquoise Killifish. I see that UCSC has some plans around Killifish... if there are resources in preview you can alert me to, I’d be much obliged to know too.

Any accompanying .mod would be great, along with any recommendations regarding settings for phyloP or phastCons analysis. But having just the .hal would be great.

I would gladly retrieve any of this by means of your choice, or you can send it to ftp://ftp.stowers.org/incoming/ and let me know its name.

Thanks again for your offer, and I understand if time does not permit. Please advise!

~malcolm_cook@stowers.org

malcook commented 5 years ago

@joelarmstrong - hi - checking in again about the availability of MSAs involving zebrafish. Are there any updates to what you offered earlier (possibly based on danRer11, or including any additional outgroups)? If not, I'd still be interested in what you offered above. Thanks!