glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

Modify Cactus to use an aligner different from LASTZ #94

Open lucventurini opened 6 years ago

lucventurini commented 6 years ago

Currently, progressiveCactus takes a long time to run (120 * (n-1) CPU days for n genomes, according to another thread here) and therefore requires a cluster. We cannot use one at my institution because our scheduler, SLURM, is unsupported. An alternative would be to use a faster aligner than LASTZ, to make all steps in the pipeline quicker. Would that be feasible? What would the requirements be for a substitute aligner, e.g. in terms of output format and alignment options? It would also be useful to know which parts of the progressiveCactus code run the alignment and parse its output. I am starting to read the source code, but I am not very familiar with luigi-like pipelines, so some pointers would greatly help me understand how the pipeline works!

joelarmstrong commented 6 years ago

Hi Luca,

Yeah, you will need a cluster for pretty much any alignment larger than a few hundred megabases. We have a new beta version of Cactus (not under this repository) that uses a different job scheduling system, Toil, which supports SLURM, at least in theory :). The beta version is available at https://github.com/ComparativeGenomicsToolkit/cactus.

We've tried replacing LASTZ with LAST before but decided against even keeping it as an option, because we saw a large drop in sensitivity for highly diverged genomes (which might easily have been our fault and not LAST's). What aligner were you thinking of replacing LASTZ with? Depending on how similar your genomes are to one another, a drop in sensitivity might not hurt much.

Replacing LASTZ would be pretty tricky. The main requirement we have for the aligner is that it outputs its alignments in CIGAR format. If you wanted to go down that path, you would probably be better off doing it in a hacky way: replace Cactus's cactus_lastz executable with a script that takes in the arguments intended for LASTZ, parses out the FASTA paths, runs another aligner, converts the output to CIGAR format, and spits out the CIGARs to stdout just like LASTZ would.
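To make the wrapper idea a bit more concrete, here is a minimal sketch of two pieces such a script would need: picking the FASTA paths out of the LASTZ-style arguments, and rewriting a SAM/PAF-style CIGAR string (e.g. `10M2I3D`) as the space-separated operation list used in exonerate-style `cigar:` lines (e.g. `M 10 I 2 D 3`). The function names, the `swap_indels` parameter, and the argument-filtering rule are all hypothetical and not part of Cactus or LASTZ. Running the replacement aligner, converting coordinates and strands, and deciding whether `I` and `D` mean the same thing in both conventions are left out and would need checking against what Cactus actually expects.

```python
import re

# SAM/PAF-style CIGAR: the run length precedes the operation letter ("10M2I3D").
_CIGAR_RE = re.compile(r"(\d+)([MIDX=])")


def fasta_paths(argv):
    """Return the arguments that look like sequence files.

    ASSUMPTION: options start with "-" and sequence files are positional,
    possibly followed by a bracketed modifier such as "target.fa[multiple]".
    Check how your Cactus version really invokes cactus_lastz.
    """
    paths = []
    for arg in argv:
        if arg.startswith("-"):
            continue  # an option, not a sequence file
        paths.append(arg.split("[", 1)[0])  # drop any trailing [modifier]
    return paths


def to_exonerate_ops(cigar, swap_indels=False):
    """Rewrite "10M2I3D" as "M 10 I 2 D 3".

    "=" and "X" are folded into "M", and adjacent runs of the same
    operation are merged. Set swap_indels=True if the two tools turn out
    to use opposite meanings for I and D (ASSUMPTION: verify this).
    """
    if not cigar or _CIGAR_RE.sub("", cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    swap = {"I": "D", "D": "I"}
    runs = []
    for length, op in _CIGAR_RE.findall(cigar):
        if op in "=X":
            op = "M"
        elif swap_indels:
            op = swap.get(op, op)
        if runs and runs[-1][0] == op:
            runs[-1][1] += int(length)  # merge with the previous run
        else:
            runs.append([op, int(length)])
    return " ".join(f"{op} {n}" for op, n in runs)
```

For example, `to_exonerate_ops("10M2I3D")` returns `"M 10 I 2 D 3"`, and `fasta_paths(["--format=cigar", "a.fa[multiple]", "b.fa"])` returns `["a.fa", "b.fa"]`. Strings containing clipping or other operations raise a `ValueError` rather than being silently mangled.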

I think that admittedly hacky approach would almost certainly be quicker than trying to modify Cactus's Python pipeline. If you wanted to take a stab at it, though, most of the LASTZ work is done in the RunSelfBlast and RunBlast jobs (https://github.com/ComparativeGenomicsToolkit/cactus/blob/7af8f26e6e1d0f0ec19cdf0941766763bb948859/src/cactus/blast/blast.py#L389).

lucventurini commented 6 years ago

OK, thanks! I was thinking of using Minimap2 (https://github.com/lh3/minimap2) because in my tests on Arabidopsis, minimap2 + CESAR2 beat progressiveCactus + CAT. Minimap2 is not perfect for my purpose, though, so I am looking into expanding its scope by dropping it into Cactus.

It should not be too difficult if the only requirement is producing CIGAR output. I will hack a bit and see what comes of it.

And thanks for pointing me to the beta; being able to leverage our cluster will be useful no matter what!

malcook commented 6 years ago

Hello Joel,

I read about this beta version of cactus with interest.

Can you outline how its planned capabilities relate to those of progressiveCactus? In particular, is it ‘not progressive’?

I have been having some scheduler-related issues with progressiveCactus, and wonder what, if anything, I might lose if I move to cactus, which by your comments seems to be actively improving scheduler support.

Also, I am having some difficulty building cactus (under CentOS 7), getting the following in a submodule, and wonder what advice you may be able to offer…

/home/mec/local/src/cactus/submodules/sonLib/lib/sonLib.a(stThreadPool.o): In function `stThreadPool_construct':
/home/mec/local/src/cactus/submodules/sonLib/C/impl/stThreadPool.c:144: undefined reference to `pthread_create'
/home/mec/local/src/cactus/submodules/sonLib/C/impl/stThreadPool.c:144: undefined reference to `pthread_create'
/home/mec/local/src/cactus/submodules/sonLib/C/impl/stThreadPool.c:144: undefined reference to `pthread_create'
/home/mec/local/src/cactus/submodules/sonLib/C/impl/stThreadPool.c:144: undefined reference to `pthread_create'
/home/mec/local/src/cactus/submodules/sonLib/C/impl/stThreadPool.c:144: undefined reference to `pthread_create'
/home/mec/local/src/cactus/submodules/sonLib/lib/sonLib.a(stThreadPool.o):/home/mec/local/src/cactus/submodules/sonLib/C/impl/stThreadPool.c:144: more undefined references to `pthread_create' follow
/home/mec/local/src/cactus/submodules/sonLib/lib/sonLib.a(stThreadPool.o): In function `stThreadPool_destruct':
/home/mec/local/src/cactus/submodules/sonLib/C/impl/stThreadPool.c:195: undefined reference to `pthread_join'
collect2: error: ld returned 1 exit status
make[1]: [../bin//stCafTests] Error 1
make: [all.caf] Error 2

Thanks for your consideration,

edit: my workaround/resolution is reported as "-lpthread apparently absent from cactus/caf/Makefile" here: https://github.com/ComparativeGenomicsToolkit/cactus/issues/19

I remain interested in the cactus vs. progressiveCactus topic.

Thanks!