glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

Unable to run example with minimal install #101

Closed robsyme closed 6 years ago

robsyme commented 6 years ago

Hi

I'm trying to install and run progressiveCactus in a container to improve the reproducibility of my results and to include the software in a genome annotation pipeline.

Using a fairly minimal Dockerfile:

FROM ubuntu:16.04

MAINTAINER Rob Syme <rob.syme@gmail.com>

RUN apt-get update \
&& apt-get install -qqy \
git \
wget \
unzip \
build-essential

# Install ProgressiveCactus
RUN apt-get install -qqy \
 python \
 python-dev \
 python-numpy

WORKDIR /usr/local

# Install progressiveCactus
RUN ln -s /usr/lib/python2.7/plat-*/_sysconfigdata_nd.py /usr/lib/python2.7/
RUN git clone git://github.com/glennhickey/progressiveCactus.git \
&& cd progressiveCactus \
&& git checkout tags/0.1 -b 0.1 \
&& git submodule update --init
RUN cd progressiveCactus && make

ENV PYTHONPATH /usr/local/progressiveCactus/submodules
ENV PATH /bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/local/progressiveCactus/bin:/usr/local/progressiveCactus/submodules/kentToolBinaries

This installs, but running the example gives the following errors:

# cd progressiveCactus/
# source environment
(python)# bin/runProgressiveCactus.sh examples/blanchette00.txt ./work ./work/b00.hal

Beginning Alignment
Error: Command: jobTreeStatus --failIfNotComplete --jobTree ./work/jobTree > /dev/null 2>&1  exited with non-zero status 1

Temporary data was left in: ./work
More information can be found in ./work/cactus.log

Looking at the logs suggests that jobTree is expecting to find a particular file, but runs into an OSError: [Errno 2] No such file or directory

# tail -n 50  ./work/cactus.log
Got message from job at time: 1516334120.73 : Cumulative coverage of 1 outgroups on ingroup COW_8: 60.5300136917
Got message from job at time: 1516334120.73 : Cumulative coverage of 1 outgroups on ingroup PIG_7: 70.1055740933
Got message from job at time: 1516334120.73 : Coverage on COW_8 from outgroup #2, CAT_6: 25.2618574421% (current ingroup length 18617, untrimmed length 55508). Outgroup trimmed to 34588 bp from 50283
Got message from job at time: 1516334120.73 : Coverage on PIG_7 from outgroup #2, CAT_6: 31.8123116659% (current ingroup length 11615, untrimmed length 54843). Outgroup trimmed to 34588 bp from 50283
Got message from job at time: 1516334120.73 : Cumulative coverage of 2 outgroups on ingroup COW_8: 68.4910283202
Got message from job at time: 1516334120.73 : Cumulative coverage of 2 outgroups on ingroup PIG_7: 76.5184253232
Got message from job at time: 1516334120.73 : Coverage on COW_8 from outgroup #3, HUMAN_0: 5.11479881791% (current ingroup length 13197, untrimmed length 55508). Outgroup trimmed to 17113 bp from 57553
Got message from job at time: 1516334120.73 : Coverage on PIG_7 from outgroup #3, HUMAN_0: 19.4097460535% (current ingroup length 7285, untrimmed length 54843). Outgroup trimmed to 17113 bp from 57553
Got message from job at time: 1516334120.73 : Cumulative coverage of 3 outgroups on ingroup COW_8: 69.5719535923
Got message from job at time: 1516334120.73 : Cumulative coverage of 3 outgroups on ingroup PIG_7: 78.9526466459
The job seems to have left a log file, indicating failure: /usr/local/progressiveCactus/work/jobTree/jobs/t0/job
Reporting file: /usr/local/progressiveCactus/work/jobTree/jobs/t0/log.txt
log.txt:        ---JOBTREE SLAVE OUTPUT LOG---
log.txt:        Traceback (most recent call last):
log.txt:          File "/usr/local/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt:            defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt:          File "/usr/local/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 153, in execute
log.txt:            self.target.run()
log.txt:          File "/usr/local/progressiveCactus/submodules/cactus/pipeline/ktserverJobTree.py", line 139, in run
log.txt:            killPingInterval=self.runTimestep)
log.txt:          File "/usr/local/progressiveCactus/submodules/cactus/pipeline/ktserverControl.py", line 130, in runKtserver
log.txt:            raise e
log.txt:        OSError: [Errno 2] No such file or directory
log.txt:        Exiting the slave because of a failed job on host 0c72ee4d1b74
log.txt:        Due to failure we are reducing the remaining retry count of job /usr/local/progressiveCactus/work/jobTree/jobs/t0/job to 0
log.txt:        We have set the default memory of the failed job to 34359738368 bytes
Job: /usr/local/progressiveCactus/work/jobTree/jobs/t0/job is completely failed
The job seems to have left a log file, indicating failure: /usr/local/progressiveCactus/work/jobTree/jobs/t1/job
Reporting file: /usr/local/progressiveCactus/work/jobTree/jobs/t1/log.txt
log.txt:        ---JOBTREE SLAVE OUTPUT LOG---
log.txt:        Traceback (most recent call last):
log.txt:          File "/usr/local/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt:            defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt:          File "/usr/local/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 153, in execute
log.txt:            self.target.run()
log.txt:          File "/usr/local/progressiveCactus/submodules/cactus/pipeline/ktserverJobTree.py", line 172, in run
log.txt:            self.blockTimeout, self.blockTimestep)
log.txt:          File "/usr/local/progressiveCactus/submodules/cactus/pipeline/ktserverControl.py", line 225, in blockUntilKtserverIsRunnning
log.txt:            killSwitchPath):
log.txt:          File "/usr/local/progressiveCactus/submodules/cactus/pipeline/ktserverControl.py", line 291, in __isKtServerRunning
log.txt:            killSwitchPath)
log.txt:          File "/usr/local/progressiveCactus/submodules/cactus/pipeline/ktserverControl.py", line 206, in __readStatusFromSwitchFile
log.txt:            raise RuntimeError("Ktserver polling detected fatal error")
log.txt:        RuntimeError: Ktserver polling detected fatal error
log.txt:        Exiting the slave because of a failed job on host 0c72ee4d1b74
log.txt:        Due to failure we are reducing the remaining retry count of job /usr/local/progressiveCactus/work/jobTree/jobs/t1/job to 0
log.txt:        We have set the default memory of the failed job to 2147483648 bytes
Job: /usr/local/progressiveCactus/work/jobTree/jobs/t1/job is completely failed

2018-01-19 03:55:30.827372: Finished Progressive Cactus Alignment

The exception is being raised from __readStatusFromSwitchFile. It looks like jobTree reads the file at killSwitchPath (something like work/jobTree/jobs/gTD9/tmp_u5ZrcIvgWc/tmp_j5YFSaAXyD_kill.txt), trying to pull out the port, serverPid, etc. In all of my failed runs, the file at killSwitchPath looks like:

-1
-1
-1

These values, when read, raise the exception.

I think these are being written as a flag when runKtserver spins the server up. Somewhere in there, an exception is being thrown, which is caught by the blanket except Exception as e.

Does anybody have any idea why the ktserver isn't being spun up correctly? Any pointers to help me debug?
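
For anyone retracing this, the kill-switch files can be inspected straight from the work directory; the *_kill.txt name below is just taken from the path quoted above, so adjust it for your own run:

# find ./work/jobTree -name '*_kill.txt' -exec cat {} +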

joelarmstrong commented 6 years ago

Hey Robert,

The kill-switch file thing is a horrendous (though necessary at the time) hack to fit a database into jobTree's model. Basically, there is one job to run the DB, and another to actually wait till it starts up properly: the -1s are written as a flag to communicate that the ktserver died while starting. So the interesting part is actually the OSError raised earlier.

Sadly the original traceback is swallowed by the re-raise, but I think the most likely source of that OSError is that the 'ktserver' executable is missing or not in the PATH. I'm not exactly sure why that would be the case--everything looks fine to me--but I'll try building from your Dockerfile and see if I can help debug.
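
In the meantime, a quick sanity check inside the container (assuming the install prefix from the Dockerfile above) is to confirm the binary actually resolves once the environment is sourced:

# cd /usr/local/progressiveCactus
# source environment
(python)# command -v ktserver || echo "ktserver not on PATH"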

robsyme commented 6 years ago

Superstar. I'll also try chasing ktserver down from my end.

As long as I source environment, ktserver is in the $PATH. The problem looks like it's during __validateKtserver.

joelarmstrong commented 6 years ago

The OSError actually happens when trying to run ping. I don't think it crossed anyone's mind that ping wouldn't be present, but it actually makes sense that it wouldn't be in a minimal Docker image. If you add iputils-ping to your list of packages to install, you should be good to go.
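
In the Dockerfile above, that's just one more package in the apt-get install list, e.g.:

RUN apt-get install -qqy \
 iputils-ping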

PS: we are starting to transition over to a Toil version of cactus, available at https://github.com/ComparativeGenomicsToolkit/cactus. It unwinds a few (not enough!) of the hacky bits, because Toil supports a bunch of features that jobTree didn't. But if you can get progressiveCactus to work, that's good too!

robsyme commented 6 years ago

Gar, you just beat me to it! Nothing like building containers to get a good handle on the dependency assumptions ;)

Thanks for your help

robsyme commented 6 years ago

Hilariously, you also need time installed: not as a shell keyword (the bash default) but as a binary, so that sh can run it. It's getting ridiculous now, and I don't mean to be a pedant, but I thought you (and those reading this in the future) might want to know.
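
For completeness, the corresponding Dockerfile change is one more package (on Ubuntu the standalone /usr/bin/time binary is provided by the time package), alongside the ping fix above:

RUN apt-get install -qqy \
 iputils-ping \
 time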

joelarmstrong commented 6 years ago

Thanks! Yep, there are a lot of hidden assumptions about what we can expect a "minimal" unix install to have--but those assumptions are getting broken as people squeeze down these popular base Docker images.