markziemann / dee2

Digital Expression Explorer 2 (DEE2): a repository of uniformly processed RNA-seq data
http://dee2.io
GNU General Public License v3.0

Docker build fails at RUN pip3 install parallel-fastq-dump #98

Closed maciejmotyka closed 2 years ago

maciejmotyka commented 2 years ago

Hi Mark,

I was trying to rebuild the Docker image from scratch and it fails. I think it's because pip upgrades itself to a version that is no longer compatible with the Python 3.5 that ships with Ubuntu 16.04.

pip dropped support for Python 3.5 in version 21.0; see the changelog: https://pip.pypa.io/en/stable/news/#v21-0
This thread reports a similar error message: https://stackoverflow.com/questions/65869296/installing-pip-is-not-working-in-python-3-6

Below are selected relevant lines from docker build output:

Step 22/34 : RUN pip3 install --upgrade pip
(...)
Successfully installed pip-22.0.4
(...)
Step 23/34 : RUN pip3 install parallel-fastq-dump
 ---> Running in 4fd16f79fd58
Traceback (most recent call last):
  File "/usr/local/bin/pip3", line 7, in <module>
    from pip._internal.cli.main import main
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/cli/main.py", line 57
    sys.stderr.write(f"ERROR: {exc}")
                                   ^
SyntaxError: invalid syntax

Could you see if it fails for you as well?
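
One possible workaround, sketched here under assumptions rather than taken from the DEE2 Dockerfile: keep pip below 21.0 when the interpreter is Python 3.5, since 21.0 is the release that dropped 3.5 support. The helper name pip_requirement_for is hypothetical, and only the 3.5 case from this issue is handled.

```shell
#!/usr/bin/env bash
# pip_requirement_for PYTHON_VERSION
# Hypothetical helper: print the pip requirement to install for a given
# Python "major.minor" version. Only the case from this issue is covered:
# pip 21.0 dropped Python 3.5, so 3.5 must stay below 21.
pip_requirement_for() {
  case "$1" in
    3.5) echo 'pip<21' ;;   # last pip series that still runs on Python 3.5
    *)   echo 'pip' ;;      # assumption: no pin needed for other versions
  esac
}

# Example use in a build step (not executed here):
#   pip3 install --upgrade "$(pip_requirement_for 3.5)"
```

In the Dockerfile itself this would amount to replacing the upgrade step with RUN pip3 install --upgrade 'pip<21' while staying on Ubuntu 16.04.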

markziemann commented 2 years ago

Hello, I haven't tried to build the Docker image recently, as it is simply easier to pull it from Docker Hub. Could you try that?

maciejmotyka commented 2 years ago

Pulling the image from Docker Hub works. I wanted to rebuild the image with a newer version of the SRA Toolkit because prefetch has been extremely unreliable lately, causing the pipeline to fail constantly. I'm not sure if it's my network or if something changed at the SRA database, but prefetch fails 9 times out of 10 when downloading a ~500 MB file, e.g. SRR2637697. I have the newest version of prefetch locally; it takes a whole 5 minutes to download this run over HTTP, but at least it doesn't fail.

Anyway, I fixed the pip/Python conflict by starting from Ubuntu 18.04, but then the build crashed due to an invalid link to the STAR binary at:

Step 26/34 : RUN   cd sw &&   wget -c "https://github.com/alexdobin/STAR/raw/master/bin/Linux_x86_64_static/STAR" &&   chmod +x STAR &&   cp STAR /usr/local/bin/STAR

I think I'll just write a script to download all necessary runs locally and then copy them over to the DEE2 container for processing.

markziemann commented 2 years ago

Another way to circumvent this problem without altering the Docker image is to prefetch first, using whichever SRA Toolkit version you like, and then run the Docker image with the -d parameter, which searches the current working directory for SRA archives.

docker run -v $(pwd):/dee2/mnt mziemann/tallyup hsapiens -d

I have been running it this way on our HPC as it keeps the CPUs busier. Could you give this a try?

Use prefetch like this

prefetch -X 9999999999999 -o ${ORG}_${SRR}.sra $SRR

where SRR is the run accession and ORG is the species, e.g. hsapiens
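
The prefetch step above could be wrapped in a small loop for several runs. This is only a sketch: the function names sra_filename and prefetch_all are made up for illustration, and skipping files that already exist is an assumption about what is convenient, not something the pipeline requires. The prefetch flags are the ones shown above.

```shell
#!/usr/bin/env bash
# sra_filename ORG SRR
# Build the file name used above, e.g. hsapiens_SRR2637697.sra
sra_filename() {
  printf '%s_%s.sra\n' "$1" "$2"
}

# prefetch_all ORG SRR [SRR...]
# Download each run into the current directory, skipping runs whose
# output file already exists and is non-empty.
prefetch_all() {
  local org="$1"; shift
  local srr out
  for srr in "$@"; do
    out="$(sra_filename "$org" "$srr")"
    if [ -s "$out" ]; then
      continue
    fi
    # Same prefetch invocation as above; report failures but keep going.
    prefetch -X 9999999999999 -o "$out" "$srr" \
      || echo "prefetch failed for $srr" >&2
  done
}
```

After something like prefetch_all hsapiens SRR2637697 finishes, the docker run command with -d shown earlier can be started from the same directory.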

maciejmotyka commented 2 years ago

Thanks. That's ultimately what I ended up doing.

I've set up a downloader pod on a fast-network node with the latest SRA-Tools Docker image from https://hub.docker.com/r/ncbi/sra-tools and another pod running the pipeline image on a high-memory node. The downloader pod saves SRA files to the same storage directory that the pipeline pod has mounted under /dee2/mnt, so the two can work asynchronously.
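
With two pods sharing one directory, one thing that may be worth guarding against is the pipeline picking up a file that is still being written. A rough sketch of one way to do that, assuming the pipeline only looks at names ending in .sra and that a rename inside the same directory is atomic on the shared storage (neither is confirmed in this thread); fetch_and_publish is a hypothetical name:

```shell
#!/usr/bin/env bash
# fetch_and_publish ORG SRR DEST_DIR
# Download to a temporary ".part" name inside DEST_DIR, then rename it to
# the final ORG_SRR.sra name only if prefetch succeeded.
fetch_and_publish() {
  local org="$1" srr="$2" dest_dir="$3"
  local final="$dest_dir/${org}_${srr}.sra"
  local part="$final.part"
  # Assumption: prefetch accepts any output path given with -o.
  prefetch -X 9999999999999 -o "$part" "$srr" \
    && mv -- "$part" "$final"
}
```

The downloader pod would call, for example, fetch_and_publish hsapiens SRR2637697 on its own mount of the shared storage directory.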