Closed: Phillip-a-richmond closed this issue 2 years ago.
Hi Phillip,
The DeepVariant Docker image contains a prebuilt GPU version of TensorFlow. Could you try running it in a Conda environment without a pre-existing TensorFlow install?
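One quick way to check for a pre-existing user-level TensorFlow, independent of Conda, is to inspect Python's per-user site directory (a sketch; assumes `python3` is the interpreter whose user site maps to `~/.local`):

```shell
# Print the per-user site-packages directory (e.g. ~/.local/lib/python3.8/site-packages)
# and check whether a user-level TensorFlow is installed there.
USER_SITE="$(python3 -m site --user-site || true)"
echo "user site: ${USER_SITE}"
if [ -d "${USER_SITE}/tensorflow" ]; then
    echo "user-site TensorFlow found (may shadow the container's copy)"
else
    echo "no user-site TensorFlow"
fi
```

Packages in the user site are visible to any Python of the matching version, including one running inside a container when the home directory is bind-mounted.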
Removing the conda environment gives me the same error.
@Phillip-a-richmond Thanks for checking. I can take a look today. Before I made the release, I'm pretty sure I checked Singularity+GPU worked, but I should check again. I'll get a GPU machine and see if I can reproduce the errors you're seeing.
I've got 1.1-gpu working, so I don't think it's an issue with my CUDA setup. Testing 1.2 now.
1.2 produces the same error:
2022-02-10 12:57:29.123141: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 48, in <module>
    import tensorflow as tf
  File "/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages/tensorflow/__init__.py", line 444, in <module>
    _ll.load_library(_main_dir)
  File "/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 154, in load_library
    py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.8/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb
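Note that the traceback shows Python importing TensorFlow from the per-user site (`/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages`) while the failing kernel library lives in the container's `/usr/local/lib/python3.8/dist-packages`; mixing two different TensorFlow builds this way commonly produces undefined-symbol errors. The shadowing mechanism itself can be sketched without TensorFlow at all (dummy package name `mypkg`, temporary paths, all hypothetical):

```shell
# Build two dummy copies of a package and show that the directory earlier
# on sys.path shadows the later one -- the same mechanism by which a
# ~/.local TensorFlow shadows the container's /usr/local copy.
DEMO="$(mktemp -d)"
mkdir -p "${DEMO}/user_site/mypkg" "${DEMO}/system_site/mypkg"
echo 'VERSION = "user"'   > "${DEMO}/user_site/mypkg/__init__.py"
echo 'VERSION = "system"' > "${DEMO}/system_site/mypkg/__init__.py"
PYTHONPATH="${DEMO}/user_site:${DEMO}/system_site" \
    python3 -c 'import mypkg; print(mypkg.VERSION)'   # prints "user"
rm -rf "${DEMO}"
```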
OK, here are my steps:
I used the command here: https://github.com/google/deepvariant/blob/r1.3/docs/deepvariant-details.md#command-for-a-gpu-machine-on-google-cloud-platform
My machine:
pichuan@pichuan-gpu:~$ uname -a
Linux pichuan-gpu 5.11.0-1029-gcp #33~20.04.3-Ubuntu SMP Tue Jan 18 12:03:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
curl https://raw.githubusercontent.com/google/deepvariant/r1.3/scripts/install_nvidia_docker.sh | bash
curl https://raw.githubusercontent.com/google/deepvariant/r1.3/scripts/install_singularity.sh | bash
Singularity version:
pichuan@pichuan-gpu:~$ singularity --version
singularity version 3.7.0
I followed the steps in https://github.com/google/deepvariant/blob/r1.3/docs/deepvariant-quick-start.md to get small test data.
# Pull the image.
BIN_VERSION=1.3.0
singularity pull docker://google/deepvariant:"${BIN_VERSION}-gpu"
# Run DeepVariant.
# Using "--nv" and "${BIN_VERSION}-gpu" is important.
singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/ \
  docker://google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="${INPUT_DIR}"/ucsc.hg19.chr20.unittest.fasta \
  --reads="${INPUT_DIR}"/NA12878_S1.chr20.10_10p1mb.bam \
  --regions "chr20:10,000,000-10,010,000" \
  --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
  --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
  --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir" \
  --num_shards=$(nproc)
The command above worked, so I copy/pasted the command from the original post:
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
  --nv \
  docker://google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="${INPUT_DIR}"/ucsc.hg19.chr20.unittest.fasta \
  --reads="${INPUT_DIR}"/NA12878_S1.chr20.10_10p1mb.bam \
  --regions "chr20:10,000,000-10,010,000" \
  --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
  --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
  --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir"
which also seems to work.
This command below shows my TensorFlow version:
pichuan@pichuan-gpu:~$ singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/ docker://google/deepvariant:"${BIN_VERSION}-gpu" python -c 'import tensorflow as tf; print(tf.__version__)'
INFO: Using cached SIF image
2022-02-10 23:13:05.337920: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2.5.0
To confirm the path:
pichuan@pichuan-gpu:~$ singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/ docker://google/deepvariant:"${BIN_VERSION}-gpu" python -c 'import tensorflow as tf; print(tf.__file__)'
INFO: Using cached SIF image
2022-02-10 23:12:22.632481: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py
I have to say I don't really know how Singularity works, but here are a few commands I tried:
pichuan@pichuan-gpu:~$ singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/ docker://google/deepvariant:"${BIN_VERSION}-gpu" ls /usr/local/lib/python3.8/dist-packages/tensorflow
INFO: Using cached SIF image
__init__.py __pycache__ _api compiler core include keras libtensorflow_framework.so.2 lite python tools xla_aot_runtime_src
pichuan@pichuan-gpu:~$ ls /usr/local/lib/python3.8/dist-packages/tensorflow
ls: cannot access '/usr/local/lib/python3.8/dist-packages/tensorflow': No such file or directory
Not really sure how helpful this is. @Phillip-a-richmond, if you spot any differences, let me know what I can change to reproduce the error.
Hi @Phillip-a-richmond, if you have any suggestions on how to reproduce this issue, please let me know. I'll close this for now.
I was able to get around this issue with my version of Singularity (3.4.2) by cleaning the environment, limiting what gets passed into the container from the host environment, and setting the tmp dir explicitly to a working directory on the NFS.
Here's my code chunk:
WORKING_DIR=/mnt/scratch/Precision/Hub/PROCESS/DH4749/
export SINGULARITY_CACHEDIR=$WORKING_DIR
export SINGULARITY_TMPDIR=$WORKING_DIR/tmp/
mkdir -p $WORKING_DIR/tmp/
singularity exec \
  -e \
  -c \
  -H $WORKING_DIR \
  -B $WORKING_DIR/tmp:/tmp \
  -B /usr/lib/locale/:/usr/lib/locale/ \
  -B "${BAM_DIR}":"/bamdir" \
  -B "${FASTA_DIR}":"/genomedir" \
  -B "${OUTPUT_DIR}":"/output" \
  docker://google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WES \
  --ref="/genomedir/$FASTA_FILE" \
  --reads="/bamdir/$PROBAND_BAM" \
  --output_vcf="/output/$PROBAND_VCF" \
  --output_gvcf="/output/$PROBAND_GVCF" \
  --intermediate_results_dir="/output/intermediate" \
  --num_shards=$NSLOTS
I think newer versions of Singularity pass fewer environment variables into the container, including PYTHONPATH and, among other things, paths under the home directory and /usr/local/src, which is why you couldn't reproduce the error on a fresh cloud deployment.
This can stay closed since I figured it out on my end. It may be useful to someone hitting the same issue on a shared HPC with older Singularity versions.
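The effect of cleaning the environment (the `-e`/`-c` flags in the workaround) can be sketched without Singularity at all; here `env -i` stands in for what an isolated container environment does to leaked host variables (temporary paths are hypothetical):

```shell
# A PYTHONPATH leaked from the host injects a stray directory into sys.path;
# a cleaned environment (env -i) carries no PYTHONPATH at all.
STRAY="$(mktemp -d)"
PYTHONPATH="${STRAY}" python3 -c "import sys; print('${STRAY}' in sys.path)"   # True
env -i "$(command -v python3)" -c "import os; print(os.environ.get('PYTHONPATH'))"  # None
rm -rf "${STRAY}"
```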
Hello,
I'm trying to debug my installation of the Singularity GPU version on a new C4140 GPU node with Tesla V100s. I've run the CPU version successfully in production and am very happy with it, but the shift to GPU is giving me trouble; I'm likely running into an issue with CUDA or TensorFlow.
I have several CUDA modules loaded, but perhaps I'm missing one of the key libraries? I also have TensorFlow in a conda environment, although that dependency should already be satisfied inside the Singularity image.
Here's the code I'm running from the Quickstart:
And here's my error:
I'm wondering if this error message can help pinpoint the problem I'm experiencing.
Is there something I can run with CUDA to test that implementation on our new GPU server?
Thanks! Phil
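One low-level sanity check before involving DeepVariant at all is to confirm the driver and, separately, the container's TensorFlow can see the GPUs (a sketch; `nvidia-smi` ships with the NVIDIA driver, and the commented line assumes the `-gpu` image's TensorFlow 2.x API):

```shell
# 1) Driver-level check: nvidia-smi should list the V100s if the driver is working.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv
else
    echo "nvidia-smi not on PATH (driver not installed or module not loaded)"
fi

# 2) TensorFlow-level check inside the container (run on the GPU node):
# singularity run --nv docker://google/deepvariant:"${BIN_VERSION}-gpu" \
#     python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
```

If step 1 fails, the problem is below CUDA/TensorFlow entirely; if step 1 works but step 2 prints an empty list, the issue is in how the container sees the driver (e.g. the `--nv` flag).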