google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Issue with singularity gpu #514

Closed Phillip-a-richmond closed 2 years ago

Phillip-a-richmond commented 2 years ago

Hello,

I'm trying to debug my installation of the singularity GPU version for a new C4140 GPU node with Tesla V100s. I've run the CPU version successfully in production and am very happy with it, but the shift to GPU is giving me trouble, likely running into an issue with CUDA or TensorFlow.

I have several CUDA modules loaded, but perhaps I'm missing one of the key libraries. I also have TensorFlow in a conda environment, although that dependency is probably already satisfied inside the singularity image.

Here's the code I'm running from the Quickstart:

OUTPUT_DIR="${PWD}/quickstart-output"
INPUT_DIR="${PWD}/quickstart-testdata"
mkdir -p "${OUTPUT_DIR}"

BIN_VERSION="1.3.0"

# Load modules
module load singularity
module load cuda-dcgm/2.2.9.1
module load cuda11.4/toolkit
module load cuda11.4/blas
module load cuda11.4/nsight
module load cuda11.4/profiler
module load cuda11.4/fft
source /mnt/common/Precision/Miniconda3/opt/miniconda3/etc/profile.d/conda.sh
conda activate TensorFlow_GPU

# Pull the image.
singularity pull docker://google/deepvariant:"${BIN_VERSION}-gpu"

# Run
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
  --nv \
  docker://google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="${INPUT_DIR}"/ucsc.hg19.chr20.unittest.fasta \
  --reads="${INPUT_DIR}"/NA12878_S1.chr20.10_10p1mb.bam \
  --regions "chr20:10,000,000-10,010,000" \
  --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
  --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
  --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir"

And here's my error:

2022-02-07 11:50:52.952780: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 48, in <module>
    import tensorflow as tf
  File "/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages/tensorflow/__init__.py", line 444, in <module>
    _ll.load_library(_main_dir)
  File "/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 154, in load_library
    py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.8/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb

I'm wondering if this traceback can help pinpoint the problem I'm experiencing.

Is there something I can run with CUDA to test that implementation on our new GPU server?
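In case it helps, the missing symbol can be demangled with c++filt; it turns out to be an internal TensorFlow C++ method:

```shell
# Demangle the unresolved symbol from the traceback. It names an internal
# TensorFlow C++ method, which (my guess) means the libtfkernel_sobol_op.so
# that got loaded was built against a different TensorFlow than the one
# actually imported -- an ABI mismatch rather than a CUDA problem.
echo '_ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb' | c++filt
# -> tensorflow::OpKernel::TraceString(tensorflow::OpKernelContext const&, bool) const
```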

Thanks! Phil

akolesnikov commented 2 years ago

Hi Phillip,

The DeepVariant Docker image contains a prebuilt GPU version of TensorFlow. Could you try running it without a pre-existing TensorFlow in your Conda environment?
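You could also check on the host whether pip installed a user-level TensorFlow outside of Conda. This is just a guess at the culprit: Singularity bind-mounts $HOME by default, so anything in ~/.local is visible inside the container.

```shell
# Check the host for a user-level TensorFlow that could leak into the container
# via the default $HOME bind mount (a guess at the culprit, not a confirmed fix).
pip list --user 2>/dev/null | grep -i tensorflow || echo "no user-level TensorFlow found via pip"
ls ~/.local/lib/python3.8/site-packages/ 2>/dev/null | grep -i tensorflow || echo "nothing in ~/.local site-packages"
```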

Phillip-a-richmond commented 2 years ago

Removing the conda environment gives me the same error.

pichuan commented 2 years ago

@Phillip-a-richmond Thanks for checking. I can take a look today. Before I made the release, I'm pretty sure I checked Singularity+GPU worked, but I should check again. I'll get a GPU machine and see if I can reproduce the errors you're seeing.

Phillip-a-richmond commented 2 years ago

I've got 1.1-gpu working so I don't think it's an issue with my CUDA. Testing 1.2 now.

Phillip-a-richmond commented 2 years ago

1.2 produces the same error:

2022-02-10 12:57:29.123141: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "/opt/deepvariant/bin/run_deepvariant.py", line 48, in <module>
    import tensorflow as tf
  File "/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages/tensorflow/__init__.py", line 444, in <module>
    _ll.load_library(_main_dir)
  File "/home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages/tensorflow/python/framework/load_library.py", line 154, in load_library
    py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.8/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringERKNS_15OpKernelContextEb

pichuan commented 2 years ago

ok, here are my steps:

Get a GPU machine.

I used the command here: https://github.com/google/deepvariant/blob/r1.3/docs/deepvariant-details.md#command-for-a-gpu-machine-on-google-cloud-platform

My machine:

pichuan@pichuan-gpu:~$ uname -a
Linux pichuan-gpu 5.11.0-1029-gcp #33~20.04.3-Ubuntu SMP Tue Jan 18 12:03:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Install GPU driver and Singularity on the machine:

curl https://raw.githubusercontent.com/google/deepvariant/r1.3/scripts/install_nvidia_docker.sh | bash
curl https://raw.githubusercontent.com/google/deepvariant/r1.3/scripts/install_singularity.sh | bash

Singularity version:

pichuan@pichuan-gpu:~$ singularity --version
singularity version 3.7.0

Got the test data from Quick Start

I followed the steps in https://github.com/google/deepvariant/blob/r1.3/docs/deepvariant-quick-start.md to get small test data.

Run Singularity

# Pull the image.
BIN_VERSION=1.3.0
singularity pull docker://google/deepvariant:"${BIN_VERSION}-gpu"

# Run DeepVariant.
# Using "--nv" and "${BIN_VERSION}-gpu" is important.
singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/ \
  docker://google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="${INPUT_DIR}"/ucsc.hg19.chr20.unittest.fasta \
  --reads="${INPUT_DIR}"/NA12878_S1.chr20.10_10p1mb.bam \
  --regions "chr20:10,000,000-10,010,000" \
  --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
  --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
  --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir" \
  --num_shards=$(nproc)

The command above worked, so I copy/pasted the command from the original post:

singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
  --nv \
  docker://google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="${INPUT_DIR}"/ucsc.hg19.chr20.unittest.fasta \
  --reads="${INPUT_DIR}"/NA12878_S1.chr20.10_10p1mb.bam \
  --regions "chr20:10,000,000-10,010,000" \
  --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
  --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
  --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir"

which also seems to work.

This command below shows my TensorFlow version:

pichuan@pichuan-gpu:~$ singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/   docker://google/deepvariant:"${BIN_VERSION}-gpu"   python -c 'import tensorflow as tf; print(tf.__version__)'
INFO:    Using cached SIF image
2022-02-10 23:13:05.337920: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2.5.0

To confirm the path:

pichuan@pichuan-gpu:~$ singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/   docker://google/deepvariant:"${BIN_VERSION}-gpu"   python -c 'import tensorflow as tf; print(tf.__file__)'
INFO:    Using cached SIF image
2022-02-10 23:12:22.632481: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
/usr/local/lib/python3.8/dist-packages/tensorflow/__init__.py

I have to say I don't really know how Singularity works, but here are a few commands I tried:

pichuan@pichuan-gpu:~$ singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/   docker://google/deepvariant:"${BIN_VERSION}-gpu"   ls /usr/local/lib/python3.8/dist-packages/tensorflow
INFO:    Using cached SIF image
__init__.py  __pycache__  _api  compiler  core  include  keras  libtensorflow_framework.so.2  lite  python  tools  xla_aot_runtime_src
pichuan@pichuan-gpu:~$ ls /usr/local/lib/python3.8/dist-packages/tensorflow
ls: cannot access '/usr/local/lib/python3.8/dist-packages/tensorflow': No such file or directory

Not really sure how helpful this is. @Phillip-a-richmond, if you spot any differences, let me know what I should change to reproduce the error.
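One difference I do notice: in your traceback, TensorFlow is imported from /home/BCRICWH.LAN/prichmond/.local/lib/python3.8/site-packages (Python's per-user site directory), while in my runs it comes from the image's /usr/local/lib/python3.8/dist-packages. A quick way to see why a user-site install wins (plain Python on the host; the path printed is whatever your interpreter reports):

```shell
# Show Python's per-user site directory and whether it is on sys.path.
# The user site (~/.local/lib/pythonX.Y/site-packages) is searched ahead of
# the system dist-packages, so a host-side "pip install --user tensorflow"
# can shadow the TF baked into the image when $HOME is bind-mounted
# (Singularity's default behavior).
python3 -c 'import site, sys; u = site.getusersitepackages(); print(u); print(u in sys.path)'
```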

pichuan commented 2 years ago

Hi @Phillip-a-richmond , if you have any suggestions on how to reproduce this issue, please let me know. I'll close this for now.

Phillip-a-richmond commented 2 years ago

I was able to get around this issue on my version of singularity (3.4.2) by cleaning the environment (limiting what singularity inherits from the host) and by setting the tmp dir explicitly to a working directory on the NFS.

Here's my code chunk:

WORKING_DIR=/mnt/scratch/Precision/Hub/PROCESS/DH4749/
export SINGULARITY_CACHEDIR=$WORKING_DIR
export SINGULARITY_TMPDIR=$WORKING_DIR/tmp/
mkdir -p $WORKING_DIR/tmp/

singularity exec \
    -e \
    -c \
    -H $WORKING_DIR \
    -B $WORKING_DIR/tmp:/tmp \
    -B /usr/lib/locale/:/usr/lib/locale/ \
    -B "${BAM_DIR}":"/bamdir" \
    -B "${FASTA_DIR}":"/genomedir" \
    -B "${OUTPUT_DIR}":"/output" \
    docker://google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WES \
  --ref="/genomedir/$FASTA_FILE" \
  --reads="/bamdir/$PROBAND_BAM" \
  --output_vcf="/output/$PROBAND_VCF" \
  --output_gvcf="/output/$PROBAND_GVCF" \
  --intermediate_results_dir="/output/intermediate" \
  --num_shards=$NSLOTS 

I think newer versions of singularity pass through fewer environment variables (including PYTHONPATH and other things pointing at the home directory and /usr/local/src), which is why you couldn't reproduce the error on a fresh cloud deployment.

This can stay closed since I figured it out on my end, but it may be useful to someone hitting the same issue on a shared HPC with an older singularity version.
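If scrubbing the whole environment with -e/-c is too heavy-handed, a lighter-weight alternative might be to disable Python's user site-packages inside the container. This is a sketch I haven't tested on my cluster: Singularity forwards SINGULARITYENV_-prefixed variables into the container, and PYTHONNOUSERSITE stops CPython from adding ~/.local/... to sys.path, so a host-side user-site TensorFlow can no longer shadow the image's copy.

```shell
# Untested sketch: forward PYTHONNOUSERSITE into the container so CPython
# skips the per-user site directory entirely.
export SINGULARITYENV_PYTHONNOUSERSITE=1

# The effect of PYTHONNOUSERSITE itself is easy to verify outside the container:
PYTHONNOUSERSITE=1 python3 -c 'import site, sys; print(site.getusersitepackages() in sys.path)'
# -> False
```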