nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Dorado 0.7.0 duplex with modified bases: throws '[error] basic_string::substr: __pos' and may abort #855

Open janpb opened 4 months ago

janpb commented 4 months ago

Issue Report

Please describe the issue:

Running dorado 0.7.0 duplex with the v5.0.0 and v4.3.0 models together with modified base models produces multiple errors like this: [error] basic_string::substr: __pos (which is 79687) > this->size() (which is 4725). Some basecalling runs finish; some are aborted.

Simplex runs finish without errors, and so do runs with dorado 0.6.2. I used the same data set for all runs: 10 pod5 files from an Arabidopsis thaliana sequencing run.

We tested three GPUs on three different machines, with the same results.
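For context on the message itself: it is the standard libstdc++ out-of-range text thrown by std::string::substr when the requested start position lies beyond the end of the string, which suggests an index computed against one sequence is being applied to a much shorter one. A minimal standalone C++ reproduction of the message (illustrating only the library behaviour, not dorado's internals; the numbers are copied from the first logged error):

#include <iostream>
#include <stdexcept>
#include <string>

int main() {
    // A stand-in "sequence" of length 4725, matching this->size() in the log.
    std::string seq(4725, 'A');
    try {
        // Requesting a substring starting at position 79687 (> size()) throws
        // std::out_of_range; libstdc++ formats the message exactly as in the log.
        std::string sub = seq.substr(79687);
    } catch (const std::out_of_range& e) {
        std::cerr << "[error] " << e.what() << '\n';
        // prints: [error] basic_string::substr: __pos (which is 79687) > this->size() (which is 4725)
    }
    return 0;
}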

Steps to reproduce the issue:

dorado duplex --recursive --device "cuda:1" sup@5.0.0,MODBASEMOD > out.bam 2>out.log
dorado basecaller --recursive --device "cuda:1" sup@5.0.0,MODBASEMOD > out.bam 2>out.log
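For reference, the concrete invocation in the log below spells out the modified-base models; a filled-in form of the duplex command (the pod5 directory path here is a placeholder):

dorado duplex --recursive --device "cuda:1" sup@v4.3.0,6mA,5mC_5hmC /path/to/pod5_dir > out.bam 2> out.log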

Run environment:

Logs

Pasting the error from one run:

[2024-05-29 15:49:51.221] [info] Running: "duplex" "--recursive" "--device" "cuda:1" "sup@v4.3.0,6mA,5mC_5hmC" "/mnt/local/data/jan/dorado-benchmark/datasets/testset-smaller"
[2024-05-29 15:49:51.225] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-05-29 15:49:51.454] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0 with httplib
[2024-05-29 15:50:04.512] [info]  - downloading dna_r10.4.1_e8.2_5khz_stereo@v1.2 with httplib
[2024-05-29 15:50:09.899] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 with httplib
[2024-05-29 15:50:10.841] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mC_5hmC@v1 with httplib
[2024-05-29 15:50:18.012] [info] Normalised: chunksize 10000 -> 9996
[2024-05-29 15:50:18.012] [info] Normalised: overlap 500 -> 498
[2024-05-29 15:50:30.699] [info] cuda:1 using chunk size 9996, batch size 2496
[2024-05-29 15:50:38.592] [info] cuda:1 using chunk size 10000, batch size 4864
[2024-05-29 15:50:40.390] [info] > Starting Stereo Duplex pipeline
[2024-05-29 15:50:40.405] [info] > Reading read channel info
[2024-05-29 15:50:40.445] [info] > Processed read channel info
[2024-05-29 15:59:42.145] [error] basic_string::substr: __pos (which is 46418) > this->size() (which is 4725)
[2024-05-29 15:59:42.150] [error] basic_string::substr: __pos (which is 7801) > this->size() (which is 6371)
[2024-05-29 15:59:43.649] [error] basic_string::substr: __pos (which is 33987) > this->size() (which is 8177)
[2024-05-29 15:59:46.188] [error] basic_string::substr: __pos (which is 64629) > this->size() (which is 1350)
[2024-05-29 15:59:46.405] [error] basic_string::substr: __pos (which is 91037) > this->size() (which is 7476)
[2024-05-29 15:59:47.153] [error] basic_string::substr: __pos (which is 10764) > this->size() (which is 4783)
[2024-05-29 15:59:48.232] [error] basic_string::substr: __pos (which is 47536) > this->size() (which is 47107)
[2024-05-29 15:59:48.332] [error] basic_string::substr: __pos (which is 8459) > this->size() (which is 8196)
[2024-05-29 15:59:51.924] [error] basic_string::substr: __pos (which is 8432) > this->size() (which is 1810)
...
[2024-05-29 16:15:13.990] [info] > Simplex reads basecalled: 39960
[2024-05-29 16:15:13.990] [info] > Simplex reads filtered: 43
[2024-05-29 16:15:13.990] [info] > Duplex reads basecalled: 1278
[2024-05-29 16:15:13.990] [info] > Duplex rate: 7.684175%
[2024-05-29 16:15:13.990] [info] > Basecalled @ Bases/s: 2.758547e+05
HalfPhoton commented 4 months ago

Hi @janpb, we've seen this issue in a couple of places now and we're investigating. Thanks for the detailed report.

malton-ont commented 4 months ago

Hi @janpb,

Does this still occur if you don't perform modbase detection?

janpb commented 4 months ago

Dear @malton-ont ,

Sorry, I forgot to add this: no, there are no errors without modbase detection.

malton-ont commented 4 months ago

@janpb, thanks for confirming; that tracks with what I can see. Are you able to isolate any specific reads that cause this problem and share them with us?

janpb commented 3 months ago

@malton-ont I can't share the original data, but I can share another pod5 file showing the same error. Please let me know how to get it to you.

malton-ont commented 3 months ago

@janpb, if the file is small enough and you're happy to do so you can just attach it to this issue.

janpb commented 3 months ago

@malton-ont, the file size is 475 MB; I assume that is too large to attach to an issue.

malton-ont commented 3 months ago

@janpb, yes, that would be a bit large. Is there any chance you could subset it down to just a few reads that show the issue? Otherwise I'll check where we can upload something that large.

janpb commented 3 months ago

I added two small datasets of 250 reads each, with the corresponding dorado -vv logfiles. Both subsets originated from the same pod5 file.
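For anyone needing to do the same, the pod5 package ships a CLI that can subset a file by read id. A sketch, assuming the view/filter subcommands and their flags behave as described in the tool's help (verify with pod5 filter --help); all file names are placeholders:

# list read ids, keep the first 250, then filter the pod5 down to those reads
# (column/header flags assumed from pod5's help; adjust as needed)
pod5 view input.pod5 --include "read_id" --no-header | head -n 250 > read_ids.txt
pod5 filter input.pod5 --ids read_ids.txt --output subset.pod5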

malton-ont commented 3 months ago

That's great, thank you! I'll take a look and hopefully we can get a fix in.

caspargross commented 3 months ago

Possibly related to my issue here: https://github.com/nanoporetech/dorado/issues/850

alesanzdro commented 3 months ago

In your issue, @caspargross, you mentioned that you resolved the "[error] basic_string::substr: __pos" error.

Do you think this was a result of modifying the parameters "--chunksize 2000 --batchsize 5000"?

We are running our tests on 2x NVIDIA GeForce RTX 3080 Ti cards (12 GB GDDR6X, 10240 CUDA cores each), which have much less memory than the NVIDIA H100 with its 80 GB (and a very high memory bandwidth of up to 2 TB/s).

In our case, it is not feasible to test with that batchsize value due to hardware limitations, and I fear that the memory limit might be the cause.

However, I am hopeful that the error is due to some adjustment in the models or an internal CUDA issue, since we currently have no problems running the sup model.

We started seeing the errors after moving to version 0.7 and the sup v5.0 model with modified bases (I'm not entirely sure it's the same error). In your case the problem appears with the v4.1 model and version 0.7.1, which suggests it arises from version 0.7 rather than from the models themselves.

We were wondering if the issue could be due to:

Here is a screenshot of a quick search for that error across some tests we ran with versions 0.7 and 0.7.1:

[Screenshot from 2024-06-07 15-18-27]

If 200-800 "basic_string::substr: __pos" errors only meant that up to 800 reads were not correctly called, it would not be such a severe problem. However, the fact that a run sometimes finishes and sometimes does not makes me uneasy about working with that data, as the information might not have been extracted correctly.

We have tested with updated NVIDIA driver (555.42.02) and CUDA (12.5) versions and continue to observe the same issues; some runs finish, others don't.

[Screenshot from 2024-06-11 09-18-21]

The latest tests were done using CUDA 11.8, as used by the Dorado development build, inside Docker (with NVIDIA driver 555.42.02 on my system).

I am sharing the Dockerfile in case it is helpful to anyone:

## SETUP - RUN
############################################################################
# docker build -t dorado_0.7.1 .
# docker run --gpus all -it dorado_0.7.1:latest /opt/bin/dorado -h

## In case it does not properly recognize the GPUs in Docker
############################################################################
# curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
# curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list > /etc/apt/sources.list.d/nvidia-docker.list
# apt update
# apt -y install nvidia-container-toolkit
# systemctl restart docker

# Use the NVIDIA base image with CUDA 11.8 and cuDNN 8, including nvcc
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

# Set non-interactive environment to avoid prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive

# Configure environment variables for the proxy
#ENV http_proxy="http://xxxx:8080"
#ENV https_proxy="http://xxxx:8080"
#ENV HTTP_PROXY="http://xxxx:8080"
#ENV HTTPS_PROXY="http://xxxx:8080"

# Install necessary dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        curl \
        git \
        ca-certificates \
        build-essential \
        libhdf5-dev \
        libssl-dev \
        libzstd-dev \
        autoconf \
        automake \
        python3-venv \
        gcc-9 \
        g++-9 \
        wget \
        kmod \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 60 --slave /usr/bin/g++ g++ /usr/bin/g++-9 \
    && rm -rf /var/lib/apt/lists/*

# Configure Python and pip for the virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install --upgrade pip

# Install pre-commit and cmake
RUN pip install pre-commit "cmake>=3.23"

# Clone the Dorado repository
WORKDIR /workspace
RUN git clone https://github.com/nanoporetech/dorado.git
WORKDIR /workspace/dorado

# Configure pre-commit
RUN pre-commit install

# Set the necessary environment variables for CUDA
ENV PATH="/usr/local/cuda-11.8/bin:$PATH"
ENV LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH"

# Build the Dorado project
RUN cmake -DCUDAToolkit_ROOT=/usr/local/cuda-11.8 -S . -B cmake-build \
    && cmake --build cmake-build --config Release -j \
    && ctest --test-dir cmake-build

# Install Dorado
RUN cmake --install cmake-build --prefix /opt

# Configure the environment variable for PyTorch
ENV PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128"

# Command to run Dorado
CMD ["/opt/bin/dorado"]
caspargross commented 3 months ago

Hi @alesanzdro,

this is only a workaround that I posted to help find the bug, not a real fix. If I understand correctly, you should always let the basecaller find the optimal batchsize; @vellamike mentioned here that choosing an incorrect batchsize impacts basecalling accuracy. With chunked data, the batchsize estimation makes up a large part of the total runtime, so if we could somehow avoid it we would save a lot of time.
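For illustration, the two modes side by side (paths and model choice are placeholders; dorado's --batchsize defaults to 0, which means auto-select):

# explicit sizes, the workaround discussed in #850:
dorado duplex --chunksize 2000 --batchsize 5000 --device "cuda:0" sup@v5.0.0 /path/to/pod5_dir > out.bam
# omitting --batchsize (or passing 0) lets dorado benchmark the GPU and pick the optimal value:
dorado duplex --device "cuda:0" sup@v5.0.0 /path/to/pod5_dir > out.bam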

I think the only real solution at the moment is to run duplex basecalling without modifications, or to use an older dorado version.

mjlarsen commented 3 months ago

Any news about this issue? I am having the exact same issue with Dorado v0.7.2.

[2024-06-24 02:46:07.024] [error] basic_string::substr: __pos (which is 13683) > this->size() (which is 3774)
[2024-06-24 02:46:07.064] [error] basic_string::substr: __pos (which is 58131) > this->size() (which is 5828)
[2024-06-24 02:46:08.251] [error] basic_string::substr: __pos (which is 4187) > this->size() (which is 965)
[2024-06-24 02:46:08.262] [error] basic_string::substr: __pos (which is 3297) > this->size() (which is 534)
[2024-06-24 02:46:08.267] [error] basic_string::substr: __pos (which is 24041) > this->size() (which is 2215)
[2024-06-24 02:46:08.309] [error] basic_string::substr: __pos (which is 1309) > this->size() (which is 719)
[2024-06-24 02:46:08.319] [error] basic_string::substr: __pos (which is 25600) > this->size() (which is 9243)
[2024-06-24 02:46:08.360] [error] basic_string::substr: __pos (which is 49098) > this->size() (which is 4319)
[2024-06-24 02:46:09.165] [error] basic_string::substr: __pos (which is 22403) > this->size() (which is 2193)
[2024-06-24 02:46:09.218] [error] basic_string::substr: __pos (which is 30541) > this->size() (which is 19041)
[2024-06-24 02:46:09.236] [error] basic_string::substr: __pos (which is 1277) > this->size() (which is 575)
[2024-06-24 02:46:09.298] [error] basic_string::substr: __pos (which is 15707) > this->size() (which is 2733)
[2024-06-24 02:46:09.303] [error] basic_string::substr: __pos (which is 14896) > this->size() (which is 3361)
[2024-06-24 02:46:10.134] [error] basic_string::substr: __pos (which is 4861) > this->size() (which is 2706)
[2024-06-24 02:46:10.158] [error] basic_string::substr: __pos (which is 8160) > this->size() (which is 815)
[2024-06-24 02:46:10.640] [error] basic_string::substr: __pos (which is 28165) > this->size() (which is 5927)
[2024-06-24 02:46:10.977] [error] basic_string::substr: __pos (which is 73316) > this->size() (which is 20669)
[2024-06-24 03:02:09.666] [error] basic_string::substr: __pos (which is 2290) > this->size() (which is 1037)
[2024-06-24 03:02:12.099] [error] basic_string::substr: __pos (which is 3136) > this->size() (which is 733)
[2024-06-24 03:02:13.410] [error] basic_string::substr: __pos (which is 11756) > this->size() (which is 734)
[2024-06-24 03:02:13.512] [error] basic_string::substr: __pos (which is 4732) > this->size() (which is 1168)
[2024-06-24 03:02:13.523] [error] basic_string::substr: __pos (which is 4431) > this->size() (which is 1259)
[2024-06-24 03:02:13.526] [error] basic_string::substr: __pos (which is 6791) > this->size() (which is 2150)
[2024-06-24 03:02:13.579] [error] basic_string::substr: __pos (which is 23626) > this->size() (which is 5390)
[2024-06-24 03:02:13.607] [error] basic_string::substr: __pos (which is 6450) > this->size() (which is 2395)
[2024-06-24 03:02:13.665] [error] basic_string::substr: __pos (which is 10489) > this->size() (which is 9090)
[2024-06-24 03:02:13.718] [error] basic_string::substr: __pos (which is 11403) > this->size() (which is 2602)
[2024-06-24 03:02:15.140] [error] basic_string::substr: __pos (which is 3968) > this->size() (which is 649)
[2024-06-24 03:02:15.445] [error] basic_string::substr: __pos (which is 7427) > this->size() (which is 1723)
[2024-06-24 03:02:15.480] [error] basic_string::substr: __pos (which is 15290) > this->size() (which is 2080)
[2024-06-24 03:02:16.531] [error] basic_string::substr: __pos (which is 67056) > this->size() (which is 5094)
[2024-06-24 03:02:16.689] [error] basic_string::substr: __pos (which is 8441) > this->size() (which is 6599)
[2024-06-24 03:02:16.699] [error] basic_string::substr: __pos (which is 7125) > this->size() (which is 1588)
[2024-06-24 03:02:16.755] [error] basic_string::substr: __pos (which is 8333) > this->size() (which is 7416)
[2024-06-24 03:02:17.567] [error] basic_string::substr: __pos (which is 1442) > this->size() (which is 466)
[2024-06-24 03:02:17.574] [error] basic_string::substr: __pos (which is 18042) > this->size() (which is 16029)
[2024-06-24 03:02:18.258] [error] basic_string::substr: __pos (which is 16413) > this->size() (which is 1517)
[2024-06-24 03:02:18.430] [error] basic_string::substr: __pos (which is 17100) > this->size() (which is 766)
[2024-06-24 03:02:19.194] [error] basic_string::substr: __pos (which is 2109) > this->size() (which is 581)
[2024-06-24 03:02:19.207] [error] basic_string::substr: __pos (which is 31814) > this->size() (which is 3774)
[2024-06-24 03:02:19.363] [error] basic_string::substr: __pos (which is 11689) > this->size() (which is 1072)
malton-ont commented 3 months ago

Hi @mjlarsen,

We're working on it and hope to have a fix in an upcoming version.

mjlarsen commented 1 month ago

We're working on it and hope to have a fix in an upcoming version.

I just tested the new v0.7.3 and am still encountering the exact same error when using duplex.

What are the chances of getting this fixed soon? Do you have a timeline or an estimate for when we can expect a solution?

Thank you for your help.

vellamike commented 1 month ago

Hi @mjlarsen - this will be fixed in the next release of Dorado.

mattloose commented 1 month ago

I'm seeing this on a lot of runs - does this mean that some reads fail duplex calling?

mjlarsen commented 1 month ago

Hi @mjlarsen - this will be fixed in the next release of Dorado.

Really looking forward to the next Dorado release with the fix for duplex calling. I've got a bunch of samples waiting for basecalling, but it doesn't make much sense to go ahead with the new v5 model if I can't get a duplex consensus. Do you have an idea when the next release will be out?

Thanks!

malton-ont commented 1 month ago

I'm seeing this on a lot of runs - does this mean that some reads fail duplex calling?

@mattloose,

The duplex basecalling has succeeded, but the modification calling will have failed.
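One way to gauge how many records in an affected BAM actually carry modification calls is to count those with the standard SAM base-modification tags (MM/ML); a sketch using samtools, with out.bam as a placeholder:

# count records carrying a modified-base MM tag
samtools view out.bam | grep -c 'MM:Z:'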

cabbagesofdoom commented 3 weeks ago

Hi. I am also getting this error and a lot of runs are affected. Would it be possible to have an ETA on the version with the fix so that we can plan accordingly? (I won't bother basecalling without mods if it's going to be a week or so.) Thanks!

malton-ont commented 2 weeks ago

This issue should now be resolved in dorado 0.8.0, released yesterday.

janpb commented 1 week ago

@malton-ont: It seems to be fixed; I do not get these errors anymore with my data sets. Thanks.

It would help if somebody else could also confirm that the error is gone on their data sets.