nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
439 stars 53 forks source link

`[error] basic_string::substr: __pos` error with duplex basecalling #879

Closed diego-rt closed 2 weeks ago

diego-rt commented 2 weeks ago

Hi again,

I realised today that dorado duplex is giving me the following error on pretty much every basecalling job I submit. I'm splitting the pod5 by channel and parallelising across several A100 GPUs on an HPC environment. I'm using dorado 0.7.1 with v5 models and modification calling. The jobs are not dying thought but I'm concerned about whether some of the data might be corrupted. It would be good to know whether I should wait for a fix or keep basecalling.

dorado duplex -r sup@v5.0.0,5mCG_5hmCG chunk_pod5/ > R013.basecalls.bam
[diego.terrones@clip-login-0 hifiasm]$ cat .command.log 
INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
[2024-06-10 02:37:30.487] [info] Running: "duplex" "-r" "sup@v5.0.0,5mCG_5hmCG" "chunk_pod5/"
[2024-06-10 02:37:30.715] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-06-10 02:37:40.517] [info] cuda:0 using chunk size 12288, batch size 416
[2024-06-10 02:37:43.680] [info] cuda:0 using chunk size 10000, batch size 2304
[2024-06-10 02:37:44.845] [info] > Starting Stereo Duplex pipeline
[2024-06-10 02:37:44.882] [info] > Reading read channel info
[2024-06-10 02:37:45.161] [info] > Processed read channel info
[2024-06-10 02:37:51.702] [error] basic_string::substr: __pos (which is 90614) > this->size() (which is 18051)
[2024-06-10 02:37:51.752] [info] > Simplex reads basecalled: 142
[2024-06-10 02:37:51.752] [info] > Duplex reads basecalled: 0
[2024-06-10 02:37:51.752] [info] > Duplex rate: 1.466464%
[2024-06-10 02:37:51.752] [info] > Basecalled @ Bases/s: 1.301370e+05

[diego.terrones@clip-login-0 hifiasm]$ cat .command.log 
INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
[2024-06-10 02:38:17.632] [info] Running: "duplex" "-r" "sup@v5.0.0,5mCG_5hmCG" "chunk_pod5/"
[2024-06-10 02:38:17.877] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-06-10 02:38:26.037] [info] cuda:0 using chunk size 12288, batch size 416
[2024-06-10 02:38:29.075] [info] cuda:0 using chunk size 10000, batch size 2304
[2024-06-10 02:38:29.999] [info] > Starting Stereo Duplex pipeline
[2024-06-10 02:38:30.026] [info] > Reading read channel info
[2024-06-10 02:38:30.150] [info] > Processed read channel info
[2024-06-10 02:38:55.264] [error] basic_string::substr: __pos (which is 103325) > this->size() (which is 8364)
[2024-06-10 02:39:04.062] [info] > Simplex reads basecalled: 467
[2024-06-10 02:39:04.062] [info] > Duplex reads basecalled: 5
[2024-06-10 02:39:04.062] [info] > Duplex rate: 2.6557457%
[2024-06-10 02:39:04.062] [info] > Basecalled @ Bases/s: 2.796342e+05

[diego.terrones@clip-login-0 hifiasm]$ cat .command.log 
INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
[2024-06-10 02:57:16.233] [info] Running: "duplex" "-r" "sup@v5.0.0,5mCG_5hmCG" "chunk_pod5/"
[2024-06-10 02:57:16.356] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-06-10 02:57:23.930] [info] cuda:0 using chunk size 12288, batch size 416
[2024-06-10 02:57:27.244] [info] cuda:0 using chunk size 10000, batch size 2304
[2024-06-10 02:57:28.443] [info] > Starting Stereo Duplex pipeline
[2024-06-10 02:57:28.472] [info] > Reading read channel info
[2024-06-10 02:57:28.628] [info] > Processed read channel info
[2024-06-10 03:00:27.648] [error] basic_string::substr: __pos (which is 41560) > this->size() (which is 34552)
[2024-06-10 03:00:28.238] [error] basic_string::substr: __pos (which is 35513) > this->size() (which is 34011)
[2024-06-10 03:00:56.190] [info] > Simplex reads basecalled: 1684
[2024-06-10 03:00:56.190] [info] > Duplex reads basecalled: 11
[2024-06-10 03:00:56.190] [info] > Duplex rate: 1.5166574%
[2024-06-10 03:00:56.190] [info] > Basecalled @ Bases/s: 3.555989e+05

[diego.terrones@clip-login-0 hifiasm]$ cat .command.log 
INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
[2024-06-10 06:53:55.653] [info] Running: "duplex" "-r" "sup@v5.0.0,5mCG_5hmCG" "chunk_pod5/"
[2024-06-10 06:53:55.878] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-06-10 06:54:03.762] [info] cuda:0 using chunk size 12288, batch size 416
[2024-06-10 06:54:06.787] [info] cuda:0 using chunk size 10000, batch size 2304
[2024-06-10 06:54:07.700] [info] > Starting Stereo Duplex pipeline
[2024-06-10 06:54:07.726] [info] > Reading read channel info
[2024-06-10 06:54:07.913] [info] > Processed read channel info
[2024-06-10 07:02:53.659] [error] basic_string::substr: __pos (which is 47341) > this->size() (which is 6445)
[2024-06-10 07:03:06.474] [error] basic_string::substr: __pos (which is 39225) > this->size() (which is 4817)
[2024-06-10 07:03:06.813] [error] basic_string::substr: __pos (which is 23388) > this->size() (which is 17844)
[2024-06-10 07:07:49.735] [error] basic_string::substr: __pos (which is 98584) > this->size() (which is 25498)
[2024-06-10 07:07:49.885] [error] basic_string::substr: __pos (which is 34560) > this->size() (which is 9605)
[2024-06-10 07:07:51.064] [error] basic_string::substr: __pos (which is 55965) > this->size() (which is 28394)
[2024-06-10 07:10:21.234] [error] basic_string::substr: __pos (which is 18002) > this->size() (which is 3562)
[2024-06-10 07:10:21.374] [error] basic_string::substr: __pos (which is 189108) > this->size() (which is 42444)
[2024-06-10 07:10:21.984] [error] basic_string::substr: __pos (which is 208669) > this->size() (which is 64989)
[2024-06-10 07:10:49.227] [error] basic_string::substr: __pos (which is 104029) > this->size() (which is 10141)
[2024-06-10 07:10:49.240] [error] basic_string::substr: __pos (which is 142944) > this->size() (which is 4945)
[2024-06-10 07:10:49.710] [error] basic_string::substr: __pos (which is 5906) > this->size() (which is 1958)
[2024-06-10 07:11:17.285] [info] > Simplex reads basecalled: 15179
[2024-06-10 07:11:17.285] [info] > Duplex reads basecalled: 148
[2024-06-10 07:11:17.285] [info] > Duplex rate: 2.088164%
[2024-06-10 07:11:17.285] [info] > Basecalled @ Bases/s: 5.124254e+05

Some do end up seg faulting though:

INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
[2024-06-10 04:44:35.309] [info] Running: "duplex" "-r" "sup@v5.0.0,5mCG_5hmCG" "chunk_pod5/"
[2024-06-10 04:44:35.557] [info] > No duplex pairs file provided, pairing will be performed automatically
[2024-06-10 04:44:44.911] [info] cuda:0 using chunk size 12288, batch size 416
[2024-06-10 04:44:47.953] [info] cuda:0 using chunk size 10000, batch size 2304
[2024-06-10 04:44:48.951] [info] > Starting Stereo Duplex pipeline
[2024-06-10 04:44:48.980] [info] > Reading read channel info
[2024-06-10 04:44:49.604] [info] > Processed read channel info
[2024-06-10 04:54:57.641] [error] basic_string::substr: __pos (which is 68749) > this->size() (which is 10271)
[2024-06-10 04:54:57.644] [error] basic_string::substr: __pos (which is 4989) > this->size() (which is 1334)
[2024-06-10 04:54:58.066] [error] basic_string::substr: __pos (which is 97648) > this->size() (which is 81984)
[2024-06-10 04:54:58.077] [error] basic_string::substr: __pos (which is 23256) > this->size() (which is 12293)
[2024-06-10 04:54:58.430] [error] basic_string::substr: __pos (which is 38729) > this->size() (which is 11909)
[2024-06-10 04:54:58.456] [error] basic_string::substr: __pos (which is 106317) > this->size() (which is 5373)
[2024-06-10 04:54:58.516] [error] basic_string::substr: __pos (which is 73615) > this->size() (which is 15841)
[2024-06-10 04:54:58.926] [error] basic_string::substr: __pos (which is 109256) > this->size() (which is 96938)
[2024-06-10 04:54:58.953] [error] basic_string::substr: __pos (which is 123645) > this->size() (which is 16955)
/scratch-cbe/users/diego.terrones/1_basecalling/16/8687ad76f4ea38655beda31f19667e/.command.sh: line 7: 18969 Segmentation fault      dorado duplex -r sup@v5.0.0,5mCG_5hmCG chunk_pod5/ > R030.basecalls.bam
malton-ont commented 2 weeks ago

Hi @diego-rt. This is the same as 855. Please follow that issue for updates.