nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Adapter trimming - Segmentation fault #654

Closed by pawanchk 5 months ago

pawanchk commented 6 months ago

Hi,

I tried trimming adapters from the post-basecalling BAM file using dorado trim, but it ends in this error -

Segmentation fault (core dumped).

I tried Dorado version 0.5.2+7969fab.

Could you please let me know how I can resolve this?

HalfPhoton commented 6 months ago

Hi @pawanchk, Please can you share the full command you used and the log output and system information?

Kind regards, Rich

pawanchk commented 6 months ago

Hi Rich,

Thank you for your response.

This is the command I used - ~/Dorado/dorado-0.5.2-linux-x64/bin/dorado trim -v sample.pass.bam > sample.pass.trimmed.bam

sample.pass.bam was obtained from the wf-basecalling workflow using the pod5 file as input. Do let me know if you need more details on that.

This is the log of dorado trim -

 [debug] > adapter/primer trimming threads 231, writer threads 25
 [info] > starting adapter/primer trimming
 [debug] Processed 0 reads
 [debug] Processed 0 reads
 [debug] Processed 0 reads
 [debug] Processed 0 reads
/var/spool/pbs/mom_priv/jobs/6185766.pbs101.SC: line 15: 1289837 Segmentation fault      (core dumped) ~/Dorado/dorado-0.5.2-linux-x64/bin/dorado trim -v sample.pass.bam > sample.pass.trimmed.bam

I used these settings for the PBS script -

#PBS -l select=1:ncpus=12:mem=128G
#PBS -l walltime=12:00:00

This is the system Info -

CPU : AMD EPYC 7713
OS  : RHEL 8.4 (Ootpa)

Please let me know if any more information is needed.

HalfPhoton commented 6 months ago

Does your command run locally, i.e. not in your PBS cluster?

pawanchk commented 6 months ago

No, I ran it in the PBS cluster using these settings in my PBS script

#PBS -l select=1:ncpus=12:mem=128G
#PBS -l walltime=12:00:00

HalfPhoton commented 6 months ago

Can you run dorado trim on the data locally? I'm trying to deduce if the error lies with dorado, the data or the system.

Kind regards, Rich

pawanchk commented 6 months ago

Hi @HalfPhoton

I tried running it locally, but that also ends in a segmentation fault. Please see below for the command used and the log message -

$ ~/Dorado/dorado-0.5.2-linux-x64/bin/dorado trim -v sample.pass.bam > sample.pass.trimmed.bam
[2024-03-06 09:38:51.971] [debug] > adapter/primer trimming threads 116, writer threads 12
[2024-03-06 09:38:52.086] [info] > starting adapter/primer trimming
[2024-03-06 09:38:52.087] [debug] Processed 0 reads
[2024-03-06 09:38:52.087] [debug] Processed 0 reads
Segmentation fault (core dumped)

tijyojwad commented 6 months ago

Hi @pawanchk - can you run with -vv to get a more detailed log?

pawanchk commented 6 months ago

Hi @tijyojwad

I tried running with -vv; please see the log below. The only additional log line with this parameter is "Checking adapter/primer LSK109".

$ ~/Dorado/dorado-0.5.2-linux-x64/bin/dorado trim -vv sample.pass.bam > sample.pass.trimmed.bam
[2024-03-19 10:53:41.742] [debug] > adapter/primer trimming threads 116, writer threads 12
[2024-03-19 10:53:41.902] [info] > starting adapter/primer trimming
[2024-03-19 10:53:41.902] [debug] Processed 0 reads
[2024-03-19 10:53:41.902] [debug] Processed 0 reads
[2024-03-19 10:53:41.902] [trace] Checking adapter/primer LSK109
Segmentation fault (core dumped)

tijyojwad commented 6 months ago

Great, it looks like you're able to reproduce this very easily. Can you share this sample.pass.bam file?

tijyojwad commented 5 months ago

Hi @pawanchk are you able to share the file?

pawanchk commented 5 months ago

Hi @tijyojwad sorry, I missed your msg earlier

I am not able to share the data file due to data privacy restrictions. However, I downloaded a test dataset from the Nanopore open datasets (https://labs.epi2me.io/tutorials/). I am going to try trimming this dataset and will let you know how it goes.

pawanchk commented 5 months ago

Hi @tijyojwad

Following up on my message from earlier today: I processed the open data (gm24385_2020.09) in Epi2Me labs (https://labs.epi2me.io/tutorials/) and ran dorado trim the same way as for my own sample, and the same error persists. Please see below -

This is the file that I used - gm24385_2020.09/analysis/r9.4.1/20200914_1354_6B_PAF27096_e7c9eae6/guppy_v4.0.11_r9.4.1_hac_prom/align_unfiltered/calls2ref.bam

This is how I ran dorado trim - ~/Dorado/dorado-0.5.2-linux-x64/bin/dorado trim -v calls2ref.bam > calls2ref.trimmed.bam

This is the error log -

[2024-04-03 16:03:59.022] [debug] > adapter/primer trimming threads 231, writer threads 25
[2024-04-03 16:11:15.273] [info] > starting adapter/primer trimming
[2024-04-03 16:11:15.274] [debug] Processed 0 reads
[2024-04-03 16:11:15.274] [debug] Processed 0 reads
[2024-04-03 16:11:15.274] [debug] Processed 0 reads
[2024-04-03 16:11:15.274] [debug] Processed 0 reads
/var/spool/pbs/mom_priv/jobs/6563201.pbs101.SC: line 14: 2770327 Segmentation fault (core dumped)

Any insights on how to resolve this issue will be very helpful.

HalfPhoton commented 5 months ago

Hi @pawanchk, Do you also have this issue in Dorado 0.6.0 which was released this week?

pawanchk commented 5 months ago

Hi @HalfPhoton Thanks for your response.

I tried the open data (gm24385_2020.09) in Epi2Me labs (https://labs.epi2me.io/tutorials/) with the latest release of Dorado v0.6.0, it ran successfully without any error/segmentation fault.

This is the top and bottom part of the log -

[2024-04-04 11:06:05.794] [info] Running: "trim" "-v" "~/gm24385_2020.09/analysis/r9.4.1/20200914_1354_6B_PAF27096_e7c9eae6/guppy_v4.0.11_r9.4.1_hac_prom/align_unfiltered/calls2ref.bam"
[2024-04-04 11:06:05.794] [debug] > adapter/primer trimming threads 231, writer threads 25
[2024-04-04 11:16:58.588] [info] > starting adapter/primer trimming
[2024-04-04 11:17:09.466] [debug] Processed 50000 reads
[2024-04-04 11:17:22.027] [debug] Processed 100000 reads
[2024-04-04 11:17:34.914] [debug] Processed 150000 reads
[2024-04-04 11:17:40.062] [debug] Processed 200000 reads
[2024-04-04 11:17:42.442] [debug] Processed 250000 reads
[2024-04-04 11:17:46.432] [debug] Processed 300000 reads
[2024-04-04 11:17:52.207] [debug] Processed 350000 reads
. . . .
[2024-04-04 11:32:39.776] [debug] Processed 4600000 reads
[2024-04-04 11:32:43.772] [debug] Processed 4650000 reads
[2024-04-04 11:32:45.982] [debug] Processed 4700000 reads
[2024-04-04 11:32:49.879] [debug] Processed 4750000 reads
[2024-04-04 11:32:51.966] [debug] Processed 4800000 reads
[2024-04-04 11:32:56.254] [debug] Processed 4850000 reads
[2024-04-04 11:32:59.209] [debug] Processed 4900000 reads
[2024-04-04 11:33:01.794] [debug] Total reads processed: 4938711
[2024-04-04 11:33:01.935] [info] > Simplex reads basecalled: 3454633
[2024-04-04 11:33:01.935] [info] > finished adapter/primer trimming

It also worked successfully for the sample data I have.

One surprising thing I noticed is that the BAM file after trimming is much bigger (almost twice the size). I observed this for both the open data and my own sample data. I would expect the file size to decrease, since trimming removes part of the reads. Please let me know your thoughts on this.

tijyojwad commented 5 months ago

Hi @pawanchk - that looks like a configuration bug in the dorado trim application. Instead of outputting BAM we're outputting SAM, so the output is larger. We'll get this fixed in the next release. In the meantime, if you run the output of trim through samtools view -b you should get a smaller file. Sorry about that!
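
A minimal sketch of the workaround above, assuming samtools is on the PATH. The helper names (looks_like_bam, convert_if_sam) and the file names are hypothetical and chosen only for illustration. The check relies on BAM being BGZF-compressed (gzip magic bytes 1f 8b) while SAM is plain text, so it is a quick heuristic rather than a full format validation.

```shell
# Sketch only: helper names and file names are illustrative assumptions.

# Succeed if the file starts with the gzip/BGZF magic bytes 1f 8b.
# BAM is BGZF-compressed and SAM is plain text, so this is a quick
# heuristic (a gzip-compressed SAM file would also pass).
looks_like_bam() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Convert to BAM with `samtools view -b` only when the input does not
# already look compressed. Assumes samtools is installed and on PATH.
convert_if_sam() {
    in_file=$1
    out_file=$2
    if looks_like_bam "$in_file"; then
        echo "$in_file already looks like BAM; nothing to do"
    else
        samtools view -b -o "$out_file" "$in_file"
    fi
}

# Example (not run here):
#   convert_if_sam sample.pass.trimmed.bam sample.pass.trimmed.recompressed.bam
```

If the trimmed output turns out to be SAM, the converted file should be considerably smaller than the original output.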

tijyojwad commented 4 months ago

Hi @pawanchk - a fix for the output to be BAM was merged a couple of weeks ago and is available in both v0.6.2 (released 2 weeks ago) and v0.7.0 (released this week).

pawanchk commented 3 months ago

Hi @tijyojwad

Thanks so much for the update, appreciate it a lot.