nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

custom barcodes #598

Open Andreas-Bio opened 7 months ago

Andreas-Bio commented 7 months ago

Hi, I am trying this command:

E:\ONT\dorado-0.5.2-win64\bin\dorado.exe basecaller E:\ONT\dorado-0.5.2-win64\bin\dna_r10.4.1_e8.2_400bps_sup@v4.3.0 E:\ONT\reevaladapter\pod5 --min-qscore 7 --emit-fastq -v --barcode-arrangement arrangement.toml --barcode-sequences barcodes.fastq --device cuda:all > E:\ONT\reevaladapter\barcoded.fastq
[2024-01-25 23:52:22.382] [info]  - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-01-25 23:52:22.383] [info] > Creating basecall pipeline
[2024-01-25 23:52:24.507] [debug] cuda:0 memory available: 6.51GB
[2024-01-25 23:52:24.507] [debug] Auto batchsize cuda:0: memory limit 5.51GB
[2024-01-25 23:52:24.508] [debug] Auto batchsize cuda:0: testing up to 192 in steps of 64
[2024-01-25 23:52:24.681] [debug] Auto batchsize cuda:0: 64, time per chunk 1.244996 ms
[2024-01-25 23:52:24.910] [debug] Auto batchsize cuda:0: 128, time per chunk 0.852191 ms
[2024-01-25 23:52:25.165] [debug] Auto batchsize cuda:0: 192, time per chunk 0.623460 ms
[2024-01-25 23:52:25.165] [debug] Device cuda:0 Model memory 3.28GB
[2024-01-25 23:52:25.166] [debug] Device cuda:0 Decode memory 1.35GB
[2024-01-25 23:52:25.176] [info]  - set batch size for cuda:0 to 192
[2024-01-25 23:52:25.177] [debug] - adjusted chunk size to match model stride: 10000 -> 9996
[2024-01-25 23:52:25.193] [debug] Creating barcoding info for kit: arrangement.toml
[2024-01-25 23:52:25.194] [info] Barcode for arrangement.toml
[2024-01-25 23:52:25.195] [debug] - adjusted overlap to match model stride: 500 -> 498
[2024-01-25 23:52:25.196] [debug] Load reads from file E:\ONT\reevaladapter\pod5\ASA059_c574db2e_48b3538e_0.pod5
[2024-01-25 23:52:25.234] [debug] Load reads from file E:\ONT\reevaladapter\pod5\ASA059_c574db2e_48b3538e_1.pod5
[2024-01-25 23:52:25.266] [debug] Load reads from file E:\ONT\reevaladapter\pod5\ASA059_c574db2e_48b3538e_10.pod5
[2024-01-25 23:52:25.301] [debug] Load reads from file E:\ONT\reevaladapter\pod5\ASA059_c574db2e_48b3538e_11.pod5

E:\ONT\dorado-0.5.2-win64\bin>

and it just crashes without giving me any indication of what went wrong.

If I replace the command with --barcode-arrangement doodeloo --barcode-sequences iamazebra the same thing happens. Like, it does not even check if the file exists?

I saw reports from other people and they have an error message. Is that because I am on a Windows machine?
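
Until the missing error message is explained, one workaround is to check the inputs yourself and capture the exit code and stderr of the process. The following is a minimal Python sketch, not part of dorado: the helper names `missing_inputs` and `run_and_report` are made up for illustration, and it assumes the basecaller writes reads to stdout and diagnostics to stderr, as the redirect in the command above suggests.

```python
import subprocess
from pathlib import Path


def missing_inputs(paths):
    """Return the paths from `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


def run_and_report(cmd, stdout_path):
    """Run `cmd`, send its stdout to `stdout_path`, and return (exit_code, stderr_text).

    A non-zero exit code with empty stderr means the process died without
    printing a message, which is the behaviour described in this issue.
    """
    with open(stdout_path, "wb") as out:
        proc = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE)
    return proc.returncode, proc.stderr.decode(errors="replace")
```

Calling `missing_inputs(["arrangement.toml", "barcodes.fastq"])` before launching the run would catch a wrong file name early, and the exit code returned by `run_and_report` shows whether the process failed even when nothing was printed.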

The barcodes file goes like:

F1001 CACATATCAGAGTGCG F1002 ACACACAGACTGTGAG [...] R1192 TGTCTCGTGCTGAGAC

and the arrangement file is


[arrangement]
name = "custom_barcode"

mask1_front = "GGTAG"
mask1_rear = ""
mask2_front = "GGTAG"
mask2_rear = ""

# Barcode sequences
barcode1_pattern = "F%04i"
barcode2_pattern = "R%04i"
first_index = 1001
last_index = 1192

## Scoring options
[scoring]
min_soft_barcode_threshold = 0.2
min_hard_barcode_threshold = 0.2
min_soft_flank_threshold = 0.3
min_hard_flank_threshold = 0.3
min_barcode_score_dist = 0.1
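
As a sanity check on the arrangement above, the barcode names it implies can be listed and compared with the record names in the sequences file. This is a rough Python sketch under the assumption that `barcode1_pattern` and `barcode2_pattern` are expanded printf-style over `first_index` to `last_index` inclusive. The function names are hypothetical and this is not how dorado itself performs the lookup.

```python
def expected_names(pattern, first_index, last_index):
    """Expand a printf-style pattern such as "F%04i" over an inclusive index range."""
    return [pattern % i for i in range(first_index, last_index + 1)]


def fasta_names(fasta_text):
    """Return the first word of every FASTA header line (lines starting with '>')."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.append(words[0])
    return names


def missing_barcodes(patterns, first_index, last_index, fasta_text):
    """List expected barcode names that have no record in the FASTA text."""
    present = set(fasta_names(fasta_text))
    return [
        name
        for pattern in patterns
        for name in expected_names(pattern, first_index, last_index)
        if name not in present
    ]
```

With the values from this arrangement, `missing_barcodes(["F%04i", "R%04i"], 1001, 1192, text)` should return an empty list if every barcode from F1001 to R1192 is present in the sequences file.
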

tijyojwad commented 7 months ago

Hmm, I suspect it might be Windows-related. I tried the same command on my Linux box and I see a failure from dorado saying the file isn't found:

$ ./dorado basecaller dna_r10.4.1_e8.2_400bps_fast\@v4.1.0/ tests/data/pod5/single_na24385.pod5 --barcode-arrangement blah.blah > /dev/null 
[2024-01-29 21:00:51.251] [info] > Creating basecall pipeline
[2024-01-29 21:00:52.002] [info]  - set batch size to 480
libc++abi: terminating due to uncaught exception of type std::runtime_error: toml::parse: file open error -> blah.bla

Perhaps it is checking and throwing an error, but the exception message is somehow being swallowed. I will try on a Windows machine soon.

As it stands, your arrangement file looks fine. Can you confirm that your sequences file is in FASTA format? i.e.

>F1001
AAAA
>F1002
CCCC
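
If the sequences currently exist only as alternating names and sequences, as in the excerpt posted earlier, a short script can write them out in the FASTA layout shown here. This is a minimal Python sketch with a hypothetical function name. It assumes the input is whitespace-separated name/sequence pairs with no elision markers such as "[...]" left in, and that barcodes contain only A, C, G and T.

```python
def pairs_to_fasta(text):
    """Convert "NAME SEQ NAME SEQ ..." into FASTA text, one record per pair."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating names and sequences")
    records = []
    for name, seq in zip(tokens[0::2], tokens[1::2]):
        seq = seq.upper()
        # Reject anything that is not a plain DNA sequence
        if not seq or set(seq) - set("ACGT"):
            raise ValueError(f"unexpected characters in sequence for {name}: {seq}")
        records.append(f">{name}\n{seq}\n")
    return "".join(records)
```

The returned string can be written to a file and passed to `--barcode-sequences`.
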

Andreas-Bio commented 7 months ago

Your guidelines say FASTQ.

"Specification Format

The custom arrangements are defined using a toml file, and custom barcode sequences are passed through a FASTQ file."

https://github.com/nanoporetech/dorado/blob/release-v0.5.2/documentation/CustomBarcodes.md

tijyojwad commented 7 months ago

Ah, thanks for catching that. It should be FASTA. We'll get that updated.

I was able to reproduce this crash without any error reporting, even with garbage filenames. Looks like a Windows thing. I'll take a closer look and get back to you.