nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Error during basecalling with Rapid Barcoding kit #986

Closed desmodus1984 closed 1 week ago

desmodus1984 commented 4 weeks ago

Issue Report

I have previously basecalled data from the Rapid Sequencing kit (SQK-RAD114), and I have now sequenced some samples with the Rapid Barcoding kit (SQK-RBK114.24).

I ran

Run environment:

And I modified my previously working code:

export OMP_NUM_THREADS=35
/home/juaguila/appz/dorado-0.6.2-linux-x64/bin/dorado basecaller --min-qscore 5 --emit-fastq -x "cpu" \
        sup /home/juaguila/Ju760-basecalling/1st-P2 \
        > Ju760.Lig.P2.fastq

to this:

export OMP_NUM_THREADS=40
/home/juaguila/appz/dorado-0.7.3-linux-x64/bin/dorado basecaller --min-qscore 5 \
        --emit-fastq -x "cpu" --kit-name SQK-RBK114-24 \
        sup /RBK-xtra > RBK-xtra.fastq

And now, dorado is prompting a chemistry error.

[2024-08-12 15:37:45.239] [info] Running: "basecaller" "--min-qscore" "5" "--emit-fastq" "-x" "cpu" "--kit-name" "SQK-RBK114-24" "sup" "/RBK-xtra"
[2024-08-12 15:37:45.377] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-08-12 15:37:45.392] [error] Failed to determine sequencing chemistry from data. Please select a model by path

Logs

dorado -v
[2024-08-12 15:53:50.675] [info] Running: "-v"
0.7.3+6e6c45c

Do I need to provide any additional parameters for it to work with the Rapid Barcoding Kit 24 V14?

Thanks;

desmodus1984 commented 4 weeks ago

I created a symlink as suggested before:

ln -s /home/juaguila/appz/dorado-0.6.2-linux-x64/bin/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 dna_r10.4.1_e8.2_400bps_sup@v4.3.0

Then, I changed the code

export OMP_NUM_THREADS=40
/home/juaguila/appz/dorado-0.7.3-linux-x64/bin/dorado basecaller --min-qscore 5 \
        --emit-fastq -x "cpu" --kit-name SQK-RBK114-24 \
        /dna_r10.4.1_e8.2_400bps_sup@v4.3.0 /RBK-xtra > RBK-xtra.fastq

And I got a new error now:

[2024-08-12 16:25:04.927] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
terminate called after throwing an instance of 'std::runtime_error'
what(): toml::parse: file open error -> /dna_r10.4.1_e8.2_400bps_sup@v4.3.0/config.toml
CPU-dorado.RBK-xtra.1.sh: line 4: 62789 Aborted /home/juaguila/appz/dorado-0.7.3-linux-x64/bin/dorado basecaller --min-qscore 5 --emit-fastq -x "cpu" --kit-name SQK-RBK114-24 /dna_r10.4.1_e8.2_400bps_sup@v4.3.0 /RBK-xtra > RBK-xtra.fastq

sklages commented 4 weeks ago

Are you sure about the paths? /dna_r10.4.1_e8.2_400bps_sup@v4.3.0 /RBK-xtra ... both point to files/folder in the root directory of your filesystem. This is probably the cause of your errors ..

My 2p: you won't be happy with P2 data and CPU basecalling ... this will take ages! OMP_NUM_THREADS will have no effect on basecalling. dorado uses approx. five cores at a time for CPU-bound basecalling ... that is really slow.

desmodus1984 commented 4 weeks ago

Hi @sklages

That is what I have. The HPC people here put up a node with Titan V GPUs for people needing GPU, and dorado doesn't work with that GPU! and we do not have money to buy one of the GPUs that dorado is "optimized" for. Thus, I am bound to do basecalling in CPU mode anyways.

Moreover, I feel like wasting money with Nanopore anyway. I used the Rapid Barcoding kit, and I got less than 3GB, ~ 2.9GB, for 4 worm samples, where one has a genome of 200 MB. Thus, even for this only one species, 15X. And, to make it even worse, only half of the data is useful, in the best scenario of using sequences with a Phred score of 15.

That's why CPU basecalling should be optimized when CPUs are cheaper than fancy GPUs.

Documentation expects people to be experts, which I am not, and I - after wasting much time- figured out how to fix it. Since dorado complained about model, and required a model path, I created a symlink as told before and used just the name of the model (bold)

export OMP_NUM_THREADS=40
/home/juaguila/appz/dorado-0.7.3-linux-x64/bin/dorado basecaller --min-qscore 5 \
        --emit-fastq -x "cpu" --kit-name SQK-RBK114-24 \
        **dna_r10.4.1_e8.2_400bps_sup@v4.3.0** /home/juaguila/Ju760-basecalling/RBK-xtra > RBK-xtra.fastq

The model should be plain text (no path), but nonetheless, the pod5 files directory always needs to be full path, I don't understand. If I use just /RBK-xtra it doesn't work. Very weird.

Also, I use the barcoding/demultiplexing and I thought that I would get several fastq files for the classified/misclassified sequences, and I only got a single file. Did I miss a parameter? dorado didn't complain about a missing parameter.

sklages commented 4 weeks ago

> Hi @sklages
>
> That is what I have. The HPC people here put up a node with Titan V GPUs for people needing GPU, and dorado doesn't work with that GPU! and we do not have money to buy one of the GPUs that dorado is "optimized" for. Thus, I am bound to do basecalling in CPU mode anyways.

Did you try or just read the "platform" of the docs? We did basecalling on a bunch of (old) GeForce RTX 2080 which worked pretty well, although not officially "optimized" for..

> Moreover, I feel like wasting money with Nanopore anyway. I used the Rapid Barcoding kit, and I got less than 3GB, ~ 2.9GB, for 4 worm samples, where one has a genome of 200 MB. Thus, even for this only one species, 15X. And, to make it even worse, only half of the data is useful, in the best scenario of using sequences with a Phred score of 15.

It is not just the ONT device, it is input material - quality, amount - library prep, library amount, quite some pitfalls to generate "bad data".

> That's why CPU basecalling should be optimized when CPUs are cheaper than fancy GPUs.

It is not a matter of "cheaper", GPUs are simply more powerful to do the job here. I did some testing with an older version of dorado using a single POD5 file with 4K reads. It took less than 2 minutes on an A100/40G and more than six hours in CPU mode .. (you cannot control the number of threads used for basecalling, dorado uses approx. five CPUs on average).

> Documentation expects people to be experts, which I am not, and I - after wasting much time- figured out how to fix it. Since dorado complained about model, and required a model path, I created a symlink as told before and used just the name of the model (bold)

Documentation is now better than at the beginning .. it is short, but precise. Minknow documentation is by far worse :-)

> export OMP_NUM_THREADS=40

.. is of no use ..

> /home/juaguila/appz/dorado-0.7.3-linux-x64/bin/dorado basecaller --min-qscore 5 \
>         --emit-fastq -x "cpu" --kit-name SQK-RBK114-24 \
>         **dna_r10.4.1_e8.2_400bps_sup@v4.3.0** /home/juaguila/Ju760-basecalling/RBK-xtra > RBK-xtra.fastq
>
> The model should be plain text (no path), but nonetheless, the pod5 files directory always needs to be full path,

Well, yes, not a real problem. POD5 can be searched recursively for basecalling (--recursive), so it is probably smarter to provide a directory for input data .. It is just a different type of parameter.

> I don't understand. If I use just /RBK-xtra it doesn't work. Very weird.

Because / is the root file system of your OS. You are not allowed to store/write in / (and thus to /RBK-xtra). OTOH /home/juaguila/ is your home directory, that is where you are allowed to write/store data. So you will probably never find any data stored in /RBK-xtra. These are Linux basics and are not related to dorado.

> Also, I use the barcoding/demultiplexing and I thought that I would get several fastq files for the classified/misclassified sequences, and I only got a single file. Did I miss a parameter? dorado didn't complain about a missing parameter.

Again, stick to BAM as long as you can. Inline demultiplexing organizes/stores the barcode information (with)in read groups in a single file, BAM or fastq. To get separate files for each barcode you need to run dorado demux <..> in a second step (which is cpu-bound), just as described in https://github.com/nanoporetech/dorado?tab=readme-ov-file#barcode-classification
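The two-step workflow described above could be sketched roughly as follows. This is only an illustration: the function names `basecall_to_bam` and `split_by_barcode` and the file names in the usage comment are made up here, the model name and POD5 path are the ones quoted in this thread, and `--output-dir` / `--no-classify` are taken from the barcode classification section of the README linked above.

```shell
# Model directory name and POD5 path as quoted in this thread.
MODEL="dna_r10.4.1_e8.2_400bps_sup@v4.3.0"
POD5_DIR="/home/juaguila/Ju760-basecalling/RBK-xtra"

basecall_to_bam() {
    # Step 1: basecall and classify barcodes inline.
    # Without --emit-fastq, dorado writes BAM to stdout.
    dorado basecaller -x cpu --kit-name SQK-RBK114-24 "$MODEL" "$POD5_DIR"
}

split_by_barcode() {
    # Step 2: write one file per barcode into directory $2.
    # --no-classify reuses the barcodes already assigned in step 1.
    dorado demux --output-dir "$2" --no-classify "$1"
}

# Usage (hypothetical file names, not executed here):
#   basecall_to_bam > RBK-xtra.bam
#   split_by_barcode RBK-xtra.bam demuxed
```

Keeping the two steps as separate functions mirrors the point made above: classification happens during basecalling, and splitting into files is a separate, CPU-bound pass.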

desmodus1984 commented 4 weeks ago

Hi

> Did you try or just read the "platform" of the docs? We did basecalling on a bunch of (old) GeForce RTX 2080 which worked pretty well, although not officially "optimized" for..

The one that we have here at the HPC is the one that I told you, Titan V GPUs, 24 CPU cores, and 128 GB of memory, it has enough memory, and dorado just doesn't work with it at all.

> It is not just the ONT device, it is input material - quality, amount - library prep, library amount, quite some pitfalls to generate "bad data".

Again more excuses, I got 2.9 GB of data, with fresh DNA, pure - good 260/280,260/230 ratios, yet very low yield. My friend did another sequencing attempt with even better DNA, and got less than 2GB. And that's the constant. We even bought new reagents just in case to make sure that we have high-quality material to prepare libraries, and no improvement at all.

> It is not a matter of "cheaper", GPUs are simply more powerful to do the job here. I did some testing with an older version of `dorado` using a single POD5 file with 4K reads. It took less than 2 minutes on an A100/40G and more than six hours in CPU mode .. (you cannot control the number of threads used for basecalling, `dorado` uses approx. five CPUs on average).

My question is, why you cannot control the number of threads? It seems that the algorithm wasn't properly designed to be fully multithreaded.

> Documentation is now better than at the beginning .. it is short, but precise. Minknow documentation is by far worse :-)

It doesn't say clearly how to use the symlink to do the basecalling. I was told to do it because at the HPC here, the node can connect to the outside but not the outside to the node. Then, when dorado tries to connect to the Internet, it just fails, and the job is cancelled immediately - ANOTHER PROBLEM BESIDES THE GPU CARD INCOMPATIBILITY.

> Well, yes, not a real problem. POD5 can be searched recursively for basecalling (--recursive), so it is probably smarter to provide a directory for input data .. It is just a different type of parameter.

That is what I provided, right? /home/juaguila/Ju760-basecalling/RBK-xtra I even used the directory with / and it didn't work before, it said missing parameter.

> Again, stick to BAM as long as you can. Inline demultiplexing organizes/stores the barcode information (with)in read groups in a single file, BAM or fastq. To get separate files for each barcode you need to run `dorado demux <..>` in a second step (which is cpu-bound), just as described in https://github.com/nanoporetech/dorado?tab=readme-ov-file#barcode-classification

It is senseless to stick to BAM when I will have to go back to fastq anyways. Again it is described but not explicit and clear.

We got a Mk1C and it was a waste of money. It was supposed to be for sequencing and basecalling, to avoid having to buy a separate laptop/workstation, but it failed awfully for sequencing, and crashed within hours of sequencing.

malton-ont commented 4 weeks ago

@desmodus1984 ,

Your primary issue has been identified by @sklages - /RBK-xtra is not looking in the local directory, it is looking in the root of your filesystem. Assuming /home/juaguila/Ju760-basecalling is your working directory, you should use just plain RBK-xtra (no slash) or ./RBK-xtra (with the .) as a relative path. We can look into improving this error message to properly indicate that the path is invalid rather than falling back on the chemistry error.
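To make the difference between the three spellings concrete, here is a small self-contained sketch. It uses a throwaway temporary directory as a stand-in for the working directory, and `check_dir` is a helper invented for this example, not part of dorado.

```shell
# Stand-in for the working directory, containing an RBK-xtra folder.
workdir="$(mktemp -d)"
mkdir -p "$workdir/RBK-xtra"
cd "$workdir" || exit 1

check_dir() {
    # Report whether the given path resolves to an existing directory.
    if [ -d "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_dir RBK-xtra      # relative: resolved against the current directory
check_dir ./RBK-xtra    # the same relative path, written explicitly
check_dir /RBK-xtra     # absolute: looked up in the filesystem root
```

The same resolution rules apply to the model argument, which is why a model folder (or symlink) in the working directory can be given by name alone while `/dna_r10.4.1_e8.2_400bps_sup@v4.3.0` is searched for in the filesystem root.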

> My question is, why you cannot control the number of threads? It seems that the algorithm wasn't properly designed to be fully multithreaded.

Dorado attempts to launch as many CPU basecalling threads as it can fit into the available system memory. CPU calling is heavily memory bound, so this is unlikely to be the full number of system cores. See here.

You say you have access to TitanV GPUs - these should be compatible with dorado as far as I can see (these are the same architecture as GV100s, which we do explicitly list as supported). If you are having issues with this, please open a separate issue.

With regard to the model selection, we are looking at improvements in this area.
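Until then, for a compute node that cannot download models itself, one possible approach is to fetch the model once on a machine with internet access and then pass the model folder by path. The sketch below assumes `dorado download --model` places the model folder in the current directory; the function names and their arguments are hypothetical.

```shell
# Model name as used earlier in this thread.
MODEL="dna_r10.4.1_e8.2_400bps_sup@v4.3.0"

fetch_model() {
    # Run where internet access is available.
    # $1 = directory that should receive the model folder.
    ( cd "$1" && dorado download --model "$MODEL" )
}

basecall_with_local_model() {
    # $1 = directory holding the model folder, $2 = POD5 directory.
    # The model is given as a path, so no model lookup by name is needed.
    dorado basecaller -x cpu "$1/$MODEL" "$2"
}

# Usage (hypothetical paths, not executed here):
#   fetch_model "$HOME/models"
#   basecall_with_local_model "$HOME/models" "$HOME/Ju760-basecalling/RBK-xtra" > RBK-xtra.bam
```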

> It is senseless to stick to BAM when I will have to go back to fastq anyways.

BAM files store more (and more detailed) information than fastq files and some of this information may be necessary along the way. ([2024-08-12 15:37:45.377] [info] - Note: FASTQ output is not recommended as not all data can be preserved.) We recommend sticking to BAM until the final step where you actually need fastq, and performing the conversion then.
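As a rough illustration of doing that conversion only at the end, the sketch below converts every BAM file in a directory to FASTQ. It assumes `samtools` is installed, which is not something discussed in this thread, and `bam_dir_to_fastq` is a name made up for the example.

```shell
bam_dir_to_fastq() {
    # Convert every *.bam in directory $1 to a FASTQ file next to it.
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue      # skip when the glob matched nothing
        samtools fastq "$bam" > "${bam%.bam}.fastq"
    done
}

# Usage (hypothetical directory, not executed here):
#   bam_dir_to_fastq demuxed
```

The BAM files stay in place, so the extra information they carry remains available if a later step needs it.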

Regarding your data quality/prep issues, I suggest asking for advice on the Nanopore community - there are many experts on this there.