nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Sample sheet error #782

Closed mbacino closed 1 month ago

mbacino commented 2 months ago

Issue Report

Please describe the issue:

I can run the dorado basecaller without a sample sheet without error, but when I add a sample sheet I get an error and basecalling cannot be performed. Here are a couple of examples of sample sheets I have tried.

barcodes 01-44
24_04_30_samplesheet.csv

flow_cell_id,kit ,sample_id,experiment_id,barcode
1,SQK-RBK114-96,mgolf,hit,barcode01
1,SQK-RBK114-96,mgolf,hit,barcode02
1,SQK-RBK114-96,mgolf,hit,barcode03
1,SQK-RBK114-96,mgolf,hit,barcode04
1,SQK-RBK114-96,mgolf,hit,barcode05
1,SQK-RBK114-96,mgolf,hit,barcode06

24_04_29_samplesheet.csv

experiment_id,kit ,barcode
hit,SQK-RBK114-96,barcode01
hit,SQK-RBK114-96,barcode02
hit,SQK-RBK114-96,barcode03
hit,SQK-RBK114-96,barcode04
hit,SQK-RBK114-96,barcode05
hit,SQK-RBK114-96,barcode06

Steps to reproduce the issue:

I have followed the sample sheet guides on GitHub and uploaded a sample sheet to MinKNOW to make sure it was compatible.

Run environment:

Logs

QUEUE: gpu.q
SGE_GPU: 1
Job started at: 2024-04-30 11:15:42-07:00
[2024-04-30 11:15:43.898] [info] Running: "basecaller" "dna_r10.4.1_e8.2_400bps_sup@v4.3.0" "/wynton/home/lynchlab/ms-bacino/24_04_08_Fosmid_hits_EDY/input_files" "--kit-name" "SQK-RBK114-96" "--sample-sheet" "/wynton/home/lynchlab/ms-bacino/24_04_08_Fosmid_hits_EDY$
[2024-04-30 11:15:43.923] [info] > Creating basecall pipeline
[2024-04-30 11:16:11.721] [info] cuda:0 using chunk size 9996, batch size 512
[2024-04-30 11:16:12.798] [info] cuda:0 using chunk size 4998, batch size 1664
[2024-04-30 11:16:14.990] [error] Sample sheet /wynton/home/lynchlab/ms-bacino/24_04_08_Fosmid_hits_EDY/input_files/24_04_29_samplesheet.csv contains invalid column flow_cell_id
Job ended at: 2024-04-30 11:16:15-07:00

mkdir: cannot create directory '/tmp/lock-gpu0': File exists
mkdir: cannot create directory '/tmp/lock-gpu3': File exists
QUEUE: gpu.q
SGE_GPU: 1
Job started at: 2024-04-30 11:45:40-07:00
[2024-04-30 11:45:41.712] [info] Running: "basecaller" "dna_r10.4.1_e8.2_400bps_sup@v4.3.0" "/wynton/home/lynchlab/ms-bacino/24_04_08_Fosmid_hits_EDY/input_files" "--kit-name" "SQK-RBK114-96" "--sample-sheet" "/wynton/home/lynchlab/ms-bacino/24_04_08_Fosmid_hits_EDY/i$
[2024-04-30 11:45:41.756] [info] > Creating basecall pipeline
[2024-04-30 11:47:04.666] [info] cuda:0 using chunk size 9996, batch size 640
[2024-04-30 11:47:05.656] [info] cuda:0 using chunk size 4998, batch size 640
[2024-04-30 11:47:06.515] [error] Sample sheet /wynton/home/lynchlab/ms-bacino/24_04_08_Fosmid_hits_EDY/input_files/24_04_29_samplesheet.csv contains invalid column kit
Job ended at: 2024-04-30 11:47:07-07:00

malton-ont commented 2 months ago

Hi @mbacino,

It looks like you have whitespace in the column names, which we're failing to trim. You are also missing the alias column, which is currently required if a barcode column is present. Aliases must not be an existing barcode identifier.

We will attempt to make this more robust in a future release. In the meantime you can fix your formatting to meet the existing requirements like so:

flow_cell_id,kit,sample_id,experiment_id,barcode,alias
1,SQK-RBK114-96,mgolf,hit,barcode01,patient01
1,SQK-RBK114-96,mgolf,hit,barcode02,patient02
1,SQK-RBK114-96,mgolf,hit,barcode03,patient03
1,SQK-RBK114-96,mgolf,hit,barcode04,patient04
1,SQK-RBK114-96,mgolf,hit,barcode05,patient05
1,SQK-RBK114-96,mgolf,hit,barcode06,patient06
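
The requirements described above (no stray whitespace in column names, an alias column whenever a barcode column is present, and aliases that are not barcode identifiers) can be checked before submitting a job. A minimal Python sketch follows. `check_sample_sheet` and the `BARCODE_ID` pattern are hypothetical names for illustration, not part of dorado, and the checks only mirror what this thread states rather than dorado's actual parser.

```python
import csv
import io
import re

# Assumption: barcode identifiers look like "barcode01". This pattern is
# illustrative only and is not taken from dorado's source.
BARCODE_ID = re.compile(r"^barcode\d+$")


def check_sample_sheet(text):
    """Return a list of human-readable problems found in sample sheet text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["sample sheet is empty"]

    problems = []
    header = rows[0]

    # Column names must match exactly, so padding such as "kit " is an error.
    for name in header:
        if name != name.strip():
            problems.append(
                f"column name {name!r} has leading or trailing whitespace"
            )

    names = [name.strip() for name in header]

    # An alias column is required whenever a barcode column is present.
    if "barcode" in names and "alias" not in names:
        problems.append("barcode column present but alias column missing")

    # An alias must not itself be a barcode identifier.
    if "alias" in names:
        idx = names.index("alias")
        for lineno, row in enumerate(rows[1:], start=2):
            if idx < len(row) and BARCODE_ID.match(row[idx].strip()):
                problems.append(
                    f"line {lineno}: alias {row[idx]!r} looks like a barcode identifier"
                )
    return problems
```

Run against the 24_04_29 sheet from the report, this sketch flags the padded `kit ` column and the missing alias column. It returns an empty list for the corrected layout shown above.
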

mbacino commented 2 months ago

Running the sample sheet below, I got this error:

[2024-05-01 08:11:52.305] [error] Sample sheet /wynton/home/lynchlab/ms-bacino/24_04_08_Fosmid_hits_EDY/input_files/corrected_24_04_30_ss.csv contains invalid column flow_cell_id

flow_cell_id,kit,sample_id,experiment_id,barcode,alias
FAY52961,SQK-RBK114-96,mgolf,hit,barcode01,A
FAY52961,SQK-RBK114-96,mgolf,hit,barcode02,B
FAY52961,SQK-RBK114-96,mgolf,hit,barcode03,C
FAY52961,SQK-RBK114-96,mgolf,hit,barcode04,D
FAY52961,SQK-RBK114-96,mgolf,hit,barcode05,E
FAY52961,SQK-RBK114-96,mgolf,hit,barcode06,F

malton-ont commented 2 months ago

Hi @mbacino,

That exact file works for me:

$ cat samplesheet.csv 
flow_cell_id,kit,sample_id,experiment_id,barcode,alias
FAY52961,SQK-RBK114-96,mgolf,hit,barcode01,A
FAY52961,SQK-RBK114-96,mgolf,hit,barcode02,B
FAY52961,SQK-RBK114-96,mgolf,hit,barcode03,C
FAY52961,SQK-RBK114-96,mgolf,hit,barcode04,D
FAY52961,SQK-RBK114-96,mgolf,hit,barcode05,E
FAY52961,SQK-RBK114-96,mgolf,hit,barcode06,F
$ ./dorado basecaller hac tests/data/pod5/dna_r10.4.1_e8.2_400bps_5khz/ --sample-sheet samplesheet.csv -x cuda:0 -b 384 > calls.bam
[2024-05-02 08:49:45.004] [info] Running: "basecaller" "hac" "tests/data/pod5/dna_r10.4.1_e8.2_400bps_5khz/" "--sample-sheet" "samplesheet.csv" "-x" "cuda:0" "-b" "384"
[2024-05-02 08:49:45.048] [info]  - downloading dna_r10.4.1_e8.2_400bps_hac@v4.3.0 with httplib
[2024-05-02 08:49:45.652] [info] > Creating basecall pipeline
[2024-05-02 08:49:49.692] [info] cuda:0 using chunk size 9996, batch size 384
[2024-05-02 08:49:59.974] [info] cuda:0 using chunk size 4998, batch size 384
[2024-05-02 08:50:16.762] [info] > Simplex reads basecalled: 3
[2024-05-02 08:50:16.762] [info] > Basecalled @ Samples/s: 5.725255e+02
[2024-05-02 08:50:16.772] [info] > Finished

All I can think is that you have some non-printing character in the header that isn't showing when you paste it into GitHub. What program are you using to generate the sample sheet? Is it adding a BOM for UTF-8 encoding perhaps? Please ensure you save the file without a BOM, or try:

sed -i '1s/^\xEF\xBB\xBF//' samplesheet.txt

to remove it.
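
Where `sed -i` is unavailable or behaves differently, the same check can be done in Python. This is a minimal sketch, not anything dorado provides: `strip_utf8_bom` and `fix_file` are hypothetical helper names, and the only assumption is that the unwanted bytes are the UTF-8 BOM (`EF BB BF`) at the very start of the file.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def fix_file(path):
    """Rewrite the file at path without a BOM; return True if one was removed."""
    with open(path, "rb") as handle:
        data = handle.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False  # nothing to do, leave the file untouched
    with open(path, "wb") as handle:
        handle.write(cleaned)
    return True
```

Calling `fix_file("samplesheet.csv")` returns `True` when a BOM was present and has been removed. A `True` result would explain why the first column name was rejected even though it looked correct.
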

mbacino commented 2 months ago

I generated the sample sheet using Excel, and it was adding a BOM for UTF-8. The solution that worked was creating the CSV on the HPC cluster rather than importing it: I opened the old CSV on my desktop, then copied and pasted its contents into the new CSV on the cluster.
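
Another way to avoid the Excel round trip is to generate the sheet with a script directly on the cluster. The sketch below assumes the six-column layout used in this thread. `build_sample_sheet` and `write_sample_sheet` are hypothetical helpers, and the plain `\n` line endings are a choice made here, not a documented dorado requirement.

```python
import csv
import io

HEADER = ["flow_cell_id", "kit", "sample_id", "experiment_id", "barcode", "alias"]


def build_sample_sheet(flow_cell_id, kit, sample_id, experiment_id, aliases):
    """Return CSV text with one row per alias, numbered barcode01, barcode02, ..."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for number, alias in enumerate(aliases, start=1):
        writer.writerow(
            [flow_cell_id, kit, sample_id, experiment_id, f"barcode{number:02d}", alias]
        )
    return buffer.getvalue()


def write_sample_sheet(path, text):
    """Write text as UTF-8 without a BOM ("utf-8", not "utf-8-sig")."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
```

For example, `build_sample_sheet("FAY52961", "SQK-RBK114-96", "mgolf", "hit", ["A", "B", "C", "D", "E", "F"])` reproduces the six-row sheet posted earlier in the thread. `write_sample_sheet` then saves it without the BOM that Excel added.
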