Remora Dataset Prepare Ends Prematurely

andrewgalbraith21 commented 1 week ago

Hello,

While running remora dataset prepare, I noticed that the number of chunks was much smaller than expected. I looked at the output file and realized that the extraction appeared to stop after processing only about one quarter of the reads:

remora dataset prepare --overwrite $pod5_path $bam_path --output-path $train_output \
  --motif G 0 --max-chunks-per-read 1 \
  --mod-base-control --num-extract-alignment-workers 24 --num-extract-chunks-workers 24 \
  --chunk-context 100 100 --kmer-context-bases 4 4 --focus-reference-positions $extract_bed

Indexing BAM by parent read id: 10173281 Reads [03:28, 48849.08 Reads/s]
[14:57:12.811] Extracting read IDs from POD5
[14:59:00.470] Found 7,579,029 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[14:59:01.071] Making reference-anchored training data
[14:59:01.071] Opening dataset for output
[14:59:01.135] Processing reads
Extracting chunks: 23%|██▎ | 1741360/7578825 [58:55<3:17:32, 492.52 Reads/s]
Stops at 23%?
[15:57:59.923] Unsuccessful read/chunk reasons:
181,152 : Sequence too long
[15:57:59.951] Extracted 921,513 chunks from 7,579,029 reads.
[15:58:00.011] Label distribution: control:921,513
[15:58:00.012] Shuffling dataset
[15:58:19.590] Done

Could you please let me know why the extraction might have stopped early? I have not had this problem on any other run of remora dataset prepare.

marcus1487 commented 6 days ago

Could you look through the log (in the output directory) and report whether it contains any errors? Alternatively, if you'd like to post the log, I'd be happy to take a look.
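
For anyone following along, a quick way to check a large log for problems is to scan it for common failure keywords. The snippet below is only a minimal sketch: the keyword list is an assumption (it is not taken from Remora), and the log path in the usage note is hypothetical, so substitute the actual log file found in the output directory.

```python
import re

# Keywords that commonly mark a problem in a plain-text log.
# This list is an assumption for illustration, not something Remora defines.
PROBLEM_PATTERN = re.compile(r"error|exception|traceback|killed", re.IGNORECASE)


def find_problem_lines(log_path):
    """Return (line_number, text) pairs for log lines matching PROBLEM_PATTERN."""
    hits = []
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if PROBLEM_PATTERN.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling find_problem_lines("train_output/log.txt") (hypothetical path) returns an empty list when none of the keywords appear. An empty result does not rule out a silently terminated worker, so it is still worth reading the last lines of the log by hand.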

andrewgalbraith21 commented 3 days ago

Hello @marcus1487, thank you for the quick reply. I looked through the log file but did not see any errors in it. Note that I have fixed the issue by using a single POD5 file as input instead of a folder of POD5 files.
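
The thread does not say how the single POD5 file was produced. One possible route, assuming the pod5 command-line tools are installed, is to merge the folder with pod5 merge. The helper below is a hedged sketch: build_merge_command and the paths are hypothetical names introduced for illustration, and the function only assembles the command so it can be inspected before running.

```python
from pathlib import Path


def build_merge_command(pod5_dir, merged_path):
    """Build (but do not run) a `pod5 merge` command for every .pod5 file in pod5_dir."""
    # Sort the inputs so the resulting command is reproducible.
    inputs = sorted(str(p) for p in Path(pod5_dir).glob("*.pod5"))
    if not inputs:
        raise FileNotFoundError(f"no .pod5 files found in {pod5_dir}")
    # Assumes the pod5 CLI is on PATH; the merged file is written to merged_path.
    return ["pod5", "merge", *inputs, "--output", str(merged_path)]
```

The returned list can be passed to subprocess.run(..., check=True), after which the merged file can be given to remora dataset prepare in place of the folder. Whether this matches what was actually done in this case is not confirmed by the thread.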

nanoporetech/remora issue #184