Closed: andrewgalbraith21 closed this issue 3 days ago
Could you look through the log (in the output directory) and report if there are any errors reported there? Alternatively if you'd like to post the log I'd be happy to take a look.
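A quick way to scan a log for problems is sketched below. This is only an illustration: `scan_log` is a hypothetical helper, the log path is whatever file sits in your output directory (the exact filename is not stated in this thread), and the keyword list is a guess at what a failure might look like rather than a list of messages Remora is known to emit.

```shell
#!/bin/sh
# Hypothetical helper: print every line of a log that mentions a likely failure.
# The keywords below are assumptions, not messages taken from Remora itself.
scan_log() {
    log_file="$1"
    # -i: ignore case, -n: prefix each match with its line number, -E: extended regex
    grep -inE 'error|exception|traceback|killed' "$log_file"
}
```

Calling `scan_log path/to/logfile` prints matching lines with their line numbers and returns a non-zero status when nothing matches.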
Hello @marcus1487, thank you for the quick reply. I looked through the log file but did not see any errors in it. Note that I have fixed the issue by using a single pod5 file as input instead of a folder of pod5 files.
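For anyone whose reads are spread over several files, one possible way to get a single pod5 is to merge them first. The sketch below assumes the `pod5` command-line tool is installed and that its `merge` subcommand with `--output` is available in your version; `merge_pod5_dir` is a hypothetical wrapper, and this does not explain why the folder input stopped early.

```shell
#!/bin/sh
# Hypothetical wrapper: merge every .pod5 file in a directory into one file.
# Assumes the `pod5` CLI is installed and provides `pod5 merge ... --output`.
merge_pod5_dir() {
    pod5_dir="$1"
    merged="$2"
    # Replace the positional parameters with the list of matching files.
    set -- "$pod5_dir"/*.pod5
    # An unmatched glob stays literal, so check that the first entry exists.
    if [ ! -e "$1" ]; then
        echo "no .pod5 files found in $pod5_dir" >&2
        return 1
    fi
    pod5 merge "$@" --output "$merged"
}
```

For example, `merge_pod5_dir "$pod5_path" merged.pod5` would write `merged.pod5`, which can then be passed to remora dataset prepare in place of the folder.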
Hello,
I was running remora dataset prepare and noticed that my number of chunks was much smaller than expected. I looked at the output file and realized that the extraction procedure appeared to stop after only going through about one quarter of the reads:
remora dataset prepare --overwrite $pod5_path $bam_path --output-path $train_output \
  --motif G 0 --max-chunks-per-read 1 \
  --mod-base-control --num-extract-alignment-workers 24 --num-extract-chunks-workers 24 \
  --chunk-context 100 100 --kmer-context-bases 4 4 --focus-reference-positions $extract_bed
Indexing BAM by parent read id: 10173281 Reads [03:28, 48849.08 Reads/s]
[14:57:12.811] Extracting read IDs from POD5
[14:59:00.470] Found 7,579,029 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[14:59:01.071] Making reference-anchored training data
[14:59:01.071] Opening dataset for output
[14:59:01.135] Processing reads
Extracting chunks: 23%|██▎ | 1741360/7578825 [58:55<3:17:32, 492.52 Reads/s]
Stops at 23%?
[15:57:59.923] Unsuccessful read/chunk reasons:
181,152 : Sequence too long
[15:57:59.951] Extracted 921,513 chunks from 7,579,029 reads.
[15:58:00.011] Label distribution: control:921,513
[15:58:00.012] Shuffling dataset
[15:58:19.590] Done
Could you please let me know why the extraction process might have stopped early? I have not had this problem any other time I have run remora dataset prepare.