hanyue36 / nanoplexer

Tool for demultiplexing Nanopore barcode sequence data
MIT License

batch size option #11

Open anagtz opened 1 year ago

anagtz commented 1 year ago

Hi, could you please tell me a bit more about the batch option? All my reads are merged into one huge file (the result of combined simplex + duplex basecalling), and nanoplexer crashes with a segmentation fault when I launch it. However, if I run it on a subset of the file with 4,000 reads, the program works, so I'm wondering whether I need to increase the batch size option, split my merged reads, or do something else. Thanks, Ana

anagtz commented 1 year ago

PS. I increased the batch size to 250,000,000 and I get a "munmap_chunk(): invalid pointer" error. I have 392 G of RAM.

hanyue36 commented 1 year ago

> PS. I increased the batch size to 250,000,000 and I get a "munmap_chunk(): invalid pointer" error. I have 392 G of RAM.

If you want to change the batch size, use "250M" or "250m".
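For illustration only, here is a minimal Python sketch of how a suffixed value such as "250M" is conventionally expanded into a plain count. How nanoplexer itself parses this option is not shown in this thread, so the function name `expand_size` and the decimal multiplier (`unit=1000`) are assumptions, not the tool's actual behaviour.

```python
def expand_size(text, unit=1000):
    """Expand a value such as '250M' or '250m' into an integer.

    `unit` is an assumption: 1000 gives decimal multipliers (M = 10**6);
    pass 1024 if a tool is known to use binary multipliers instead.
    """
    multipliers = {"k": unit, "m": unit ** 2, "g": unit ** 3}
    text = text.strip()
    suffix = text[-1:].lower()
    if suffix in multipliers:
        # Everything before the suffix is the numeric part.
        return int(float(text[:-1]) * multipliers[suffix])
    # No recognised suffix: the string must already be a plain integer.
    return int(text)
```

Under these assumptions, `expand_size("250M")` and `expand_size("250m")` give the same count, while a string with thousands separators such as "250,000,000" is rejected with a `ValueError` rather than silently misread.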

anagtz commented 1 year ago

Hi Yue,

Thank you for your answer. What actually worked was to split the big file into smaller files of 4,000 reads each. However, all my files are corrupted in the same way as described above. I've tried this with all my runs sequenced as pod5 and it happens every time; however, when the runs are sequenced as fast5, it doesn't happen. Do you have any idea why this could be happening? I would appreciate any feedback.

Thanks, Ana
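For anyone wanting to reproduce the splitting step, a minimal Python sketch follows. It assumes an uncompressed FASTQ file with exactly four lines per record; the function name `split_fastq` and the `chunk_00000.fastq` output naming are made up for this example and are not part of nanoplexer.

```python
from itertools import islice
from pathlib import Path


def split_fastq(path, out_dir, reads_per_file=4000):
    """Write consecutive chunks of `reads_per_file` records; return the chunk paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # newline="" keeps the original line endings untouched on read and write.
    with open(path, "r", newline="") as src:
        while True:
            # Four lines per record: header, sequence, '+', quality.
            chunk = list(islice(src, 4 * reads_per_file))
            if not chunk:
                break
            if len(chunk) % 4:
                raise ValueError("input ends in the middle of a FASTQ record")
            out_path = out_dir / f"chunk_{len(written):05d}.fastq"
            with open(out_path, "w", newline="") as dst:
                dst.writelines(chunk)
            written.append(out_path)
    return written
```

Each chunk can then be passed to the demultiplexer separately. Reading with `islice` keeps only one chunk in memory at a time, which matters for a very large merged file.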

anagtz commented 1 year ago

I'm sorry, I just realized I didn't describe how the files get corrupted...

When I split the huge file into small ones, each with 4,000 reads (trying to resemble what Nanopore does), the headers in all the demultiplexed files are messed up:

@3a348467-6b57-43ac-beb7-ef3dbacc433e;d668b627-2f3b-4bc6-ad68-ac159bf3df26 (null)
CTGGTTACCTTGTT[.......]CGGCTGGCACTAGTACGTGAAG
+
4DCGL{GD{C{{DBI99[.......]ADCFD@CC9:;GHEEBDC?<@7b9cd490-f41d-4ed3-8cb1-610f08acd19f;623d085a-2c51-4f4d-b4f8-8c1945f4295a (null)
TTCACGTACTAGTGC[.......]GGGATTAAG
+

I get a "(null)" after each header, and from the 2nd read on, the headers are merged with the quality line of the previous read. Even worse, the header of each read trims 4 characters from the previous quality line.
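The symptoms above can be detected automatically. The following is a minimal Python sketch of a structural check on demultiplexed output, assuming strict four-line FASTQ records; `check_fastq` is a hypothetical helper written for this example, not part of nanoplexer.

```python
def check_fastq(lines):
    """Return (record_number, problem) tuples for records that break the 4-line layout."""
    rows = [line.rstrip("\n") for line in lines]
    problems = []
    if len(rows) % 4:
        problems.append((len(rows) // 4 + 1, "incomplete record at end of file"))
    # Only walk over complete groups of four lines.
    for i in range(0, len(rows) - len(rows) % 4, 4):
        header, seq, sep, qual = rows[i:i + 4]
        n = i // 4 + 1
        if not header.startswith("@"):
            problems.append((n, "header does not start with '@'"))
        if "(null)" in header:
            problems.append((n, "header contains '(null)'"))
        if not sep.startswith("+"):
            problems.append((n, "third line does not start with '+'"))
        if len(seq) != len(qual):
            problems.append((n, f"sequence length {len(seq)} != quality length {len(qual)}"))
    return problems
```

Calling it as `check_fastq(open("barcode01.fastq"))` (the file name is a placeholder) returns an empty list for a well-formed file. On output like the one shown above, the first record would be flagged both for the "(null)" in its header and for a quality line that is longer than its sequence.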

Please help me understand what's going on here. I checked my original file and the split ones, and the newline character is a normal "\n". I did some tests, and this only happens when the sequencing was done with pod5 and basecalled with dorado. If sequenced with fast5 and basecalled with guppy, this doesn't happen.
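To compare the two kinds of input more systematically, a small Python sketch like the one below could summarise header layout and line endings for each file. It assumes strict four-line FASTQ, and `summarize_headers` is a hypothetical helper. Whether read-name length, the ";" in the read names shown above, or the absence of a header comment has anything to do with the bug is not established in this thread; the sketch only makes those differences visible.

```python
from collections import Counter


def summarize_headers(fastq_bytes):
    """Summarise header layout and line endings of FASTQ content given as bytes."""
    lines = fastq_bytes.splitlines(keepends=True)
    headers = lines[0::4]  # every fourth line is a header in strict 4-line FASTQ
    stats = Counter()
    longest_name = 0
    for header in headers:
        if header.endswith(b"\r\n"):
            stats["crlf_endings"] += 1
        elif header.endswith(b"\n"):
            stats["lf_endings"] += 1
        else:
            stats["no_line_ending"] += 1
        # The read name comes first; anything after whitespace is a comment.
        fields = header.strip().split(None, 1)
        name = fields[0][1:] if fields else b""
        longest_name = max(longest_name, len(name))
        if len(fields) > 1:
            stats["with_comment"] += 1
        if b";" in name:
            stats["name_contains_semicolon"] += 1
    stats["records"] = len(headers)
    stats["longest_name"] = longest_name
    return dict(stats)
```

Running it on a few thousand reads from a dorado-basecalled file and from a guppy-basecalled file, for example with `summarize_headers(open(path, "rb").read())`, gives two dictionaries that can be compared side by side.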