Open Psy-Fer opened 6 years ago
Yes, this one could be solved either by specifying barcodes (https://github.com/rrwick/Porechop/issues/42) or by randomly subsampling the input reads.
In the meantime, don't forget that if you give Porechop a directory as input, it will look for all read files in that directory, and then it samples from each of them to avoid this issue. And as a bonus, if the directory looks like an Albacore directory with demultiplexing, Porechop will note the Albacore barcode and put reads in the 'none' bin if it and Albacore disagree. I find this useful for reducing mis-binned reads.
Ryan
Ahh, thanks for that. I was trying to compare the two demultiplexing algorithms independently, without either one being aware of the other's calls. So this is probably a low-priority fix :) My shuffle script works around the issue for now.
Cheers
When running Porechop, I got unexpected output from the step that identifies barcodes from a 10k-read sample.
It seems to take the first 10k reads. However, I had concatenated the Albacore output files after demultiplexing, so the reads were ordered barcode01->barcode12->unclassified, and the first 10k reads were all barcode01.
I wrote a quick script (Python 2.7-ish), shuffle_fastq.py, to shuffle a FASTQ file. See here: https://github.com/Psy-Fer/bioinf_tools
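For anyone who wants the general idea without the linked script, here is a minimal Python 3 sketch of an in-memory FASTQ shuffle. It is not the actual shuffle_fastq.py. The function names `read_fastq_records` and `shuffle_fastq` are made up for illustration, and it assumes plain four-line FASTQ records (no wrapped sequence lines) that fit in memory.

```python
import random


def read_fastq_records(lines):
    """Group FASTQ lines into 4-line records (header, sequence, '+', quality)."""
    records = []
    record = []
    for line in lines:
        # Skip blank lines between records (e.g. a trailing newline at EOF).
        if not record and not line.strip():
            continue
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            records.append(tuple(record))
            record = []
    if record:
        raise ValueError("FASTQ input ended in the middle of a record")
    return records


def shuffle_fastq(lines, seed=None):
    """Return all records in random order; pass a seed for a repeatable shuffle."""
    records = read_fastq_records(lines)
    random.Random(seed).shuffle(records)
    return records


def write_fastq(records, handle):
    """Write records back out as four lines each."""
    for record in records:
        handle.write("\n".join(record) + "\n")
```

Usage would be something like opening the concatenated file, calling `shuffle_fastq(infile)`, and passing the result to `write_fastq` with an output handle. For very large runs, holding every read in memory may not be practical.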
When I ran Porechop on the shuffled file, it detected all the correct barcodes (better than Albacore, I might add) and seems to be running smoothly.
A feature request would be to modify the barcode detection function to randomly sample the ingested FASTQ. Otherwise, a note in the docs would do :)
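To illustrate the kind of random sampling I mean, here is a rough Python 3 sketch using reservoir sampling, which picks a uniform sample in one pass without loading the whole file. This is not Porechop's code. The names `iter_fastq` and `sample_reads` are hypothetical, the default of 10000 just mirrors the 10k reads mentioned above, and it assumes four-line FASTQ records.

```python
import random
from itertools import islice


def iter_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ line iterator."""
    lines = iter(handle)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(lines, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("FASTQ input ended in the middle of a record")
        yield record


def sample_reads(records, k=10000, seed=None):
    """Return up to k records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir
```

With something like `sample_reads(iter_fastq(open_file))`, reads from every barcode have the same chance of ending up in the sample, no matter how the file is ordered.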
cheers.