barcode identification requires randomised fastq

Psy-Fer commented 6 years ago

When running porechop, I came across unexpected output when identifying the barcodes on 10k sample reads.

It seems it takes the first 10k reads. However, I had concatenated the albacore output after demultiplexing, so the reads were ordered barcode01->barcode12->unclassified, and the first 10k reads were all barcode01.

I wrote a quick script to shuffle a fastq file (python 2.7ish), shuffle_fastq.py, see here: https://github.com/Psy-Fer/bioinf_tools
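For anyone who just wants the idea, here is a minimal sketch of shuffling a fastq in memory. It is not the shuffle_fastq.py from the repo above, just an illustration. It assumes plain 4-line records (no wrapped sequence lines) and that the whole file fits in memory; the function names are made up for this example.

```python
import random


def read_fastq_records(lines):
    """Group an iterable of lines into 4-line FASTQ records.

    Assumes each record is exactly 4 lines (header, sequence, '+', quality).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def shuffle_fastq(lines, seed=None):
    """Return all records in random order (loads everything into memory)."""
    records = list(read_fastq_records(lines))
    random.Random(seed).shuffle(records)
    return records


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python shuffle_sketch.py < in.fastq > out.fastq
    for rec in shuffle_fastq(sys.stdin):
        print("\n".join(rec))
```

Passing a seed makes the order reproducible, which helps when comparing runs.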

When I ran porechop on this new shuffled file, it detected all the correct barcodes (better than albacore, I might add) and seems to be running smoothly.

A feature request would be to modify the barcode detection function to randomly sample the ingested fastq. Otherwise, a note in the docs would do :)
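One way the random sampling could work without loading or shuffling the whole file is reservoir sampling. This is only a sketch of the technique, not Porechop's code; reservoir_sample and its parameters are hypothetical names for this example.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Pick k records uniformly at random from an iterable of unknown length.

    Single pass, O(k) memory (Algorithm R). If there are fewer than k
    records, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample
```

With k=10000 this would give a sample spread across the whole file instead of the first 10k reads, so a file sorted by barcode would no longer be a problem.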

cheers.

rrwick commented 6 years ago

Yes, this one could be solved either by specifying barcodes (https://github.com/rrwick/Porechop/issues/42) or by randomly subsampling the input reads.

In the meantime, don't forget that if you give Porechop a directory as input, it will look for all read files in that directory, and then it samples from each of them to avoid this issue. And as a bonus, if the directory looks like an Albacore directory with demultiplexing, Porechop will note the Albacore barcode and put reads in the 'none' bin if it and Albacore disagree. I find this useful for reducing mis-binned reads.
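To illustrate the disagreement rule described above, here is a rough sketch of the logic. It is not taken from Porechop's source; consensus_bin is a hypothetical name, and treating any mismatch (including an unclassified call from one tool) as a disagreement is an assumption.

```python
def consensus_bin(porechop_call, albacore_call):
    """Return the barcode bin only when both tools agree, else 'none'.

    Assumption: any mismatch counts as a disagreement, including the case
    where only one of the two tools calls a barcode.
    """
    if porechop_call == albacore_call:
        return porechop_call
    return "none"


if __name__ == "__main__":
    print(consensus_bin("BC01", "BC01"))
    print(consensus_bin("BC01", "BC07"))
```

The trade-off is fewer mis-binned reads at the cost of more reads ending up in the 'none' bin.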

Ryan

Psy-Fer commented 6 years ago

Ahh, thanks for that. I was trying to do some comparisons between the algorithms without them being aware of each other's calls. So probably a low priority fix :) My shuffle script fixes the issue for now.

Cheers
