rrwick / Porechop

adapter trimmer for Oxford Nanopore reads
GNU General Public License v3.0

Add PCR 96 Barcode sequence #7

Closed donutbrew closed 6 years ago

donutbrew commented 7 years ago

Can you add the PCR96 barcodes (EXP-PBC001) to the adapters list?

https://community.nanoporetech.com/protocols/pcr-96-barcoding-amplicons-100/v/pbae96_9016_v108_revm_18oc/introduction-to-the-pcr-96

rrwick commented 7 years ago

Yes, I've been thinking about how to best tackle this one. The PCR barcodes seem to be the same as the native barcodes, or at least a subsequence of them.

Here's NB01: GGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGTTAACCT

And here's BC01: AAGAAAGTTGTCGGTGTCTTTGTG (from the link you gave)

They're the same except for an additional 7 bases at each end of NB01 (GGTGCTG at the start and TTAACCT at the end). Though I wonder if the actual PCR barcodes also have those 7 bases and they just aren't in that table.
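The comparison above can be checked with a few lines of Python. This is only a minimal sketch: the two sequences are the ones quoted in this comment, and the variable names are illustrative rather than anything from Porechop.

```python
# Sequences as quoted in the comment above.
NB01 = "GGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGTTAACCT"
BC01 = "AAGAAAGTTGTCGGTGTCTTTGTG"

# Locate the PCR barcode inside the native barcode (-1 would mean "not found").
start = NB01.find(BC01)

# Whatever NB01 has on either side of the match is the extra flanking sequence.
left_flank = NB01[:start]
right_flank = NB01[start + len(BC01):]

print(f"BC01 starts at offset {start} in NB01")
print(f"left flank:  {left_flank} ({len(left_flank)} bases)")
print(f"right flank: {right_flank} ({len(right_flank)} bases)")
```

Because BC01 sits entirely inside NB01, any read that matches NB01 well will also match BC01, which is the ambiguity described in the next paragraph.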

If the PCR barcodes really are identical, then it's just a matter of adding 84 more of them. If they are a bit shorter, then that's trickier. Porechop automatically finds which barcodes are present, so any sample with NB01 would register for BC01 as well. I could try adding an option (something like --pcr_barcodes) which disables native barcodes and adds PCR barcodes instead.

Anyway, before I implement anything, I really need some real PCR barcoded reads to play with. Do you know of publicly available ones or do you have a handful you could share?

Ryan

rrwick commented 7 years ago

As I discussed in issue #9, I think I may have this barcode stuff figured out now. But I only have older PCR barcode reads I downloaded from ENA, so I'd be grateful if you could give it a try on your reads.

Grab a fresh version of Porechop from GitHub and give it a try!

donutbrew commented 7 years ago

Hey Ryan, I finally got a chance to try re-demultiplexing some datasets with Porechop. Overall, pretty dang good. It found my PCR barcodes automatically just fine.

I am comparing performance strictly against ONT's split_barcodes.pl script, which I have tweaked slightly for performance. What I found was that default Porechop demultiplexing returns about 10% fewer reads per barcode than the ONT script run with a Levenshtein distance of 5, although we have been using a distance of 6 for most applications. Porechop bins about 50% of the total number of reads it outputs, which is comparable.
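For readers unfamiliar with the distance cutoff mentioned above, here is a rough sketch of barcode binning by Levenshtein distance. It is not taken from split_barcodes.pl or from Porechop. The helper names, the already-extracted candidate region, and the "TOY02" sequence are all assumptions made up for illustration; only BC01 comes from this thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def assign_barcode(candidate, barcodes, max_distance):
    """Return the closest barcode name, or None if too distant or tied."""
    scored = sorted((levenshtein(candidate, seq), name)
                    for name, seq in barcodes.items())
    best_distance, best_name = scored[0]
    if best_distance > max_distance:
        return None  # nothing within the cutoff: read stays unbinned
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # two barcodes equally close: ambiguous
    return best_name


barcodes = {
    "BC01": "AAGAAAGTTGTCGGTGTCTTTGTG",   # from this thread
    "TOY02": "ACGTACGTACGTACGTACGTACGT",  # made-up placeholder, not a real barcode
}

# BC01 with its first and last base substituted (distance 2 from BC01).
noisy = "CAGAAAGTTGTCGGTGTCTTTGTT"
print(assign_barcode(noisy, barcodes, max_distance=5))
```

Raising the cutoff from 5 to 6 lets more reads through at the cost of more potential misassignments, which is one reason read counts per barcode differ between tools and settings.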

One issue I noticed is that Porechop returns only about 75% of the reads in the input file (an unfiltered FASTQ generated from all reads). Where do these discarded reads go?

rrwick commented 7 years ago

Regarding the missing reads, one possibility is that Porechop turns on the --discard_middle option whenever it's demultiplexing barcodes. That option throws out reads with adapters in their middle. My thinking was this: since a chimeric read could be two or more separate reads, it's a lot easier to just toss it out than to split it up and then determine a barcode bin for each constituent part.

However, I'd only expect that to throw out a small percentage of the reads, not 25%. If you total up all the reads in the FASTQ files plus the number of discarded reads (should be towards the end of Porechop's output), does that account for everything? If Porechop really is labelling 25% of your reads as containing middle adapters, then something might be a bit screwy...
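One way to do that tally is sketched below. It assumes strict 4-line FASTQ records and that the demultiplexed files sit in one output directory with names containing ".fastq"; the function names are hypothetical, and the discarded count has to be copied by hand from Porechop's output.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def unaccounted_reads(input_fastq, output_dir, discarded):
    """Input reads minus binned reads minus reads reported as discarded.

    `discarded` is the number read off Porechop's output by hand.
    A result near zero means every input read is accounted for.
    """
    n_in = count_fastq_reads(input_fastq)
    n_out = sum(count_fastq_reads(p)
                for p in sorted(Path(output_dir).glob("*.fastq*")))
    return n_in - n_out - discarded


# Example call (paths and the discarded count are placeholders):
# print(unaccounted_reads("all_reads.fastq", "porechop_out", discarded=1234))
```

Any remainder would then be reads that were trimmed down to nothing or otherwise dropped without being counted as discarded.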

Also, Porechop won't output reads that have nothing left. That is, if you have a tiny read that's nothing but adapters and they are all trimmed off (leaving 0 bp), then it's not included in the results. But again, I wouldn't expect those to account for much.

Ryan

rrwick commented 6 years ago

This is an old issue, so I'm going to close it now. But please let me know if you're still experiencing unresolved issues!

Ryan