rrwick / Porechop

adapter trimmer for Oxford Nanopore reads
GNU General Public License v3.0
323 stars 124 forks source link

Custom barcodes not binned correctly #38

Closed squainoo closed 6 years ago

squainoo commented 6 years ago

Hi,

I am trying to demultiplex sequence reads that where pooled with custom barcodes. I followed your instructions and added my barcode sequences in the adapters.py file and removed the previous barcode sequences. Below is an example of how I added the barcodes for two samples (Barcode 1 & Barcode 2):

Adapter('Barcode 1 (forward)', start_sequence=('BC1', 'CAAACTGCAGGAGGGTAGCGCTAC'), end_sequence=('BC1_rev', 'GCAGGTGAAAGGCATTAGCGCAGG')), Adapter('Barcode 2 (forward)', start_sequence=('BC2', 'CAAACTGCAGGAGGGTAGCGCTAC'), end_sequence=('BC2_rev', 'AAGGACATAATGCATGAGCGGATC'))]

In the example above, both samples share the same forward barcode but have a different reverse barcode. Each of my samples was barcoded with a forward and reverse barcodes. The reverse barcodes are NOT the reverse complement of the forward but unique from the forward barcodes.

Now, when I run porechop with this adapter file, the barcodes are found in the sequences but reads are binned to the start sequence names instead of the actual Barcode names (The example below shows most reads in none because I am only using 2 instead of 100 barcodes for simplicity):

Looking for known adapter sets 4,000 / 4,000 (100.0%) Best
read Best
start read end Set %ID %ID
SQK-NSK007 100.0 81.8 Rapid 68.4 0.0 SQK-MAP006 80.0 83.3 SQK-MAP006 Short 80.8 77.8 PCR adapters 1 80.0 79.2 PCR tail 1 76.7 76.7 PCR tail 2 72.4 73.3 1D^2 part 1 73.3 76.7 1D^2 part 2 88.9 80.6 Barcode 1 (forward) 100.0 79.2 Barcode 2 (forward) 100.0 92.3

Barcodes determined to be in forward orientation

Trimming adapters from read ends SQK-NSK007_Y_Top: AATGTACTTCGTTCAGTTACGTATTGCT SQK-NSK007_Y_Bottom: GCAATACGTAACTGAACGAAGT BC1: CAAACTGCAGGAGGGTAGCGCTAC BC1_rev: GCAGGTGAAAGGCATTAGCGCAGG BC2: CAAACTGCAGGAGGGTAGCGCTAC BC2_rev: AAGGACATAATGCATGAGCGGATC

4,000 / 4,000 (100.0%)

3,183 / 4,000 reads had adapters trimmed from their start (102,974 bp removed) 1,489 / 4,000 reads had adapters trimmed from their end (22,429 bp removed)

Discarding reads containing middle adapters 4,000 / 4,000 (100.0%)

15 / 4,000 reads were discarded based on middle adapters

Saving trimmed reads to barcode-specific files

Barcode Reads Bases File
BC1 7 6,112 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/BC1.fastq BC2 6 3,774 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/BC2.fastq none 3,972 2,786,813 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/none.fastq

I assume that the issue is that barcode sets from ONT use unique barcodes for each sample (as in start_sequence would be barcode sequence and end_sequence would be the same sequence in reverse complement). Hence, when using an ONT barcode set, no sample would share any barcode sequence. And therefore reads are binned to the start_sequence "name" (BCx). This logic does not work in my case since some of my samples share the same start_sequence and porechop bins reads sharing a start_sequence to the same bin eventhough they have a different end_sequence.

For troubleshooting I tried to run the verbosity 2 setting and I can see that all start and end sequences are found in the reads. I ran with --require_two_barcodes and all reads get binned to "none". I used reverse, complement, and reverse-complement of the end_sequence barcode, with both default and --require_two_barcodes setting.

Nothing worked so far.

Could you please confirm on my suspicion of a logic problem and share your thoughts on how I could resolve this issue and get my samples demultiplexed?

I would be forever grateful!

All the best and many thanks in advance, Scott

rrwick commented 6 years ago

Hi Scott,

Sorry for the late reply - I was away for a few weeks.

I can confirm your suspicion - Porechop is failing because of the common sequences between barcodes. Here is the relevant code. In short, it is finding the barcode with the best match and the barcode with the second-best match, and if the best and second-best are too close in alignment quality, then it won't bin the read. Since your barcodes are the same at the read start, for many reads Porechop will find it's a tie between barcodes 1 and 2, and so the read is unbinned. I suspect your small number of binned reads (7 in BC1, 6 in BC2) are cases where the read lacked a good start adapter and was therefore binned using only the end adapter.

Setting --require_two_barcodes won't help because then it independently does that same check (best hit must be sufficiently better than the second best hit) for both the read start and end, and it will always fail at the read start.

You said you have 100 barcodes in total and some share start sequences. Are all end sequences unique, or are they shared too? E.g. do you have 10 unique starts and 10 unique ends to give 100 combinations?

If all end sequences are unique, then you could try removing the start_sequence from your barcodes. I.e. bin using end barcodes only.

If end barcodes are also shared, then here's the only solution I can think of. It's a bit clumsy, but it's the best Porechop could do without a major overhaul of it's logic:

  1. Define barcode sequences in Porechop using only the start_sequence.
  2. Run Porechop to bin your reads into 'start bins'. Use the --untrimmed option so it bins sequences but doesn't trim them.
  3. Go back into Porechop and redefine the barcodes using only the end_sequence. You can include your start barcodes as normal adapters (i.e. not barcodes) so they are trimmed off (they'll still be there because you used --untrimmed in the last run) but not used for binning.
  4. Run Porechop on each of the outputs from the first Porechop run.

This two-stage binning process should hopefully get you where you need. Sorry there's not a more elegant solution. I've never tried anything like this, so I hope it works! Please let me know how you go and if you have any questions.

Ryan

squainoo commented 6 years ago

Hi Ryan,

Thanks for your reply. In my case, both start and end sequences were shared, so your second approach solved my issue.

Thank you very much.

Best, Scott

rrwick commented 6 years ago

Glad to hear it!

timregan commented 3 years ago

Hi Ryan, I have the exact same issue as this (64 barcode combinations, 8 forward and 8 reverse). However, I am having difficulty. When I ran only the start_sequences for the forward barcodes, everything seemed to work really well!
./porechop-runner.py -i 16S.fastq.gz -b Porechop_16S --adapter_threshold 90 --barcode_threshold 80 --barcode_diff 2 --untrimmed However, running the same code on each bin after switching the adapters.py to only include end_sequence keeps giving me an out of range error.. As it is not clear to me in the original adapters.py file, would you mind providing an example of the format the barcode sequences should be in for the first round (forward barcodes) and second round (reverse barcodes) of binning?

Many thanks, Tim

timregan commented 3 years ago

Silly me, forgot to include the error code last time: Traceback (most recent call last): File "./porechop-runner.py", line 9, in <module> main() File "Porechop/porechop/porechop.py", line 62, in main forward_or_reverse_barcodes) File "Porechop/porechop/porechop.py", line 504, in find_adapters_at_read_ends for out in pool.imap(start_end_trim_one_arg, arg_list): File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/multiprocessing/pool.py", line 735, in next raise value File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/multiprocessing/pool.py", line 119, in worker result = (True, func(*args, **kwds)) File "Porechop/porechop/porechop.py", line 487, in start_end_trim_one_arg r.find_end_trim(a, b, c, d, e, f, g, k) File "Porechop/porechop/nanopore_read.py", line 207, in find_end_trim adapter.barcode_direction() == forward_or_reverse: File "Porechop/porechop/adapters.py", line 36, in barcode_direction if '_rev' in self.start_sequence[0]: IndexError: list index out of range

A simple look at this highlights where the problem is from. In case anyone else is having this issue, it's solved by changing line 36 of adapters.py to if '_rev' in self.end_sequence[0] NB This is only applicable to the second round of binning i.e. the reverse barcodes.

Cheers, Tim