Closed squainoo closed 6 years ago
Hi Scott,
Sorry for the late reply - I was away for a few weeks.
I can confirm your suspicion - Porechop is failing because of the common sequences between barcodes. Here is the relevant code. In short, it is finding the barcode with the best match and the barcode with the second-best match, and if the best and second-best are too close in alignment quality, then it won't bin the read. Since your barcodes are the same at the read start, for many reads Porechop will find it's a tie between barcodes 1 and 2, and so the read is unbinned. I suspect your small number of binned reads (7 in BC1, 6 in BC2) are cases where the read lacked a good start adapter and was therefore binned using only the end adapter.
Setting --require_two_barcodes
won't help because then it independently does that same check (best hit must be sufficiently better than the second best hit) for both the read start and end, and it will always fail at the read start.
You said you have 100 barcodes in total and some share start sequences. Are all end sequences unique, or are they shared too? E.g. do you have 10 unique starts and 10 unique ends to give 100 combinations?
If all end sequences are unique, then you could try removing the start_sequence
from your barcodes. I.e. bin using end barcodes only.
If end barcodes are also shared, then here's the only solution I can think of. It's a bit clumsy, but it's the best Porechop could do without a major overhaul of it's logic:
start_sequence
.--untrimmed
option so it bins sequences but doesn't trim them.end_sequence
. You can include your start barcodes as normal adapters (i.e. not barcodes) so they are trimmed off (they'll still be there because you used --untrimmed
in the last run) but not used for binning.This two-stage binning process should hopefully get you where you need. Sorry there's not a more elegant solution. I've never tried anything like this, so I hope it works! Please let me know how you go and if you have any questions.
Ryan
Hi Ryan,
Thanks for your reply. In my case, both start and end sequences were shared, so your second approach solved my issue.
Thank you very much.
Best, Scott
Glad to hear it!
Hi Ryan,
I have the exact same issue as this (64 barcode combinations, 8 forward and 8 reverse). However, I am having difficulty. When I ran only the start_sequences for the forward barcodes, everything seemed to work really well!
./porechop-runner.py -i 16S.fastq.gz -b Porechop_16S --adapter_threshold 90 --barcode_threshold 80 --barcode_diff 2 --untrimmed
However, running the same code on each bin after switching the adapters.py to only include end_sequence keeps giving me an out of range error..
As it is not clear to me in the original adapters.py file, would you mind providing an example of the format the barcode sequences should be in for the first round (forward barcodes) and second round (reverse barcodes) of binning?
Many thanks, Tim
Silly me, forgot to include the error code last time:
Traceback (most recent call last):
File "./porechop-runner.py", line 9, in <module> main()
File "Porechop/porechop/porechop.py", line 62, in main forward_or_reverse_barcodes)
File "Porechop/porechop/porechop.py", line 504, in find_adapters_at_read_ends for out in pool.imap(start_end_trim_one_arg, arg_list):
File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/multiprocessing/pool.py", line 735, in next raise value
File "/exports/applications/apps/SL7/anaconda/5.0.1/lib/python3.6/multiprocessing/pool.py", line 119, in worker result = (True, func(*args, **kwds))
File "Porechop/porechop/porechop.py", line 487, in start_end_trim_one_arg r.find_end_trim(a, b, c, d, e, f, g, k)
File "Porechop/porechop/nanopore_read.py", line 207, in find_end_trim adapter.barcode_direction() == forward_or_reverse:
File "Porechop/porechop/adapters.py", line 36, in barcode_direction if '_rev' in self.start_sequence[0]:
IndexError: list index out of range
A simple look at this highlights where the problem is from. In case anyone else is having this issue, it's solved by changing line 36 of adapters.py to if '_rev' in self.end_sequence[0]
NB This is only applicable to the second round of binning i.e. the reverse barcodes.
Cheers, Tim
Hi,
I am trying to demultiplex sequence reads that where pooled with custom barcodes. I followed your instructions and added my barcode sequences in the adapters.py file and removed the previous barcode sequences. Below is an example of how I added the barcodes for two samples (Barcode 1 & Barcode 2):
Adapter('Barcode 1 (forward)', start_sequence=('BC1', 'CAAACTGCAGGAGGGTAGCGCTAC'), end_sequence=('BC1_rev', 'GCAGGTGAAAGGCATTAGCGCAGG')), Adapter('Barcode 2 (forward)', start_sequence=('BC2', 'CAAACTGCAGGAGGGTAGCGCTAC'), end_sequence=('BC2_rev', 'AAGGACATAATGCATGAGCGGATC'))]
In the example above, both samples share the same forward barcode but have a different reverse barcode. Each of my samples was barcoded with a forward and reverse barcodes. The reverse barcodes are NOT the reverse complement of the forward but unique from the forward barcodes.
Now, when I run porechop with this adapter file, the barcodes are found in the sequences but reads are binned to the start sequence names instead of the actual Barcode names (The example below shows most reads in none because I am only using 2 instead of 100 barcodes for simplicity):
Looking for known adapter sets 4,000 / 4,000 (100.0%) Best
read Best
start read end Set %ID %ID
SQK-NSK007 100.0 81.8 Rapid 68.4 0.0 SQK-MAP006 80.0 83.3 SQK-MAP006 Short 80.8 77.8 PCR adapters 1 80.0 79.2 PCR tail 1 76.7 76.7 PCR tail 2 72.4 73.3 1D^2 part 1 73.3 76.7 1D^2 part 2 88.9 80.6 Barcode 1 (forward) 100.0 79.2 Barcode 2 (forward) 100.0 92.3
Barcodes determined to be in forward orientation
Trimming adapters from read ends SQK-NSK007_Y_Top: AATGTACTTCGTTCAGTTACGTATTGCT SQK-NSK007_Y_Bottom: GCAATACGTAACTGAACGAAGT BC1: CAAACTGCAGGAGGGTAGCGCTAC BC1_rev: GCAGGTGAAAGGCATTAGCGCAGG BC2: CAAACTGCAGGAGGGTAGCGCTAC BC2_rev: AAGGACATAATGCATGAGCGGATC
4,000 / 4,000 (100.0%)
3,183 / 4,000 reads had adapters trimmed from their start (102,974 bp removed) 1,489 / 4,000 reads had adapters trimmed from their end (22,429 bp removed)
Discarding reads containing middle adapters 4,000 / 4,000 (100.0%)
15 / 4,000 reads were discarded based on middle adapters
Saving trimmed reads to barcode-specific files
Barcode Reads Bases File
BC1 7 6,112 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/BC1.fastq BC2 6 3,774 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/BC2.fastq none 3,972 2,786,813 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/none.fastq
I assume that the issue is that barcode sets from ONT use unique barcodes for each sample (as in start_sequence would be barcode sequence and end_sequence would be the same sequence in reverse complement). Hence, when using an ONT barcode set, no sample would share any barcode sequence. And therefore reads are binned to the start_sequence "name" (BCx). This logic does not work in my case since some of my samples share the same start_sequence and porechop bins reads sharing a start_sequence to the same bin eventhough they have a different end_sequence.
For troubleshooting I tried to run the
verbosity 2
setting and I can see that all start and end sequences are found in the reads. I ran with--require_two_barcodes
and all reads get binned to "none". I used reverse, complement, and reverse-complement of the end_sequence barcode, with both default and--require_two_barcodes
setting.Nothing worked so far.
Could you please confirm on my suspicion of a logic problem and share your thoughts on how I could resolve this issue and get my samples demultiplexed?
I would be forever grateful!
All the best and many thanks in advance, Scott