marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Ensuring to find the correct file after Demultiplexing my 16S amplicon raw dataset with combinatorial dual indexes cutadapt command #776

Open JayalalKJ opened 5 months ago

JayalalKJ commented 5 months ago

Here is some of the output generated by the code above:

LIB1_sample_189-LIB1_sample_183.1.fastq.gz
LIB1_sample_189-LIB1_sample_183.2.fastq.gz
LIB1_sample_189-LIB1_sample_184.1.fastq.gz
LIB1_sample_189-LIB1_sample_184.2.fastq.gz
LIB1_sample_189-LIB1_sample_185.1.fastq.gz
LIB1_sample_189-LIB1_sample_185.2.fastq.gz
LIB1_sample_189-LIB1_sample_186.1.fastq.gz
LIB1_sample_189-LIB1_sample_186.2.fastq.gz
LIB1_sample_189-LIB1_sample_187.1.fastq.gz
LIB1_sample_189-LIB1_sample_187.2.fastq.gz
LIB1_sample_189-LIB1_sample_188.1.fastq.gz
LIB1_sample_189-LIB1_sample_188.2.fastq.gz
LIB1_sample_189-LIB1_sample_189.1.fastq.gz
LIB1_sample_189-LIB1_sample_189.2.fastq.gz
LIB1_sample_189-LIB1_sample_190.1.fastq.gz
LIB1_sample_189-LIB1_sample_190.2.fastq.gz
  1. What is the correct combination of fastq.gz files for further analysis? I have a total of 230 samples.
  2. How can I pick those right files from the combination pool folder?

marcelm commented 5 months ago

How many barcodes are in barcodes_fwd.fasta and barcodes_rev.fasta, respectively? Are you sure you have combinatorial dual indices? According to Illumina’s documentation, you can have up to 96 samples, so I wonder how the 230 samples fit in there.

JayalalKJ commented 5 months ago

Thanks for your question. It was 160 (96 in Set B and the rest in Set C).

kit used: NextFlex Rapid XP kit

Forward primer: GTGCCAGCMGCCGCGGTAA Reverse primer: GGACTACHVGGGTWTCTAAT

Two files have been received. sample1_1_L001_R1_001.fastq sample1_1_L001_R2_001.fastq
sample2_1_L001_R1_001.fastq sample2_1_L001_R2_001.fastq

head -n 4 sample2_1_L001_R1_001.fastq

@A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
GTGCCAGCCGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCTCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
+A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
,,:FFF,F,:FF:F:F:FFFF:F,F,FF:,FFF:F,,,,:FFF,F,,:FF::F:,:FFFF:F:F:F,FF::F:FF:,FFFFF,FF,,F,F,,FFFFFFFFF::FF:FF:FFF:FF:FF,,:,:FF:FF:::FF:FF,F:FFF,:F,F:FFFFF:F,:F,FFFFFFF:FFFFF:FFFFFFFF:F:FFFFFF:FFFFFFFFFFFFFFFFF::FFFFF:FFFFFFF:FFF,FFFFFFFFFFFFFF

How many barcodes are in barcodes_fwd.fasta and barcodes_rev.fasta, respectively?

Two 96-well plates of barcodes, B and C. Examples are shown below:

In Plate B, the barcode combination is as follows:

"LIB1" "LIB1_sample_001" "AACAAGCC:GGAATGAG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01A"
"LIB1" "LIB1_sample_002" "TTACCGCT:CCAGTATG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01B"
"LIB1" "LIB1_sample_003" "TTACGCCA:TCATAGCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01C"
"LIB1" "LIB1_sample_004" "GTGATCTC:ACTTGGCT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01D"
"LIB1" "LIB1_sample_005" "GGATAGCA:CAGTAGAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01E"
"LIB1" "LIB1_sample_006" "AACCTCAG:GGATGATC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01F"
"LIB1" "LIB1_sample_007" "CAACCTCA:GGATTCGA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01G"
"LIB1" "LIB1_sample_008" "CGCCAATT:TAGCAAGG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01H"
"LIB1" "LIB1_sample_009" "GGAATGAG:AATTGCCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02A"
"LIB1" "LIB1_sample_010" "CCAGTATG:TGAGATGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02B"
"LIB1" "LIB1_sample_011" "TCATAGCG:TGAGGACA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02C"

In Plate C, the barcode combination is as follows:

"LIB1" "LIB1_sample_001" "AACAAGCC:AATTGCCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01A"
"LIB1" "LIB1_sample_002" "TTACCGCT:TGAGATGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01B"
"LIB1" "LIB1_sample_003" "TTACGCCA:TGAGGACA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01C"
"LIB1" "LIB1_sample_004" "GTGATCTC:CGATACAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01D"
"LIB1" "LIB1_sample_005" "GGATAGCA:TAGCCACT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01E"
"LIB1" "LIB1_sample_006" "AACCTCAG:TATCTGGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01F"
"LIB1" "LIB1_sample_007" "CAACCTCA:TTAGGCAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01G"
"LIB1" "LIB1_sample_008" "CGCCAATT:GTCACAGA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01H"
"LIB1" "LIB1_sample_009" "GGAATGAG:AACAAGCC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02A"
"LIB1" "LIB1_sample_010" "CCAGTATG:TTACCGCT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02B"
"LIB1" "LIB1_sample_011" "TCATAGCG:TTACGCCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02C"
"LIB1" "LIB1_sample_012" "ACTTGGCT:GTGATCTC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02D"
"LIB1" "LIB1_sample_013" "CAGTAGAC:GGATAGCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02E"
"LIB1" "LIB1_sample_014" "GGATGATC:AACCTCAG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02F"
"LIB1" "LIB1_sample_015" "GGATTCGA:CAACCTCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02G"
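Each plate record is a row of six double-quoted fields, with the two barcodes joined by a colon in the third field. A minimal sketch of how such a row could be split into a sample name and its forward and reverse barcodes is shown here. The names `SheetRow` and `parse_sheet_row` are hypothetical, and the forward:reverse order within the pair is an assumption taken from how the barcode FASTA files in this thread were built.

```python
import shlex
from typing import NamedTuple


class SheetRow(NamedTuple):
    sample: str
    fwd_barcode: str
    rev_barcode: str


def parse_sheet_row(line: str) -> SheetRow:
    """Parse one quoted, space-separated row of the plate sheet."""
    fields = shlex.split(line)  # honours the double quotes around each field
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    _library, sample, barcode_pair, _fwd_primer, _rev_primer, _position = fields
    # Assumption: the pair is written as forward:reverse.
    fwd_barcode, rev_barcode = barcode_pair.split(":")
    return SheetRow(sample, fwd_barcode, rev_barcode)


row = parse_sheet_row(
    '"LIB1" "LIB1_sample_002" "TTACCGCT:CCAGTATG" "GTGCCAGCMGCCGCGGTAA" '
    '"GGACTACHVGGGTWTCTAAT" "F @ position=01_01B"'
)
print(row.sample, row.fwd_barcode, row.rev_barcode)
```

Collecting the parsed rows from both plates gives the list of barcode pairs that were actually used, which is the information needed later to decide which demultiplexed files are meaningful.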

I combined plates B and C and created barcodes_fwd.fasta and barcodes_rev.fasta files for each sample number. This means that plate B starts with LIB1_sample_001 and ends with LIB1_sample_096, while plate C starts with LIB1_sample_097 and ends with LIB1_sample_160.

barcodes_fwd.fasta

LIB1_sample_001 AACAAGCC
LIB1_sample_002 TTACCGCT
LIB1_sample_003 TTACGCCA
LIB1_sample_004 GTGATCTC
LIB1_sample_005 GGATAGCA
LIB1_sample_006 AACCTCAG
LIB1_sample_007 CAACCTCA
...................................
LIB1_sample_160

barcodes_rev.fasta

LIB1_sample_001 GGAATGAG
LIB1_sample_002 CCAGTATG
LIB1_sample_003 TCATAGCG
LIB1_sample_004 ACTTGGCT
LIB1_sample_005 CAGTAGAC
LIB1_sample_006 GGATGATC
LIB1_sample_007 GGATTCGA
...............................
LIB1_sample_160

marcelm commented 5 months ago

I still don’t have a clear picture of how your data is structured. I believe you need to understand this yourself before you can proceed. In particular, you need to figure out where the index sequences are.

First, to be explicit: There is a difference between unique dual indexing and combinatorial indexing. Unless you have reliable information that combinatorial indexing was used, it is more likely that unique dual indexing was done.
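The difference can be checked mechanically on a list of index pairs. The following is a minimal sketch, not part of cutadapt: `indexing_scheme` is a hypothetical helper that reports "unique dual" when no index sequence is shared between samples at either position, and "combinatorial" when at least one index is reused and only the pair identifies the sample.

```python
from collections import Counter
from typing import Iterable, Tuple


def indexing_scheme(pairs: Iterable[Tuple[str, str]]) -> str:
    """Classify (forward, reverse) index pairs as 'unique dual' or 'combinatorial'."""
    pairs = list(pairs)
    if len(set(pairs)) != len(pairs):
        raise ValueError("the same index pair is assigned to more than one sample")
    fwd_counts = Counter(fwd for fwd, _ in pairs)
    rev_counts = Counter(rev for _, rev in pairs)
    reused = any(n > 1 for n in fwd_counts.values()) or any(
        n > 1 for n in rev_counts.values()
    )
    return "combinatorial" if reused else "unique dual"


# Pairs copied from the plate sheets posted in this thread.
plate_b = [("AACAAGCC", "GGAATGAG"), ("TTACCGCT", "CCAGTATG")]
plate_c = [("AACAAGCC", "AATTGCCG")]
print(indexing_scheme(plate_b))            # unique dual
print(indexing_scheme(plate_b + plate_c))  # combinatorial
```

With the rows posted in this thread, the Plate B pairs on their own are unique at each position, but the first Plate C pair reuses AACAAGCC with a different partner, so the two plates taken together behave as a combinatorial design.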

kit used: NextFlex Rapid XP kit

I am not familiar with it, but the manual for version 2 of that kit talks about Unique Dual Indices:

In addition, the availability of up to 1,536 different Unique Dual Index adapter barcodes facilitates high-throughput applications.

According to the same manual, the UDIs need to be bought separately, so it would still be possible to use combinatorial indexing, but that would be against Illumina’s own advice:

Illumina recommends using unique dual indexing (UDI) whenever possible for the most accurate demultiplexing.

Your first read looks like this:

head -n 4 sample2_1_L001_R1_001.fastq

@A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
GTGCCAGCCGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCTCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
...

Two files have been received. sample1_1_L001_R1_001.fastq sample1_1_L001_R2_001.fastq sample2_1_L001_R1_001.fastq sample2_1_L001_R2_001.fastq

Do all reads in the first file contain CGCTGCTC+GATCTGCC? What about the second file?

JayalalKJ commented 5 months ago

Hello there

Do all reads in the first file contain CGCTGCTC+GATCTGCC? What about the second file?

Yes.

Checking LIB1_L2_1.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_1_L2_1.fq: 23013460
Total number of reads in LIB1_L2_1.fq: 23013460
Checking LIB1_L2_2.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L2_2.fq: 23013460
Total number of reads in LIB1_L2_2.fq: 23013460
Checking LIB1_L1_1.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L1_1.fq: 24176149
Total number of reads in LIB1_L1_1.fq: 24176149
Checking LIB1_L1_2.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L1_2.fq: 24176149
Total number of reads in LIB1_L1_2.fq: 24176149
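A check of this kind can be sketched in a few lines of Python. This is not the script that produced the numbers above. `count_index_pair` is a hypothetical helper, and it assumes plain four-line FASTQ records whose header ends with the index pair, as in the reads shown in this thread.

```python
import io
from typing import Iterable, Tuple


def count_index_pair(fastq_lines: Iterable[str], index_pair: str) -> Tuple[int, int]:
    """Return (reads whose header ends with index_pair, total reads)."""
    matching = total = 0
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 != 0:  # only every fourth line is a header
            continue
        total += 1
        if line.rstrip().endswith(":" + index_pair):
            matching += 1
    return matching, total


# Tiny in-memory example; a real file would be passed as an open file handle.
example = io.StringIO(
    "@r1 1:N:0:CGCTGCTC+GATCTGCC\nACGT\n+\nFFFF\n"
    "@r2 1:N:0:AAAAAAAA+CCCCCCCC\nACGT\n+\nFFFF\n"
)
print(count_index_pair(example, "CGCTGCTC+GATCTGCC"))  # (1, 2)
```

When both numbers are equal for every file, as reported above, the Illumina index pair in the header carries no per-sample information, so the samples have to be separated by barcodes inside the reads.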

I had given a bad example; here is the right one, from L001_1.fastq.

In the string 'TTACCGCTGTGCCAGCAGCCGCGGTAA' from L001_1, the first eight nucleotides are the barcode TTACCGCT, and the rest is the forward primer.

@A00783:1516:HWGV2DRX3:1:2145:3992:12289 1:N:0:CGCTGCTC+GATCTGCC
TTACCGCTGTGCCAGCAGCCGCGGTAACACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCGCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
+
FFFFFF:FFFF:FFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFF:FFFF,FFFFFFFFFF:FFFFF::FFFFF:FFFF:FFFFFF,,FFF,FFFFFF:FF:FFF:FFF:FFFFFFFFFFF:FF:FFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2146:4408:26099 1:N:0:CGCTGCTC+GATCTGCC
GCCTTCGCCTCTGGTGTTCTTCCATATATCTACGCATTCCACCGCTCCACATGGAGTTCCAAAACTCTCTACTATACTCTAGACTTACAGTTTCCAATGCAAACAACCGTTAAGCAGTTGTCTTTAACATCAGACTTGTAAATCCGCCTGCGCCCGCTTTACGCCCAGTAAATCCGGATAACGCTTGCATCCTATGTGTTACCGCTGTGCCAGCAGCCGCGGTAACAGATCGGAAGAGCACACGTCTGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,FF:FFFFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2147:20889:26741 1:N:0:CGCTGCTC+GATCTGCC
ATTACCGCTGTGCCAGCAGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCGCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGAGATATGTAAGTCCCCCGGAGGCGAACGAGACACCTGAGGCGATCTCTGACGC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFF,FFFFFFFFFFFF,FF::F,F,:,,:,,F,:,,F,FF,,,,FF,F,F,,:F:,,F,,,F,,F,FF

@A00783:1516:HWGV2DRX3:1:2151:12608:26271 1:N:0:CGCTGCTC+GATCTGCC
AGCCTAAGTTTTCGCCTTCGCCTCTGGTGTTCTTCCATATATCTACGCATTCCACCGCTCCACATGGAGTTCCAAAACTCTCTACTATACTCTAGACTTACAGTTTCCAATGCAAACAACCGTTAAGCAGTTGTCTTTAACATCAGACTTGTAAATCCGCCTGCGCTCGCTTTACGCCCAGTAAATCCGGATAACGCTTGCATCCTATGTATTACCGCTGTGCCAGCAGCCGCGGTAAAGATCGGAAGAG

L001_1.fastq

In the string 'TTACCGCTGGACTACAGGGGTATCTAAT' from L001_1, the first eight nucleotides are the barcode TTACCGCT, and the rest is the reverse primer.

@A00783:1516:HWGV2DRX3:1:2116:7536:34084 1:N:0:CGCTGCTC+GATCTGCC
TTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACCTGAGCGTCAGTCTTCGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACGAGACTCAAGCTTGCCAGTATCAGATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCTGACTTAACAAACCGCCTGCGTGCGCTTTACGCCCA
+
,FF:FFF:FFFFFFFF:F::FFFFFFF::F:F:FFFFFFFF:FFFF:FFFF:FFFFFF:FFFFF:FFF:FFF:FFFFFFFFFF:FFFFFFFFFFFF:FFFFF:FFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFF:,FFFFFFF:FFFFFFFFFFF::F,,F:FF,FFFF:FFF:FF::FF:F,FF::F::FF:,FFFF:::FFF:FF,FFFF:FFF:FF,F:F:,,FFFFF:FFFF,,FFFF

@A00783:1516:HWGV2DRX3:1:2117:22815:14325 1:N:0:CGCTGCTC+GATCTGCC
CTTGTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGAGCCTCAGTGTCAGTATGATGCCAGGAAGCTGCCTTCGCCATCGGTATTCCTTCAGATCTCTACGCATTTCACCGCTACACCTGAAATTCTACTTCCCTCTCACCTACTCTAGCCTAACAGTTTCAGATGCAGTTCCCAGGTTAAGCCCGGGGATTTCACATCTGACTTATCAAGCCACCTACGCTCGCTTTACG
+
FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2121:20509:32659 1:N:0:CGCTGCTC+GATCTGCC
CACCTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTCCCCACACTTTCGCACCTCAGCGTCAGTATCGAGCCAGTGAGCCGCCTTCGCCACTGGTGTTCCTCCGAATATCTACGAATTTCACCTCTACACTCGGAATTCCACTCACCTCTCTCGAACTCAAGACCAGGAGTTTACAAGGCAGTTCCAGGGTTGAGCCCTGGGATTTCACCCTATACTTTCTGATCCGCCTACGTGCGCTTTACG
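In the reads above, the eight-base barcode sits directly before either primer, sometimes after a few extra leading bases. A minimal sketch of how that layout could be inspected is given here. `find_inline_barcode` is a hypothetical helper, the primers are the ones quoted in this thread with their IUPAC codes expanded into regular-expression classes, and the limit of eight leading bases is an assumption chosen for illustration only.

```python
import re
from typing import Optional, Tuple

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]", "Y": "[CT]", "K": "[GT]",
    "V": "[ACG]", "H": "[ACT]", "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}
PRIMERS = {
    "forward": "GTGCCAGCMGCCGCGGTAA",
    "reverse": "GGACTACHVGGGTWTCTAAT",
}
BARCODE_LENGTH = 8  # "the first eight nucleotides" in the thread
MAX_PADDING = 8     # assumption: at most 8 extra bases before the barcode

PRIMER_PATTERNS = {
    name: re.compile("".join(IUPAC[base] for base in primer))
    for name, primer in PRIMERS.items()
}


def find_inline_barcode(read: str) -> Optional[Tuple[str, str, int]]:
    """Return (primer name, barcode, padding length), or None if no primer
    starts within the allowed window at the 5' end of the read."""
    for name, pattern in PRIMER_PATTERNS.items():
        window_end = BARCODE_LENGTH + MAX_PADDING + len(PRIMERS[name])
        match = pattern.search(read, BARCODE_LENGTH, window_end)
        if match:
            start = match.start()
            return name, read[start - BARCODE_LENGTH:start], start - BARCODE_LENGTH
    return None


# 5' ends of three reads from this thread.
for read in (
    "TTACCGCTGTGCCAGCAGCCGCGGTAACACATAGGATGC",
    "ATTACCGCTGTGCCAGCAGCCGCGGTAATACATAGGATGC",
    "CACCTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGC",
):
    print(find_inline_barcode(read))
```

For these three reads the sketch reports the barcode TTACCGCT before the forward primer with 0 and 1 leading bases, and before the reverse primer with 4 leading bases. This suggests that R1 contains amplicons in both orientations and that the barcode is not always at a fixed position, both of which matter when choosing how to anchor the barcodes for demultiplexing.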

marcelm commented 5 months ago

What is the correct combination of fastq.gz files for further analysis? I have a total of 230 samples. How can I pick those right files from the combination pool folder?

Do you know which 230 barcode combinations are possible? Then you look for only those in the following way:

If you do not know which combinations are possible, just use the 230 biggest files. That may not be entirely correct, but it may be good enough.
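Both selection strategies can be sketched in Python. This is an illustration under assumptions, not the command from the original reply: `expected_files` and `largest_r1_files` are hypothetical helpers, and the file naming `<forward name>-<reverse name>.<read>.fastq.gz` is inferred from the output listing at the top of this thread.

```python
from pathlib import Path
from typing import Iterable, List, Tuple, Union

PathLike = Union[str, Path]


def expected_files(pairs: Iterable[Tuple[str, str]], directory: PathLike) -> List[Path]:
    """Build the R1/R2 file paths for each known (forward, reverse) barcode name pair."""
    files = []
    for fwd_name, rev_name in pairs:
        for read in (1, 2):
            files.append(Path(directory) / f"{fwd_name}-{rev_name}.{read}.fastq.gz")
    return files


def largest_r1_files(directory: PathLike, n: int) -> List[Path]:
    """Return the n largest R1 files, using compressed size as a rough proxy for read count."""
    r1_files = Path(directory).glob("*.1.fastq.gz")
    return sorted(r1_files, key=lambda p: p.stat().st_size, reverse=True)[:n]


# Known combinations: only path construction, nothing is read from disk here.
known_pairs = [("LIB1_sample_189", "LIB1_sample_189")]
for path in expected_files(known_pairs, "demultiplexed"):
    print(path.name)
```

For the fallback, `largest_r1_files("demultiplexed", 230)` would list the 230 biggest R1 files, and the matching R2 file for each one has the same name with `.2.` in place of `.1.`. Compressed size only approximates the number of reads, so the result is worth checking against the sample sheet where one is available.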