jdidion / atropos

An NGS read trimming tool that is specific, sensitive, and speedy. (production)

How to interpret output from detect function #70

Closed parkerac closed 6 years ago

parkerac commented 6 years ago

Hi,

I'm trying to figure out what to use as my adapter sequence, and I'm not sure how to interpret the output of the detect function. Below I have included the output from 3 different pairs of RNA-seq files. What would I use as the adapter sequences based on this output? Thanks!

atropos detect -pe1 V71-T-SA08642_S14_L003_R1_001.fastq.gz -pe2 V71-T-SA08642_S14_L003_R2_001.fastq.gz
2018-07-19 12:32:25,069 INFO: This is Atropos 1.1.18 with Python 3.6.5
2018-07-19 12:32:25,073 INFO: Loading list of known contaminants from https://raw.githubusercontent.com/jdidion/atropos/master/atropos/adapters/sequencing_adapters.fa
2018-07-19 12:32:25,362 INFO: Detecting adapters and other potential contaminant sequences based on 12-mers in 10000 reads

======= Input 1

File: /panfs/pan.fsl.byu.edu/scr/usr/19/parkerac/complete_download/V71-T-SA08642_S14_L003_R1_001.fastq.gz
Detected 2 adapters/contaminants:

  1. Longest kmer: GCTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGACCTGCTCCGTTTCCGACCTGGGCCGGTTCACCCCTCCTTAGGCAACCTGGTGGTCCCCCGCTCC Longest matching sequence: GCTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGACCTGCTCCGTTTCCGACCTGGGCCGGTTCACCCCTCCTTAGGCAACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCG Abundance (full-length) in 10000 reads: 66 (0.7%) Number of k-mer matches: 2000224
  2. Longest kmer: GGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTTGAGAGCTTGTTTGGAGGTTCTAGCAGGGGAGCGCAGCTACTCGTATACCCTTGACCGAAGACCGGTCCTCCTCTATC Longest matching sequence: GGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTTGAGAGCTTGTTTGGAGGTTCTAGCAGGGGAGCGCAGCTACTCGTATACCCTTGACCGAAGACCGGTCCTCCTCTATCGGGGATGGTCGTCCTCTTCGACC Abundance (full-length) in 10000 reads: 35 (0.4%) Number of k-mer matches: 1111596

======= Input 2

File: /panfs/pan.fsl.byu.edu/scr/usr/19/parkerac/complete_download/V71-T-SA08642_S14_L003_R2_001.fastq.gz
Detected 2 adapters/contaminants:

  1. Longest kmer: GATCGCCAGGGTTGATTCGGCTGATCTGGCTGGCTAGGCGGGTGTCCCCTTCCTCCCTCACCGCTCCATGTGCGTCCCTCCCGAAGCTGCGCGCTCGGTCGAAGAGGACGACCATCC Longest matching sequence: GATCGCCAGGGTTGATTCGGCTGATCTGGCTGGCTAGGCGGGTGTCCCCTTCCTCCCTCACCGCTCCATGTGCGTCCCTCCCGAAGCTGCGCGCTCGGTCGAAGAGGACGACCATCCCCA Abundance (full-length) in 10000 reads: 61 (0.6%) Number of k-mer matches: 1062774
  2. Longest kmer: GGAGTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGGCATCAATATGGTGACCTCCCGGGAGCGGGGGACCACCAGGTTGCCTAAGGAG Longest matching sequence: GGAGTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGGCATCAATATGGTGACCTCCCGGGAGCGGGGGACCACCAGGTTGCCTAAGGAGGGGTGAACCGGCCCAGGTCGGAAACGGAGCAGGTCAAAACTCCA Abundance (full-length) in 10000 reads: 98 (1.0%) Number of k-mer matches: 1659203

atropos detect -pe1 MKN-28-WT-n+44-SA08569_S53_L007_R1_001.fastq.gz -pe2 MKN-28-WT-n+44-SA08569_S53_L007_R2_001.fastq.gz
2018-07-19 12:35:41,431 INFO: This is Atropos 1.1.18 with Python 3.6.5
2018-07-19 12:35:41,443 INFO: Detecting adapters and other potential contaminant sequences based on 12-mers in 10000 reads

======= Input 1

File: /panfs/pan.fsl.byu.edu/scr/usr/19/parkerac/complete_download/MKN-28-WT-n+44-SA08569_S53_L007_R1_001.fastq.gz
Detected 1 adapters/contaminants:

  1. Longest kmer: GCTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGACCTGCTCCGTTTCCGACCTGGGCCGGTTCACCCCTCCTTAGGCAACCTGGTGGTCCCCCGCTCCCG Longest matching sequence: GCTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGACCTGCTCCGTTTCCGACCTGGGCCGGTTCACCCCTCCTTAGGCAACCTGGTGGTCCCCCGCTCCCGGGAGGA Abundance (full-length) in 10000 reads: 35 (0.4%) Number of k-mer matches: 718706

======= Input 2

File: /panfs/pan.fsl.byu.edu/scr/usr/19/parkerac/complete_download/MKN-28-WT-n+44-SA08569_S53_L007_R2_001.fastq.gz
Detected 1 adapters/contaminants:

  1. Longest kmer: GCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGGCATCAATATGGTGACCTCCCGGGAGCGGGGGACCACCAGGTTGCCTAAGGA Longest matching sequence: GCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGGCATCAATATGGTGACCTCCCGGGAGCGGGGGACCACCAGGTTGCCTAAGGAGGGGTGAACCGGCCCAGGTCGGAAACGGAGCAGGTCAAAACTCCCGTGCTGATCAGT Abundance (full-length) in 10000 reads: 48 (0.5%) Number of k-mer matches: 386039

atropos detect -pe1 106-N-SA08621_S77_L008_R1_001.fastq.gz -pe2 106-N-SA08621_S77_L008_R2_001.fastq.gz
2018-07-19 12:36:53,131 INFO: This is Atropos 1.1.18 with Python 3.6.5
2018-07-19 12:36:53,143 INFO: Detecting adapters and other potential contaminant sequences based on 12-mers in 10000 reads

======= Input 1

File: /panfs/pan.fsl.byu.edu/scr/usr/19/parkerac/complete_download/106-N-SA08621_S77_L008_R1_001.fastq.gz
Detected 2 adapters/contaminants:

  1. Longest kmer: GGCTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGACCTGCTCCGTTTCCGACCTGGGCCGGTTCACCCCTCCTTAGGCAACCTGGTGGTCCCCC Longest matching sequence: GGCTGGAGTGCAGTGGCTATTCACAGGCGCGATCCCACTACTGATCAGCACGGGAGTTTTGACCTGCTCCGTTTCCGACCTGGGCCGGTTCACCCCTCCTTAGGCAACCTGGTGGTCCCCCCGCTCCCGGGAGGTCACCATATTGATGCC Abundance (full-length) in 10000 reads: 93 (0.9%) Number of k-mer matches: 1480229
  2. Longest kmer: CCTTAGGCAACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCGAACTTAGTGCGGACACCCGATCGGCATAGCGCACTACAGCCCAGAACTCCTGG Longest matching sequence: CCTTAGGCAACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCGAACTTAGTGCGGACACCCGATCGGCATAGCGCACTACAGCCCAGAACTCCTGGGCTCAAGCGATCCTCCCACCTCAGA Abundance (full-length) in 10000 reads: 95 (0.9%) Number of k-mer matches: 741937

======= Input 2

File: /panfs/pan.fsl.byu.edu/scr/usr/19/parkerac/complete_download/106-N-SA08621_S77_L008_R2_001.fastq.gz
Detected 1 adapters/contaminants:

  1. Longest kmer: CAGGAGTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGGCATCAATATGGTGACCTCCCGGGAGCGGGGGACCACCAGGTTGCCTAAGGA Longest matching sequence: CAGGAGTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGGCATCAATATGGTGACCTCCCGGGAGCGGGGGACCACCAGGTTGCCTAAGGAGGGGGGAACCGGCCCAGGTCGGAAACGGAGCAGGTCAAAACTCC Abundance (full-length) in 10000 reads: 65 (0.7%) Number of k-mer matches: 1125540
jdidion commented 6 years ago

Hi there - the detect command identifies common sequences in your reads and (optionally) compares them to a database of known adapters. Unfortunately, this can fail to detect adapters in RNA-Seq data because RNA-Seq coverage is inherently non-uniform: sequences from highly expressed genes may be present at higher frequency than adapter sequences. For example, when I BLAT the very first sequence (GCTGG…), I get a perfect match to the RN7SL1 gene.
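
To make the point about non-uniform coverage concrete, here is a minimal sketch of k-mer counting on toy data. It is not how Atropos implements detection; the function name, the read counts, and the insert sequence are all made up for illustration, and the adapter is simply the start of a commonly used Illumina TruSeq adapter. The transcript fragment is the beginning of the first sequence reported above.

```python
from collections import Counter


def count_kmers(reads, k=12):
    """Count every k-mer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical toy data (assumed for illustration only).
TRANSCRIPT = "GCTGGAGTGCAGTGGCTATTCACAGGCG"  # start of the first reported sequence
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCC"     # illustrative TruSeq adapter prefix
INSERT = "ACGTTGCAACGGTTAACC"                # arbitrary short insert

# Eight reads from a highly expressed transcript, two reads that run into the adapter.
reads = [TRANSCRIPT] * 8 + [INSERT + ADAPTER] * 2

counts = count_kmers(reads, k=12)
top_kmer, top_count = counts.most_common(1)[0]
best_adapter_count = max(
    counts[ADAPTER[i:i + 12]] for i in range(len(ADAPTER) - 12 + 1)
)

print("most frequent k-mer:", top_kmer, top_count)
print("from transcript:", top_kmer in TRANSCRIPT)
print("best adapter k-mer count:", best_adapter_count)
```

In this toy set the most frequent 12-mer comes from the transcript, not the adapter, which is the same effect seen in the output above: a frequency-based search reports whatever is most abundant, whether or not it is an adapter.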

One thing you can try is to specify '--detector known', which forces it to search only for known adapter sequences. If that doesn't work, then sadly this method won't work for your data. The detect command is still under development, so there may be improvements for RNA-Seq data in the future.
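
As a rough illustration of why restricting the search to known adapters helps, the sketch below counts only reads that contain the start of a sequence from a known list, so abundant transcript sequences are never reported. This is an assumption-laden toy, not the Atropos implementation: the function name, the `min_overlap` parameter, the adapter labels, and the reads are all hypothetical.

```python
def known_adapter_hits(reads, known_adapters, min_overlap=12):
    """For each known adapter, count reads containing its first min_overlap bases."""
    hits = {}
    for name, seq in known_adapters.items():
        seed = seq[:min_overlap]
        hits[name] = sum(1 for read in reads if seed in read)
    return hits


# Hypothetical known-adapter list (labels are made up; sequences are common Illumina adapters).
known_adapters = {
    "TruSeq_R1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "Nextera": "CTGTCTCTTATACACATCT",
}

# Toy reads: eight from an abundant transcript, two that run into the TruSeq adapter.
transcript = "GCTGGAGTGCAGTGGCTATTCACAGGCG"
toy_reads = [transcript] * 8 + ["ACGTTGCAACGGTTAACC" + known_adapters["TruSeq_R1"]] * 2

hits = known_adapter_hits(toy_reads, known_adapters)
print(hits)
```

Here the abundant transcript reads contribute nothing, and only the adapter actually present in the toy reads gets a nonzero count.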

John

parkerac commented 6 years ago

Great! Thanks for the information. I tried running atropos with a file of known adapters, and it seems to be working.

jdidion commented 6 years ago

Great!
