merenlab / illumina-utils

A library and collection of scripts to work with Illumina paired-end data (for CASAVA 1.7+ pipeline).
GNU General Public License v2.0
89 stars 31 forks source link

Merging ITS amplicons #1

Closed meren closed 10 years ago

meren commented 10 years ago

ITS amplicons can have a great variation in length. Therefore the insert size may become too small to have a partial overlap. Better yet, the insert size may be long enough for a partial overlap, but then after trimming the prefixes from both reads you may run into a situation that requires complete overlap analysis instead of partial overlap.

Here is an example. Say this is read 1 in one of your ITS paired-end sequences:

@M01028:24:000000000-A49NB:1:1101:16134:1723 1:N:0:21
TACGTCAGCGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTACTGTTATTTACTACTACACTGCGTGAGCGGAACGAAAACAACAACACCTAAAATGTGGAATATAGCATATAGTCGACAAGAGAAATCTACGAAAAAACAAACAAAACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAGCGCAGCGAAATGCGATACCTAGTGTGAATTGCAGCCGTCGTGAAT
+
BBBBBFBBFBBBGGGGGGGFGGEHHHHHHHHHGHHGGHHHHHHHHHGGEGGGHBGHFHGHHHH5DFGHHHHHHHHHHHHHG>?B?EFEGGGGGGGDGHHHHHFHHGGH3F?3BFFGHFHHHHHHHBDGHHHHH/F/CFHHHHHHHHHHHGGHGGHGGHHHHHH.GHHHHHHHHHHHGGGGGHHHHGGGGGGGGGGGGGGGGFGGGGFGGG-@DF-@EFEDFFFFFFFFFFFFFFFFFFFFFFAFCFDA/A/

and this is read 2:

@M01028:24:000000000-A49NB:1:1101:16134:1723 2:N:0:21
GTTCAAAGATTCGATGATTCACGACGGCTGCAATTCACACTAGGTATCGCATTTCGCTGCGCTCTTCATCGATGCGAGAACCAAGAGATCCGTTGTTGAAAGTTTTGTTTGTTTTTTCGTAGATTTCTCTTGTCGACTATATGCTATATTCCACATTTTAGGTGTTGTTGTTTTCGTTCCGCTCACGCAGTGTAGTAGTAAATCACAGTAATGATCCTTCCGCAGGTTCACCTACGGAAACCTTGTTACGA
+
AABABFFFFFFFGGGGGGG6FGHGAAEEEGGFHGHFFF5BBB3BDFGGGGGGGHHHGGGGGGCGHHHHHHHHHHFEGGGGHHHHHHHGHHGHG3EGHH4BFFHHHHHGHHHHHHHGGHGHHGEHHFHHHHHHH1DGCGCHFGHFGDGFHHFGGHFGHHGF0GGGHGHHHHFHHHH?GHGGF?EGGHHG@BEFFBFFFBFFF0;FFFFFGGBF9BFF0FFGBFB@B;FFFFFFFFFFFF.BBFFBFBFBB?9

and this is your config.ini file to merge them:

[general]
project_name = CGCCTT_NNNNTCAGC_1
researcher_email = someone@somewhere.edu
input_directory = /somewhere/on/your/disk
output_directory = /somewhere/on/your/disk/output

[files]
pair_1 = r1
pair_2 = r2

# following section is optional
[prefixes]
pair_1_prefix = ^....TCAGCGTAAAAGTCGTAACAAGGTTTC
pair_2_prefix = ^GTTCAAAGA[C,T]TCGATGATTCAC

And if you run these like this:

merge-illumina-pairs short.ini

read 1 and 2 will not merge. Because, after trimming prefixes, they will want to be aligned like this after the reverse-complementing read 2:

read 1:                 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
read 2: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

unlike our expected partial overlap situation:

read 1: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
read 2:                 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

To solve this issue a new parameter is required. When m/o (mismatches at the overlapped region) fails miserably, the algorithm should check "the other way around" alternative, and pick the one with the best m/o.