merenlab / illumina-utils

A library and collection of scripts to work with Illumina paired-end data (for CASAVA 1.7+ pipeline).
GNU General Public License v2.0

Options to report prefix and trim suffix sequences #24

Closed semiller10 closed 4 years ago

semiller10 commented 4 years ago

This is a fairly simple change. It does not include faster merging of reads with no mismatches in the overlap, which is another feature I've been working on. The changes are largely explained by the help messages for the three new command-line arguments, and they required surprisingly few modifications to the code.

The --trim-suffix option is critical when merging reads whose insert sizes vary enough to produce a mixture of completely and partially overlapping inserts, because --marker-gene-stringent truncates merged reads that only partially overlap.

The --report-r1-prefix and --report-r2-prefix options report "prefix" sequences from merged reads. An example of a prefix sequence is a unique molecular identifier (UMI) of six random nucleotides located at the beginning of read 1, immediately before the insert. In the config file, this would be specified by the regex string "......" (six dots, each matching any single nucleotide).
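To illustrate how a six-dot prefix regex picks up a UMI, here is a minimal sketch using only Python's standard `re` module. It is not the code from this pull request: the function name `split_prefix`, the constant `UMI_PATTERN`, and the example read are all hypothetical, and the sketch assumes the prefix pattern is anchored at the start of the read.

```python
import re

# Hypothetical prefix pattern: six arbitrary characters, i.e. a 6-nt UMI.
UMI_PATTERN = re.compile(r"......")


def split_prefix(read, pattern=UMI_PATTERN):
    """Split a read into (prefix, remainder).

    Returns (None, read) when the pattern does not match at the
    start of the read, e.g. because the read is too short.
    """
    match = pattern.match(read)  # match() only matches at position 0
    if match is None:
        return None, read
    return match.group(0), read[match.end():]


# Made-up read: a 6-nt UMI followed by the insert.
prefix, insert = split_prefix("ACGTTGGATTACAGGCT")
print(prefix)  # ACGTTG
print(insert)  # GATTACAGGCT
```

Because each dot matches any single character, the same approach extends to a longer UMI or a fixed barcode by changing the pattern.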

I tested these changes with datasets spanning a range of sizes and a combination of options.

meren commented 4 years ago

Thank you very much, @semiller10! :)