jstjohn / SeqPrep

Tool for stripping adaptors and/or merging paired reads with overlap into single reads.
MIT License

quality score adding #2

Closed t-manny closed 13 years ago

t-manny commented 13 years ago

This is a really useful tool. Great job!

One issue I am having trouble with is capping the maximum quality score. Adding the scores makes sense in theory, but it's been shown that these scores severely underestimate the actual error rates when >30. Also, some popular aligners like novoalign crash when they see scores >40. Capping at a maximum value of 40 would be desirable.

I tried setting MAX_QUAL=73, which works most of the time, but I still see some funny things happening, e.g.:

@ID23436_L
CGGCGGGCAGCAGCAGAGTCTTCTTGTCCCACAGCACCCCAG
+
GGGGGFFFFEHGHHHHHEGGGHHHHHFHHHHHHHHHHHHHHH
@ID23438_L
GGGGACCACTGGGGAGTGAGAAATGAGCCCCTTCTCAACACCTAAGGGGGACCTGCCTCCATCCCTGACCTCTCTCCTACCCCCCT
+
iknefklghhkej]eeggijekkeilgmkkmjfkjkl`lejmmkkhfjhfgmimibglkklhkSjed_aibjihiln_nmmnnfni

Notice the last quality line, where the scores are huge. That record comes from merging these two reads:

@p248622
GGGGACCACTGGGGAGTGAGAAATGAGCCCCTTCTCAACACCTAAGGGGGACCTGCCTCCATCCCTGACCTCTCTCCTACCCCCCTAGAT
+
HHHEHHHGHHGFHEFGDGGGBGGGFHFHFFHFEGHGH@HHFHHEHBAECDBFBFF;@EDDEEE.EC>;;C=DDCCFG8GFFGG?GBE8CE

@p248622
AGGGGGGTAGGAGAGAGGTCAGGGATGGAGGCAGGTCCCCCTTAGGTGTTGAGAAGGGGCTCATTTCTCACTCCCCAGTGGTCCCCAGAT
+
HHHHHHHHHHGGFFGFGGEGCFFGDHHHHHHDHHHFCFFFGDGFFE>EAEECEBEFFFFBED?EEDDCAD?@9C@EAAAED?AGDB3B>>

Any quick fix you can think of?
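
To make the problem concrete, here is a small Python sketch for spotting quality characters above a cap. It assumes Phred+33 encoding (under which MAX_QUAL=73 is the character 'I', i.e. Phred 40); the helper name `positions_above` is made up for illustration and is not part of SeqPrep.

```python
def positions_above(quality, max_phred=40, offset=33):
    """Return (index, character, phred) for every score above max_phred.

    Assumes the quality string is Phred+offset encoded (offset=33 here).
    """
    hits = []
    for index, char in enumerate(quality):
        phred = ord(char) - offset
        if phred > max_phred:
            hits.append((index, char, phred))
    return hits


if __name__ == "__main__":
    # First few quality characters of the merged read shown above.
    for index, char, phred in positions_above("iknefklghhkej]eegg"):
        print(index, char, phred)
```

Under that assumption, every character in the merged quality line decodes to a score above 40, while the two input reads stay at or below 39.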

jstjohn commented 13 years ago

Hello, thanks for pointing out this bug! I found the part of my code that doesn't respect the maximum quality score and fixed it. I also added a new parameter 'y', which allows you to change the maximum quality character (currently '['). Just pull the most recent version and you should have this fix.
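
For output that was already produced by an older version, a post-processing cap is one possible workaround. The following Python sketch is not part of SeqPrep; it assumes Phred+33 encoding and unwrapped four-line FASTQ records, and the function names `cap_quality` and `cap_fastq_lines` are hypothetical.

```python
def cap_quality(quality, max_phred=40, offset=33):
    """Clamp every score in a quality string to max_phred.

    Assumes Phred+offset encoding (offset=33 here).
    """
    max_char = chr(max_phred + offset)
    return "".join(min(char, max_char) for char in quality)


def cap_fastq_lines(lines, max_phred=40, offset=33):
    """Yield FASTQ lines with the quality line of each record capped.

    Assumes each record is exactly four lines:
    header, sequence, '+', quality.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index % 4 == 3:  # fourth line of each record holds the qualities
            line = cap_quality(line, max_phred, offset)
        yield line


if __name__ == "__main__":
    record = ["@example", "ACGT", "+", "Hi]I"]
    for line in cap_fastq_lines(record):
        print(line)
```

Comparing characters directly works because the encoding maps higher scores to higher ASCII codes, so clamping the character is equivalent to clamping the score.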