pjsample closed this issue 8 years ago
Hi Paul,
This script should do the trick. How big is your input file? Do you have an idea of how many different UMIs there are?
Set the following variables in the script before running:
STARCODE_PATH: absolute path to the starcode binary.
STARCODE_THREADS: number of threads starcode will run on.
UMI_L: length of the UMI sequence.
UMI_D: allowed mismatches for the UMI sequence.
SEQ_D: allowed mismatches for the rest of the sequence.
The script assumes the input sequences are all the same length. Let me know how it works.
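For readers without the script at hand, here is a minimal Python sketch of the splitting step those variables imply. It is not the script itself: the function name `split_reads` is hypothetical, and it assumes the UMI occupies the first UMI_L bases of each read, as in the layout described in the original question.

```python
def split_reads(reads, umi_l):
    """Split each read into (UMI, rest of sequence).

    Hypothetical helper: assumes the UMI is the first `umi_l` bases.
    Raises ValueError if the reads are not all the same length,
    mirroring the equal-length assumption stated above.
    """
    lengths = {len(r) for r in reads}
    if len(lengths) > 1:
        raise ValueError(f"reads have differing lengths: {sorted(lengths)}")
    return [(r[:umi_l], r[umi_l:]) for r in reads]
```

Checking the lengths up front makes a violated assumption fail loudly instead of silently producing a truncated UMI or sequence.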
Eduard
Hey Eduard,
Thanks for the script!
With a little modification, it seems to be working quite well. The original script clustered first by UMI (Sequence1) and then by Sequence2. My goal was to group first by Sequence2 and then count the number of unique UMIs for each Sequence2 group. I simply reversed the order and it works.
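For anyone following along, the reversed order can be sketched roughly as below. This is a pure-Python illustration, not the modified script and not starcode's clustering algorithm: it greedily assigns each Sequence2 to the most abundant sequence within `seq_d` edits, then counts distinct UMIs per group (exact UMI matching, i.e. d = 0). The names `levenshtein` and `count_umis_per_cluster` are hypothetical, and the UMI is assumed to be the first `umi_l` bases of each read.

```python
from collections import Counter, defaultdict


def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def count_umis_per_cluster(reads, umi_l, seq_d):
    """Group reads by Sequence2, then count unique UMIs in each group.

    Greedy stand-in for a real clusterer: sequences are visited from
    most to least abundant and join the first centroid within seq_d.
    """
    pairs = [(r[:umi_l], r[umi_l:]) for r in reads]
    abundance = Counter(seq for _, seq in pairs)

    centroids = []
    assignment = {}
    for seq, _ in abundance.most_common():
        for c in centroids:
            if levenshtein(seq, c) <= seq_d:
                assignment[seq] = c
                break
        else:
            # No centroid is close enough, so this sequence starts a new group.
            centroids.append(seq)
            assignment[seq] = seq

    umis = defaultdict(set)
    for umi, seq in pairs:
        umis[assignment[seq]].add(umi)  # set membership = exact UMI match
    return {c: len(u) for c, u in umis.items()}
```

The pairwise comparison is quadratic in the number of distinct sequences, so this only suits small inputs; for real data the clustering step is what starcode is for.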
Thanks again!
Hello,
Is there currently a way to cluster sequences while taking UMIs into consideration? Specifically, I would like to cluster sequences based on two distinct regions of a single read, each with a different Levenshtein distance threshold.
[Sequence1 (UMI) N8]--[Sequence2 N50]
Sequence1: d = 0, Sequence2: d = 3
Alternatively, I could put the UMI as read 2, if that could be exploited in some way.
Thanks, Paul