
gui11aume / starcode

All pairs search and sequence clustering

GNU General Public License v3.0


Clustering with UMIs #14

Status: Closed (pjsample closed this issue 8 years ago)

pjsample commented 8 years ago

Hello,

Is there currently a way to cluster sequences while taking UMIs into consideration? Specifically, I would like to cluster sequences based on two distinct regions of a single read, each with its own maximum Levenshtein distance.

[Sequence1 (UMI) N8]--[Sequence2 N50]

Sequence1: d = 0
Sequence2: d = 3

Alternatively, I could put the UMI as read 2, if that could be exploited in some way.

Thanks, Paul

ezorita commented 8 years ago

Hi Paul,

This script should do the trick. How big is your input file? Do you have an idea of how many different UMIs?

umi.py.tar.gz

Set the following variables in the script before running:

STARCODE_PATH: absolute path to the starcode binary.
STARCODE_THREADS: number of threads starcode will run on.
UMI_L: length of the UMI sequence.
UMI_D: allowed mismatches for the UMI sequence.
SEQ_D: allowed mismatches for the rest of the sequence.

The script assumes the input sequences are all the same length. Let me know how it works.

Eduard
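The attached script is not reproduced in this thread. As a rough illustration of what the settings above imply, the sketch below splits equal-length reads into a UMI prefix and the remaining sequence. The variable names mirror the ones listed above, but the values and the `split_reads` helper are hypothetical and are not taken from umi.py.

```python
# Hypothetical values for illustration; the script's real defaults are not shown.
UMI_L = 8  # length of the UMI prefix
UMI_D = 0  # allowed mismatches for the UMI
SEQ_D = 3  # allowed mismatches for the rest of the sequence


def split_reads(reads, umi_len=UMI_L):
    """Split each read into (umi, sequence), enforcing equal read lengths."""
    lengths = {len(read) for read in reads}
    if len(lengths) > 1:
        raise ValueError(f"reads must all be the same length, got {sorted(lengths)}")
    return [(read[:umi_len], read[umi_len:]) for read in reads]
```

Checking the lengths up front reflects the stated assumption that all input sequences are the same length, so that a fixed offset cleanly separates the two regions.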

pjsample commented 8 years ago

Hey Eduard,

Thanks for the script!

With a little modification, it seems to be working quite well. As written, the script clustered first by UMI (Sequence1) and then by Sequence2. My goal was to group first by Sequence2 and then count the number of unique UMIs in each Sequence2 cluster. I simply reversed the order, and it works.

Thanks again!
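For readers who want to see the reversed order spelled out, here is a minimal pure-Python sketch: group reads by Sequence2 within a Levenshtein distance, then count the distinct UMIs in each group (UMIs are compared exactly, i.e. d = 0). This is not the modified script from the thread. A naive greedy grouping stands in for starcode here, and the function names and defaults are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance computed with the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def count_umis_per_cluster(reads, umi_len=8, seq_d=3):
    """Group reads by sequence (distance <= seq_d), then count distinct UMIs."""
    clusters = {}  # representative sequence -> set of exact UMIs seen with it
    for read in reads:
        umi, seq = read[:umi_len], read[umi_len:]
        # Greedy assignment: join the first representative within seq_d.
        for rep in clusters:
            if levenshtein(seq, rep) <= seq_d:
                clusters[rep].add(umi)
                break
        else:
            # No representative is close enough, so start a new cluster.
            clusters[seq] = {umi}
    return {rep: len(umis) for rep, umis in clusters.items()}
```

The greedy pass compares every read against every representative, so it is only practical for small inputs. For real data, the sequence-grouping step would be handed to starcode, as in the script above.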