gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0

compatible with split-seq data #36

Closed bettycatherine closed 4 years ago

bettycatherine commented 4 years ago

Hello, I am working with split-seq data, and in the original split-seq paper they used starcode to collapse UMIs. The raw split-seq data are paired-end reads: read 1 contains the cDNA, and read 2 mainly contains the UMI, barcodes, and linkers, with the UMI being the first 10 bp of read 2. So the question is how I should give these data to starcode-umi. If I understand correctly, the sequence distance in starcode-umi means the distance between cDNA sequences? Starcode-umi clusters the cDNA first and then clusters the UMIs from similar cDNA to collapse them, so should I extract the 10 bp UMI from read 2 and attach it to read 1? This is really bothering me, and any answer will be highly appreciated. Thank you!

Betty

ezorita commented 4 years ago

Hi Betty,

I don't know the exact details of how they used starcode in the split-seq paper. However, per your description, the answer is yes. Starcode-umi expects a single sequence per read that contains the UMI first, followed by the sequence to be clustered.

So preprocess the reads into single sequences of UMI followed by cDNA, then run starcode-umi with the option --umi-len 10 to tell starcode that your UMIs are 10 bp long. You can set the match distance for the UMI part and the sequence part by passing the options --umi-d and --seq-d respectively. If you don't set them, the defaults are distance 0 for the UMI (exact matches) and an automatic distance for the sequence, depending on its length.
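The preprocessing step could look roughly like the Python sketch below. It is only an illustration under some assumptions: the two FASTQ files list the read pairs in the same order, each record spans exactly four lines, and a plain one-sequence-per-line file is acceptable as input. The function names `fastq_sequences` and `merge_umi_cdna` are made up for this example and are not part of starcode, and the cell barcodes on read 2 are ignored here.

```python
import gzip
from itertools import islice


def fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def merge_umi_cdna(read1_path, read2_path, out_path, umi_len=10):
    """Write UMI + cDNA, one merged sequence per line.

    The UMI is taken as the first `umi_len` bases of read 2 and the cDNA is
    the whole of read 1. Assumes both files hold the pairs in the same order.
    Returns the number of merged sequences written.
    """
    written = 0
    pairs = zip(fastq_sequences(read1_path), fastq_sequences(read2_path))
    with open(out_path, "w") as out:
        for cdna, read2 in pairs:
            if len(read2) < umi_len:
                continue  # read 2 too short to contain a full UMI; skip pair
            out.write(read2[:umi_len] + cdna + "\n")
            written += 1
    return written
```

For example, `merge_umi_cdna("R1.fastq.gz", "R2.fastq.gz", "umi_cdna.txt")` (file names are placeholders) would produce a file that can then be given to starcode-umi together with --umi-len 10.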

Eduard

bettycatherine commented 4 years ago

Thank you very much for your reply; it is very clear to me now!

Betty