gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0

## 1 mismatch by cluster #26

Open penglbio opened 6 years ago

penglbio commented 6 years ago

Sorry to trouble you. In a paper, I saw someone use your software (starcode) to cluster sequences within 1 nt mismatch. The paper's title and the relevant description are as follows.

Title: Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding

Description: "We then used Starcode (45) to collapse UMIs of aligned reads that were within 1 nt mismatch of another UMI"

I am confused because I couldn't find a parameter for this in your software. Can you tell me whether there is a way to do this?

ezorita commented 6 years ago

The parameter -d specifies the clustering distance (the number of mismatched nucleotides you want to allow). So, in your example with distance 1, you'd run starcode as follows:

starcode -d1 input-file.fastq

Hope it helps.

penglbio commented 6 years ago

I will try. Thank you very much.

penglbio commented 6 years ago

How about FASTA input? I tested it as follows, but it doesn't work:

    $ starcode -d 1 test_file.fasta
    running starcode with 1 thread
    reading input files
    FASTA format detected
    sorting
    progress: 100.00%
    message passing clustering
    AGGGCTTACAAGTATAGGCC 2
    CCTCATTATTTGTCGCAATG 1
    TGCGCCAAGTACGATTTCCG 1
    TGGGCTTACAAGTATAGGCC 1

The last sequence has just 1 mismatch with the first.

ezorita commented 6 years ago

Note that you are using the message passing algorithm for clustering. Message passing has a parameter called --cluster-ratio, which is set to 5 by default. This parameter sets the minimum ratio between the counts of two sequences for one to be clustered with the other. In other words, by default two sequences will only be clustered together if the count of one is at least 5 times the count of the other.

In your example, you are running starcode with just a few sequences and default parameters. The last and the first sequence did not cluster together because their count ratio is only 2 (the first has 2 counts and the last has only 1), which is below the default of 5.
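The rule described above can be sketched in a few lines of Python. This is only an illustration, not starcode's implementation: `hamming` and `would_merge` are hypothetical helper names, the distance is counted as plain mismatches between equal-length sequences, and the "at least 5 times" test is taken from the explanation in this thread.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def would_merge(seq_a, count_a, seq_b, count_b, max_dist=1, cluster_ratio=5.0):
    """Hypothetical helper: True if both the distance and count-ratio tests pass."""
    close_enough = hamming(seq_a, seq_b) <= max_dist
    big, small = max(count_a, count_b), min(count_a, count_b)
    # Assumption: merge only if the larger count is at least ratio * smaller count.
    ratio_ok = big >= cluster_ratio * small
    return close_enough and ratio_ok


# First and last sequences from the output above, with their counts.
first = ("AGGGCTTACAAGTATAGGCC", 2)
last = ("TGGGCTTACAAGTATAGGCC", 1)

print(hamming(first[0], last[0]))                      # 1
print(would_merge(*first, *last))                      # False (2 < 5 * 1)
print(would_merge(*first, *last, cluster_ratio=2.0))   # True
```

Under these assumptions the two sequences are within distance 1 but fail the default ratio test, which matches the output reported above.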

So, to solve this, do one of the following:

  1. Run starcode with the whole dataset (this assumes that each canonical sequence is over-represented compared to the others).
  2. Run starcode with a smaller --cluster-ratio.
  3. Use the spheres clustering algorithm (this is set with the parameter -s).
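As a rough sketch, options 2 and 3 could be assembled from a script like this. The helper `starcode_cmd` is hypothetical, the ratio value of 1 is only an example, and the commands are printed rather than executed; only the flags mentioned in this thread (-d, --cluster-ratio, -s) are used.

```python
import shlex


def starcode_cmd(input_file, distance=1, cluster_ratio=None, sphere=False):
    """Hypothetical helper: build a starcode argument list (does not run it)."""
    cmd = ["starcode", "-d", str(distance)]
    if sphere:
        # Option 3: spheres clustering instead of message passing.
        cmd.append("-s")
    elif cluster_ratio is not None:
        # Option 2: message passing with a smaller cluster ratio.
        cmd += ["--cluster-ratio", str(cluster_ratio)]
    cmd.append(input_file)
    return cmd


print(shlex.join(starcode_cmd("test_file.fasta", cluster_ratio=1)))
print(shlex.join(starcode_cmd("test_file.fasta", sphere=True)))
```

The returned list can be passed to `subprocess.run` if starcode is installed and on the PATH.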

Hope it helps.

bettycatherine commented 4 years ago

I am really confused. Starcode was used in that paper for UMI collapse, so I think they used starcode-umi but not starcode. Am I correct? I am also wondering if there is any advice on how to set sequence distance when we use starcode-umi. Thank you very much!

wangjianing-web commented 4 years ago

> I am really confused. Starcode was used in that paper for UMI collapse, so I think they used starcode-umi but not starcode. Am I correct? I am also wondering if there is any advice on how to set sequence distance when we use starcode-umi. Thank you very much!

But the UMI (10 bp) is in the R2.fq file. The paper says the cDNA reads (Read 1) were mapped to the genome, and then Starcode (45) was used to collapse UMIs of aligned reads that were within 1 nt mismatch of another UMI, assuming the two aligned reads were also from the same UBC. I don't know whether I should combine the UMI and Read 1, because the combined read then cannot be mapped to the genome. I don't know what the correct method is.

ezorita commented 4 years ago

Hi @wangjianing-web. I can't tell which is the correct method they used. You should contact the authors for more details on how they used starcode in their work. What I understand from your description is that they followed these steps: