gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0

Offending sequences #37

Open kjkjindal opened 3 years ago

kjkjindal commented 3 years ago

Hi, I am trying to run starcode sphere clustering on a set of sequences. These sequences contain certain (non-DNA) prefixes that I need to retain. I notice that starcode aborts when it encounters non-DNA characters in a sequence. Is this constraint essential to its (or specifically the sphere clustering algorithm's) function?

Thanks!

gui11aume commented 3 years ago

Hi! The issue is not sphere clustering per se but sequence clustering itself. If two identical sequences carry different non-DNA tags, how would you suggest grouping them in the same cluster?

I am not sure what your biological problem is, but I would recommend approaching it this way:

  1. Extract the pure DNA suffixes into a new file, keeping the same line order as the original file.
  2. Run starcode on the DNA suffixes with the flag --seq-id.
  3. Use the line numbers in the output to retrieve the clusters from the original file.
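
The three steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not something taken from starcode itself. It assumes that each record ends in a run of uppercase A, C, G, T or N with the tag in front, and that with --seq-id the last tab-separated column of each output line is a comma-separated list of 1-based input line numbers. The helper names, the file names and the sample records are hypothetical.

```python
import re

# Assumption: the DNA part is the trailing run of uppercase A/C/G/T/N;
# everything before it is the non-DNA tag that must be retained.
DNA_SUFFIX = re.compile(r"[ACGTN]+$")


def extract_dna_suffixes(records):
    """Step 1: return one DNA suffix per record, in the original order."""
    suffixes = []
    for lineno, record in enumerate(records, start=1):
        match = DNA_SUFFIX.search(record.strip())
        if match is None:
            raise ValueError(f"line {lineno}: no DNA suffix found")
        suffixes.append(match.group(0))
    return suffixes


def clusters_from_seq_ids(starcode_lines, records):
    """Step 3: map 1-based line numbers back to the original records.

    Assumption: the last tab-separated field of each output line holds
    comma-separated 1-based indices into the input file.
    """
    clusters = []
    for line in starcode_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ids = [int(i) for i in fields[-1].split(",")]
        clusters.append([records[i - 1] for i in ids])
    return clusters


# Step 2 happens outside Python, for example (file names are made up):
#   starcode -s --seq-id -i suffixes.txt -o clusters.txt

# Hypothetical records and a hand-written output in the assumed format.
records = ["tag1:ACGTACGT", "tag2:ACGTACGA", "tag3:TTTTGGGG"]
suffixes = extract_dna_suffixes(records)
example_output = ["ACGTACGT\t2\t1,2", "TTTTGGGG\t1\t3"]
clusters = clusters_from_seq_ids(example_output, records)
```

Because the suffix file keeps one line per original record, the line numbers reported by starcode index directly into the original file and the tags never have to pass through starcode.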