phglab / ALFATClust

Biological sequence clustering tool with dynamic threshold
GNU General Public License v3.0

"The input sequences consist of both DNA and protein sequences" #3

Closed: xiekunwhy closed this issue 2 years ago

xiekunwhy commented 2 years ago

Hi,

I am very sure that the sequences in the input file are all DNA sequences (they are just a subset of a reference genome), but ALFATClust told me "The input sequences consist of both DNA and protein sequences". Why?

python alfatclust.py -i osa.ltr.fa -o osa.ltr.alfatclust

Here is the file I tested: osa.ltr.zip

Best, Kun

jimmykhchiu commented 2 years ago

Hi Kun,

The sequence annotated as 'LTRrLTR_harvest03493' in your file contains a long segment of 'N' characters. To ensure high-quality clusters for DNA sequences, ALFATClust requires every input sequence to consist mainly (at least 95% of its length) of non-ambiguous DNA codes (i.e. A/C/G/T). Unlike the other two sequences in the file that also contain ambiguous 'N' symbols, this sequence does not meet the 95% criterion, but it still (theoretically) qualifies as an amino acid sequence, since 'N' is also a valid amino acid code (asparagine). That is why the error message is shown.

It is understood that the 'N' symbols often originate from the source data. If you would rather not leave this sequence out of the clustering, you may consider lowering the criterion to 90% (by changing 0.95 to 0.9 in line 11 of the module script 'Utils.py' under the 'modules' folder) so that this sequence meets the requirement. The trade-off is possibly lower cluster quality, depending on how many sequences contain these ambiguous symbols and how long the ambiguous segments are.
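To see in advance which sequences would fall below the threshold, a small standalone script along the following lines may help. This is only a sketch and not ALFATClust's actual implementation: the function names (parse_fasta, unambiguous_fraction, sequences_below_threshold, report) are hypothetical, and it assumes that the check is simply the fraction of A/C/G/T characters over the full sequence length, compared case-insensitively.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: only these four symbols count as non-ambiguous DNA codes.
UNAMBIGUOUS_DNA = frozenset("ACGT")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""  # keep only the first header token
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def unambiguous_fraction(seq: str) -> float:
    """Fraction of the sequence made up of A/C/G/T (case-insensitive)."""
    if not seq:
        return 0.0
    hits = sum(1 for base in seq.upper() if base in UNAMBIGUOUS_DNA)
    return hits / len(seq)


def sequences_below_threshold(
    records: Iterable[Tuple[str, str]], min_fraction: float = 0.95
) -> List[Tuple[str, float]]:
    """Return (name, fraction) for every record below min_fraction."""
    flagged = []
    for name, seq in records:
        fraction = unambiguous_fraction(seq)
        if fraction < min_fraction:
            flagged.append((name, fraction))
    return flagged


def report(path: str, min_fraction: float = 0.95) -> None:
    """Print the sequences in a FASTA file that fail the threshold."""
    with open(path) as handle:
        for name, fraction in sequences_below_threshold(parse_fasta(handle), min_fraction):
            print(f"{name}\t{fraction:.3f}")
```

Calling report("osa.ltr.fa") would list the sequences to remove or inspect before clustering; calling it again with min_fraction=0.9 gives a rough idea of what lowering the criterion would change.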

Thank you for your feedback and feel free to let us know if you have further questions.

Jimmy

jimmykhchiu commented 2 years ago

Hi @xiekunwhy, may I know whether the problem above has been solved? I will close this issue if no further follow-up is required.