Open memoll opened 4 years ago
Hi Mona,
Currently, I don't have a correction for this. I've considered it, but there's only a limited amount that I can do to counteract the variety of naming conventions used in different RefSeq entries. I'm hesitant to force uppercase or lowercase, as this may obscure some names.
If you can provide me with as many examples as possible, I could probably develop a script that would run on the results of a search to "sanitize" them (with a warning that there may be some loss of information as it attempts to correct them), but I don't see a way to correct this in the RefSeq database itself.
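A rough sketch of what such a sanitizing pass might look like, in Python. This is only an illustration under assumptions, not an existing script: the function names `normalize_name` and `merge_counts` are hypothetical, the input is assumed to be a plain mapping of function name to count, and the rules only cover the kinds of variation reported in this thread (dashes, commas, letter case, and a trailing ", partial"). The normalized form is used only as a grouping key, so the original spelling is kept for display rather than being forced to upper or lower case.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Build a grouping key for a protein name (hypothetical helper)."""
    key = name.strip()
    # Drop a trailing ", partial" marker. This loses the
    # partial/complete distinction, which is the information-loss caveat.
    key = re.sub(r",\s*partial$", "", key, flags=re.IGNORECASE)
    # Compare case-insensitively; the key is never shown to the user.
    key = key.casefold()
    # Treat dashes and commas as spaces, then collapse runs of whitespace.
    key = re.sub(r"[-,]", " ", key)
    key = re.sub(r"\s+", " ", key).strip()
    return key


def merge_counts(counts: dict) -> dict:
    """Sum counts of names that share a grouping key.

    Assumes `counts` maps function name -> abundance. The first spelling
    seen for each key is kept as the display name.
    """
    merged = defaultdict(int)
    display = {}
    for name, n in counts.items():
        key = normalize_name(name)
        merged[key] += n
        display.setdefault(key, name)
    return {display[key]: total for key, total in merged.items()}


if __name__ == "__main__":
    example = {
        "(3R)-hydroxymyristoyl ACP dehydratase": 4,
        "(3R)-hydroxymyristoyl-ACP dehydratase": 6,
        "(2Fe-2S) ferredoxin": 3,
        "(2Fe-2S) ferredoxin, partial": 2,
    }
    for name, total in merge_counts(example).items():
        print(f"{total}\t{name}")
```

Replacing every dash with a space is aggressive and could merge names that are genuinely different, so the merged groups would need a manual check before being used in a statistical analysis.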
If you have suggestions, please let me know.
Hi Sam,
First of all, I'd like to thank you for your quick responses.
I am currently looking into other databases to see if I can get better annotation results for soil organisms and functions. I'll also go through my previous annotations and send you more examples ASAP.
Hi Sam,
In my statistical analysis, some of the functions from the RefSeq_bac database are being categorized as different proteins only because of a small difference in their names, such as a dash (e.g. "(3R)-hydroxymyristoyl ACP dehydratase" and "(3R)-hydroxymyristoyl-ACP dehydratase"), a comma, or lowercase vs. uppercase letters (e.g. "(2fe-2S)-binding domain-containing protein" and "(2Fe-2S)-binding domain-containing protein"). In other cases, partial and complete sequences of the same protein are listed separately (e.g. "(2Fe-2S) ferredoxin" and "(2Fe-2S) ferredoxin, partial").
I wanted to know whether you correct those names in the database or after annotation-aggregation. If so, could you please guide me on how to do it?
-Mona