transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

How to deal with the same proteins with slightly different names from the RefSeq_bac database? #52

Open memoll opened 3 years ago

memoll commented 3 years ago

Hi Sam,

In my statistical analysis, some functions from the RefSeq_bac database are being treated as different proteins only because of small differences in their names, such as a dash (e.g. "(3R)-hydroxymyristoyl ACP dehydratase" and "(3R)-hydroxymyristoyl-ACP dehydratase"), a comma, or upper/lowercase letters (e.g. "(2fe-2S)-binding domain-containing protein" and "(2Fe-2S)-binding domain-containing protein"). Others are partial and complete sequences of the same protein (e.g. "(2Fe-2S) ferredoxin" and "(2Fe-2S) ferredoxin, partial").

I wanted to know whether you correct those names in the database or after annotation-aggregation. If so, could you please guide me on how to do it?

-Mona

transcript commented 3 years ago

Hi Mona,

Currently, I don't have a correction for this. I've considered it, but there's only a limited amount that I can do to counteract the variety of naming conventions used in different RefSeq entries. I'm hesitant to force uppercase or lowercase, as this may obscure some names.

If you can provide me with as many examples as possible, I could probably develop a script that would run on the results of a search to "sanitize" them (with a warning that there may be some loss of information as it attempts to correct them), but I don't see a way to correct this in the RefSeq database itself.
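
For illustration only, here is a rough sketch of what such a post-search sanitizing step might look like. It is not part of SAMSA2. The function names (normalize_name, merge_counts), the normalization rules, and the assumption that the aggregated results have already been loaded into a dict of protein name to read count are all hypothetical. The rules cover only the three kinds of variation mentioned in this thread (dashes and commas, letter case, and a trailing ", partial").

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Build a comparison key for a protein name (hypothetical rules).

    The key is only used for grouping; it is never shown to the user.
    """
    key = name.strip()
    # Drop a trailing ", partial" marker so partial and complete entries group together.
    key = re.sub(r",\s*partial$", "", key, flags=re.IGNORECASE)
    # Ignore letter case when comparing.
    key = key.casefold()
    # Treat dashes and commas as spaces, then collapse repeated whitespace.
    key = re.sub(r"[-,]", " ", key)
    key = re.sub(r"\s+", " ", key).strip()
    return key


def merge_counts(counts):
    """Merge counts of names that share a key.

    `counts` maps an original protein name to its read count (assumed input shape).
    The most abundant original spelling is kept as the label for each group.
    """
    grouped = defaultdict(dict)
    for name, count in counts.items():
        grouped[normalize_name(name)][name] = count

    merged = {}
    for variants in grouped.values():
        # Ties are broken by name so the result is deterministic.
        label = max(variants, key=lambda n: (variants[n], n))
        merged[label] = sum(variants.values())
    return merged


if __name__ == "__main__":
    example = {
        "(3R)-hydroxymyristoyl ACP dehydratase": 5,
        "(3R)-hydroxymyristoyl-ACP dehydratase": 3,
        "(2Fe-2S) ferredoxin": 2,
        "(2Fe-2S) ferredoxin, partial": 1,
    }
    for name, count in merge_counts(example).items():
        print(count, name)
```

Because the normalized key is used only to group entries, and an original spelling is kept as the label, no name is forced to uppercase or lowercase in the output. The loss of information mentioned above still applies: two genuinely different proteins whose names differ only by a dash, a comma, or letter case would be merged, so the rules would need to be checked against real examples.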

If you have suggestions, please let me know.

memoll commented 3 years ago

Hi Sam,

First of all, I'd like to thank you for your quick responses.

I am currently looking into other databases to see if I can get better annotation results for soil organisms and functions. I'll also go back through my previous annotations and find more examples ASAP.