Open matthiaskoenig opened 5 years ago
To clarify this: The actual information in the flatfile is:
KM #4,5,17,63# 1.7 {2-oxoglutarate} (#4# pH 8.0, 25°C, substrate L-valine
<23,40>; #5,17# pH 8.0, 37°C, isoenzyme III, valine as amino group
donor <24>; #63# half transamination reaction, pH 8.0, 90°C <96>)
<23,24,40,96>
There is no way for me to resolve which references belong to which protein ids. This only affects a subset of entries; most entries have a unique reference.
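To make the ambiguity concrete, here is a minimal sketch that splits the KM line quoted above into its parts. `KM_LINE_RE` and `parse_km_line` are hypothetical names used only for illustration, they are not part of brendapy, and the pattern is only checked against this one entry. The trailing reference list is the union over all proteins; the per-protein references appear only inside the comment.

```python
import re

# Hypothetical helper, not part of brendapy: splits one KM flatfile line
# into protein ids, value, substrate, comment and trailing references.
KM_LINE_RE = re.compile(
    r"^KM\s+#(?P<proteins>[\d,]+)#\s+(?P<value>\S+)\s+\{(?P<substrate>[^}]*)\}"
    r"\s*(?:\((?P<comment>.*)\))?\s*<(?P<refs>[\d,]+)>\s*$"
)


def parse_km_line(line: str) -> dict:
    # The flatfile wraps long entries, so collapse all whitespace first.
    text = " ".join(line.split())
    match = KM_LINE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized KM line: {text!r}")
    return {
        "proteins": [int(p) for p in match["proteins"].split(",")],
        "value": match["value"],
        "substrate": match["substrate"],
        "comment": match["comment"] or "",
        "refs": [int(r) for r in match["refs"].split(",")],
    }


line = """KM #4,5,17,63# 1.7 {2-oxoglutarate} (#4# pH 8.0, 25°C, substrate L-valine
<23,40>; #5,17# pH 8.0, 37°C, isoenzyme III, valine as amino group
donor <24>; #63# half transamination reaction, pH 8.0, 90°C <96>)
<23,24,40,96>"""
entry = parse_km_line(line)
print(entry["proteins"], entry["refs"])
```

The top-level fields alone only say that proteins 4, 5, 17 and 63 share references 23, 24, 40 and 96; which reference goes with which protein has to come from the comment.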
I just encountered the same issue. When you say that it cannot be fixed, do you mean that there is not sufficient information or that it would be difficult with the current code structure?
If it's about the information, looking at this and other entries with the same problem it looks like:
Isn't this sufficient to assign references to proteins? With the same approach it should be possible to assign to each protein only the portion of the comment that is relevant for it (at the moment all proteins will get all comments, making it hard to parse whether a protein is wildtype or mutant).
Ok, it's not as easy as I first thought but the function below works for me. In cases where at least one protein has a comment and at least one doesn't it still needs to fall back to returning all references (see example in the comment), but this is quite rare. Maybe this can be integrated directly in your parser?
import re
from typing import Any, Dict

from brendapy import BrendaProtein

# One comment segment: "#protein ids# comment text <reference ids>".
comment_re = re.compile(r"#([^#]+)#([^<]+)<([^>]+)>")


def get_protein_comment_and_references(protein: BrendaProtein, entry: Dict[str, Any]):
    if "comment" in entry:
        # Split on ">;" rather than ";" because comment text can itself contain ";".
        tokens = entry["comment"].split(">;")
        # Restore the ">" that the split removed from all but the last token.
        for i in range(len(tokens) - 1):
            tokens[i] = tokens[i] + ">"
        for token in tokens:
            result = comment_re.match(token.strip())
            if result is None:
                # Skip segments that do not follow the expected pattern.
                continue
            # Check if this is the comment for the protein we are interested in.
            if str(protein.protein_id) in result.group(1).split(","):
                comment = result.group(2).strip()
                reference_ids = [int(r) for r in result.group(3).split(",")]
                return comment, reference_ids
    # If at least one comment is empty then we really don't know how to map references.
    # In most cases only one protein is referenced, and returning all references is
    # correct anyway. However, in some cases it's unavoidable that we'll assign more
    # references than necessary to a single protein. For example protein 19 in:
    # KM #19,41# 16 {ethanol} (#41# pH 8.8 <28>) <28,71>
    return "", entry["refs"]


# Usage, given a BrendaProtein instance `protein`:
for entry in protein.KM:
    comment, reference_ids = get_protein_comment_and_references(protein, entry)
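For comparison, a sketch of a variant that resolves all proteins of an entry in one pass. It swaps the split on ">;" for `re.finditer` over the same pattern, so it is a different technique from the function above, not a drop-in replacement. `map_comments` and its plain arguments (protein ids, comment string, reference list) are hypothetical and do not reflect brendapy's data structures; comments that contain "<" in their text would still break it.

```python
import re
from typing import Dict, List, Tuple

SEGMENT_RE = re.compile(r"#([^#]+)#([^<]+)<([^>]+)>")


def map_comments(
    protein_ids: List[int], comment: str, refs: List[int]
) -> Dict[int, Tuple[str, List[int]]]:
    # Fallback for every protein: no comment and all references of the entry.
    mapping = {pid: ("", list(refs)) for pid in protein_ids}
    # Each match is one "#ids# text <refs>" segment of the comment.
    for match in SEGMENT_RE.finditer(comment):
        ids, text, segment_refs = match.groups()
        resolved = [int(r) for r in segment_refs.split(",")]
        for pid in ids.split(","):
            mapping[int(pid)] = (text.strip(), resolved)
    return mapping


# The ethanol entry from the code comment above:
# KM #19,41# 16 {ethanol} (#41# pH 8.8 <28>) <28,71>
ethanol = map_comments([19, 41], "#41# pH 8.8 <28>", [28, 71])
print(ethanol)
```

Protein 41 gets only reference 28, while protein 19 has no comment segment and therefore keeps the fallback of all references, which is the same limitation described above.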
@dotPiano Thanks for the contribution. I will incorporate this in the next release. I tried some heuristics before, but the problem is when testing over the complete BRENDA flatfile there are always exceptions ;). Sorry for the slow reply, I was on holidays.
Glad it helps! And I noticed the issue of exceptions as well. For example, the code above needs to split comments on ">;" instead of ";" and then add back the ">" because of one entry that contains ";" in the comment. ;)
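A small sketch of why the split matters. The comment string here is made up for illustration (it is not a real BRENDA entry), and `split_segments` is a hypothetical helper name.

```python
from typing import List

# Made-up comment with a ";" inside the first segment's text.
comment = "#4# pH 8.0; 25°C <23,40>; #5,17# pH 8.0, 37°C <24>"


def split_segments(comment: str) -> List[str]:
    tokens = comment.split(">;")
    # Every token except the last lost its closing ">" in the split.
    return [t.strip() + ">" for t in tokens[:-1]] + [tokens[-1].strip()]


# Splitting on ";" alone cuts the first segment in two.
naive = [t.strip() for t in comment.split(";")]
segments = split_segments(comment)
print(len(naive), len(segments))
```

The naive split yields three pieces, one of which no longer starts with a protein id block, while splitting on ">;" keeps the two segments intact.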
No stress, I currently have this on my side and it works smoothly but I'm happy to rely fully on brendapy once you update it.
References are often complicated to resolve from the flatfile. It would be necessary to parse the comments for that (and it is probably not possible to solve this in a clean way). See for instance the example below, which lists 4 references in total for the protein but does not state which reference is for 2-oxoglutarate (only the comment information is available).
This currently cannot be solved.