Resolve Km entries to actual reference

matthiaskoenig commented 5 years ago

References are often complicated to resolve from the flatfile. It would be necessary to parse the comments for that (and it is probably not possible to solve this in a clean way). See for instance the example below, which states 4 references in total for the protein but does not state which reference is for 2-oxoglutarate (only the comment information is available).

This can currently not be solved.

Hi @matthiaskoenig

There is still an issue with the references and comments of Km values. Multiple references and multiple comment lines are mapped to a single Km value. On the webpage, each Km value is mapped to one reference and one comment. Kindly look into this.

Test case:

    ec = '2.6.1.42'
    organism = 'Homo sapiens'
    protein_id = 5

Output returned:

    {'comment': '#4# pH 8.0, 25°C, substrate L-valine <23,40>; '
                '#5,17# pH 8.0, 37°C, isoenzyme III, valine as '
                'amino group donor <24>; #63# half transamination '
                'reaction, pH 8.0, 90°C <96>',
     'data': '1.7 {2-oxoglutarate}',
     'refs': [23, 24, 40, 96],
     'substrate': '2-oxoglutarate',
     'units': 'mM',
     'value': 1.7}

matthiaskoenig commented 5 years ago

To clarify this: The actual information in the flatfile is:

KM  #4,5,17,63# 1.7 {2-oxoglutarate}  (#4# pH 8.0, 25°C, substrate L-valine
    <23,40>; #5,17# pH 8.0, 37°C, isoenzyme III, valine as amino group
    donor <24>; #63# half transamination reaction, pH 8.0, 90°C <96>)
    <23,24,40,96>

There is no way for me to resolve which references belong to which protein ids. This only affects a subset of entries; in most cases there is a unique reference.

dotPiano commented 2 years ago

I just encountered the same issue. When you say that it cannot be fixed, do you mean that there is not sufficient information, or that it would be difficult with the current code structure?

If it's about the information, then looking at this and other entries with the same problem, it looks like:

- All proteins/measurements having the same value for a certain substrate are pooled in the same line. The flatfile extract above seems to report four measurements for 2-oxoglutarate from different proteins/organisms.
- The comment is a ";"-separated string in which each segment begins with the protein ID(s) and ends with the references for those proteins.

Isn't this sufficient to assign references to proteins? With the same approach it should be possible to assign to each protein only the portion of the comment that is relevant for it (at the moment all proteins will get all comments, making it hard to parse whether a protein is wildtype or mutant).
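
To make the idea concrete, here is a minimal sketch of splitting the comment into per-protein pieces. The function name `split_comment_by_protein` and the regular expression are illustrative assumptions and not part of brendapy. The sketch also assumes that every segment has the form `#ids# text <refs>`, that protein IDs are integers, and that the comment text itself contains no `#` or `<`.

```python
import re
from typing import Dict, List, Tuple

# One segment looks like "#5,17# some text <24>":
# protein IDs, free text, then reference IDs.
SEGMENT_RE = re.compile(r"#([^#]+)#([^<]*)<([^>]+)>")


def split_comment_by_protein(comment: str) -> Dict[int, Tuple[str, List[int]]]:
    """Map each protein ID to the (comment text, reference IDs) of its segment."""
    mapping: Dict[int, Tuple[str, List[int]]] = {}
    for match in SEGMENT_RE.finditer(comment):
        text = match.group(2).strip()
        refs = [int(r) for r in match.group(3).split(",")]
        # A segment may be shared by several proteins, e.g. "#5,17#".
        for protein_id in match.group(1).split(","):
            mapping[int(protein_id)] = (text, refs)
    return mapping


# Comment string taken from the 2.6.1.42 entry quoted earlier in this thread.
comment = (
    "#4# pH 8.0, 25°C, substrate L-valine <23,40>; "
    "#5,17# pH 8.0, 37°C, isoenzyme III, valine as amino group donor <24>; "
    "#63# half transamination reaction, pH 8.0, 90°C <96>"
)
per_protein = split_comment_by_protein(comment)
print(per_protein[5])
```

For protein 5 this yields only the "isoenzyme III" text and reference 24, instead of all four references.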

dotPiano commented 2 years ago

Ok, it's not as easy as I first thought but the function below works for me. In cases where at least one protein has a comment and at least one doesn't it still needs to fall back to returning all references (see example in the comment), but this is quite rare. Maybe this can be integrated directly in your parser?

import re
from typing import Any, Dict, List, Tuple

from brendapy import BrendaProtein

# Matches one comment segment: "#<protein ids># <text> <<reference ids>>".
comment_re = re.compile(r"#([^#]+)#([^<]+)<([^>]+)>")


def get_protein_comment_and_references(
    protein: BrendaProtein, entry: Dict[str, Any]
) -> Tuple[str, List[int]]:
    if "comment" in entry:
        # Split on ">;" rather than ";" because the comment text may itself
        # contain ";", then restore the ">" that the split removed.
        tokens = entry["comment"].split(">;")
        for i in range(len(tokens) - 1):
            tokens[i] = tokens[i] + ">"

        for token in tokens:
            result = comment_re.match(token.strip())
            if result is None:
                # Segment does not follow the expected pattern; skip it.
                continue
            # Check if this is the comment for the protein we are interested in.
            protein_ids = [p.strip() for p in result.group(1).split(",")]
            if str(protein.protein_id) in protein_ids:
                comment = result.group(2).strip()
                reference_ids = [int(r) for r in result.group(3).split(",")]
                return comment, reference_ids

    # If at least one comment is empty then we really don't know how to map references.
    # In most cases only one protein is referenced, and returning all references is
    # correct anyway. However, in some cases it's unavoidable that we'll assign more
    # references than necessary to a single protein. For example protein 19 in:
    # KM    #19,41# 16 {ethanol}  (#41# pH 8.8 <28>) <28,71>
    return "", entry["refs"]


# Usage: `protein` is a BrendaProtein obtained from the parser.
for entry in protein.KM:
    comment, reference_ids = get_protein_comment_and_references(protein, entry)
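
The fallback can be checked without loading the flatfile. The sketch below is a simplified stand-in for the function above, not the brendapy API: the name `resolve` is made up for illustration, it takes a plain integer protein ID instead of a `BrendaProtein`, and the entry dict is hand-built from the ethanol line quoted in the code comment.

```python
import re
from typing import Any, Dict, List, Tuple

# Same segment pattern idea as above: "#ids# text <refs>".
SEGMENT_RE = re.compile(r"#([^#]+)#([^<]*)<([^>]+)>")


def resolve(protein_id: int, entry: Dict[str, Any]) -> Tuple[str, List[int]]:
    """Return (comment, refs) for one protein, or all refs if it has no comment."""
    for match in SEGMENT_RE.finditer(entry.get("comment", "")):
        ids = [p.strip() for p in match.group(1).split(",")]
        if str(protein_id) in ids:
            refs = [int(r) for r in match.group(3).split(",")]
            return match.group(2).strip(), refs
    # No segment mentions this protein: fall back to every reference of the entry.
    return "", entry["refs"]


# Hand-built from: KM  #19,41# 16 {ethanol}  (#41# pH 8.8 <28>) <28,71>
entry = {"comment": "#41# pH 8.8 <28>", "refs": [28, 71]}

print(resolve(41, entry))  # protein with its own comment segment
print(resolve(19, entry))  # protein without a comment: fallback to all refs
```

Protein 41 resolves to reference 28 only, while protein 19 still receives both 28 and 71, which is the unavoidable over-assignment described in the code comment.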

matthiaskoenig commented 2 years ago

@dotPiano Thanks for the contribution. I will incorporate this in the next release. I tried some heuristics before, but the problem is when testing over the complete BRENDA flatfile there are always exceptions ;). Sorry for the slow reply, I was on holidays.

dotPiano commented 2 years ago

Glad it helps! And I noticed the issue of exceptions as well. For example the code above needs to split comments on >; instead of ; and then add back the > because of one entry that contains ";" in the comment. ;)
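
A small sketch of why the split has to be on ">;". The comment string here is a made-up example (not an actual BRENDA entry) whose first segment contains a ";" inside the text, and `split_segments` is just an illustrative helper name.

```python
from typing import List


def split_segments(comment: str) -> List[str]:
    """Split on '>;' and restore the '>' that the split removed."""
    tokens = comment.split(">;")
    # Every token except the last one lost its closing '>'.
    return [t.strip() + ">" for t in tokens[:-1]] + [tokens[-1].strip()]


# Hypothetical comment: the first segment has a ';' inside its text.
comment = "#1# pH 7.5; 30°C <12>; #2# mutant enzyme <13>"

naive = [t.strip() for t in comment.split(";")]
robust = split_segments(comment)

print(naive)   # first segment is cut in two
print(robust)  # one token per protein segment
```

Splitting on ";" alone produces three tokens and breaks the first segment apart, while splitting on ">;" keeps one token per segment.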

No stress, I currently have this on my side and it works smoothly but I'm happy to rely fully on brendapy once you update it.

matthiaskoenig / brendapy

Resolve Km entries to actual reference #31