redundant information filter

matthiaskoenig / brendapy

BRENDA parser in python

GNU Lesser General Public License v3.0

redundant information filter #30

Closed Karrenbelt closed 4 years ago

Karrenbelt commented 4 years ago

There is some redundant information stored in BRENDA (duplicate KM, KI, and kcat values). It would be nice to have the option to filter these, for which I suggest:

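from typing import Dict, List


def remove_duplicate_dicts_from_list(dicts: List[Dict]) -> List[Dict]:
    # Keep an entry only if no equal dict appears later in the list,
    # so the last occurrence of each duplicate survives.
    unique = []
    for i, d in enumerate(dicts):
        if d not in dicts[i + 1:]:
            unique.append(d)
    return unique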

Let me know what you think, or if I'm wrong about the redundant information.

matthiaskoenig commented 4 years ago

@Karrenbelt Thanks for reporting this. Could you provide a concrete example, i.e., an EC number and protein ID? This will make it much easier to implement and test this.

Karrenbelt commented 4 years ago

Certainly. The first one I run into when parsing all EC numbers is the KI data of EC 1.1.1.1 for Mus musculus. It contains 27 entries, of which only 9 are unique. I just tested how often this occurs: 5674 times in total when parsing all EC numbers for KI, KM, and TN, and interestingly, in every case the number of entries is a multiple of 3.

I had to zip the pickled object to be able to attach it: example.p.zip
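
A check like the one described above could be sketched as follows. This is only an illustration: the `redundancy_summary` helper and the record fields are hypothetical and do not reflect brendapy's actual data structures or the attached pickle.

```python
from typing import Dict, List


def redundancy_summary(entries: List[Dict]) -> Dict[str, int]:
    """Count total and unique dicts in a list of (possibly unhashable) dicts."""
    unique: List[Dict] = []
    for entry in entries:
        # Dict equality compares all keys and values, including nested ones.
        if entry not in unique:
            unique.append(entry)
    return {"total": len(entries), "unique": len(unique)}


# Hypothetical records; the field names are illustrative only.
records = [
    {"substrate": "A", "value": 1.0},
    {"substrate": "B", "value": 2.5},
] * 3  # every record appears three times

summary = redundancy_summary(records)
print(summary)  # {'total': 6, 'unique': 2}
```

If every entry is written exactly three times, `total` is three times `unique`, which matches the 27 versus 9 pattern reported for EC 1.1.1.1.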

matthiaskoenig commented 4 years ago

This is a bit more complicated to fix, because I have to ensure that the complete dictionary is unique. The underlying problem is that the BRENDA flat file contains the same KI entry three times.
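
One possible way to compare complete dictionaries, sketched here under assumptions, is to build a canonical key per entry and keep the first occurrence of each key. The `dedupe_dicts` name and the example fields are hypothetical, and this is not necessarily how the fix was implemented in brendapy.

```python
import json
from typing import Any, Dict, List


def dedupe_dicts(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop exact duplicate dicts, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for entry in entries:
        # Serialize with sorted keys so key order does not affect the comparison.
        # default=str is a fallback for values json cannot encode directly.
        key = json.dumps(entry, sort_keys=True, default=str)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Hypothetical entries; the second differs from the first only in key order.
entries = [
    {"value": 0.5, "refs": [1, 2]},
    {"refs": [1, 2], "value": 0.5},
    {"value": 0.5, "refs": [2]},
]
result = dedupe_dicts(entries)
print(len(result))  # 2
```

The JSON key makes the check linear in the number of entries, but it is stricter than dict equality in one respect: `1` and `1.0` serialize differently, so values should have consistent types before comparison.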

matthiaskoenig commented 4 years ago

This is now fixed and the underlying issue is reported to BRENDA. Functionality will be in the next release.