Closed hickst closed 8 years ago
Since the beginning we've imagined keeping track of synonymous grounding for protein families and complexes where available. We have some options for how to do this:
1) Add these in as "isa" or "hasSynonym" in relations.csv, e.g., BE:RAS hasSynonym PFAM:PF00071
2) Create a new file, e.g. equivalences.csv, with this information.
3) Add grounding synonyms as additional columns in entities.csv
As you've noticed, the grounding_map itself contains synonym information of this type, in a fairly redundant fashion (the synonym information contained in the columns will repeat over and over for different text strings). So refactoring the grounding_map and relations is a good idea.
@bgyori and I discussed this before and are leaning toward (1).
I would opt for having a folder with separate files for mapping from different sources (PFAM, NCIT, BEL) into Bioentities. I already have one such file which currently lives within INDRA but could be moved over to Bioentities: https://github.com/sorgerlab/indra/blob/master/indra/resources/bel_indra_map.tsv This is an example where the first column is an entry in a BEL name space and the second column is an entry in the BE name space. How about moving this over to Bioentities into a mappings folder and adding a few more such files for other sources?
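A minimal sketch of how one such mapping file could be read, assuming only what is described above: a tab-separated file whose first column is the source entry and whose second column is the BE entry. The function name `load_mapping` and the sample rows are hypothetical placeholders, not the contents of bel_indra_map.tsv.

```python
import csv
import io


def load_mapping(handle):
    """Read a two-column TSV: source-namespace entry -> BE entry."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank or malformed lines rather than failing on them.
        if len(row) < 2 or not row[0].strip():
            continue
        mapping[row[0].strip()] = row[1].strip()
    return mapping


# Placeholder rows for illustration only (assumption, not real file content).
sample = "Source Family A\tBE_FAMILY_A\nSource Family B\tBE_FAMILY_B\n"
source_to_be = load_mapping(io.StringIO(sample))
print(source_to_be["Source Family A"])
```

With one file per source in a mappings folder, the same loader could be applied to each file in turn and the results kept in a dict keyed by source name.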
Perhaps I should describe in more detail what I'm doing with the BE resources. I've written a custom program to extract the "target" entities from the relations file. The program then walks through the grounding_map file and adds "lexical" strings, i.e., the keys of entries that point to any BE entity extracted in the first step. The final result is two lookup tables (complexes and families) which map the "surface strings" to the canonical BE names.
So, for example, 'AKT' is extracted from the relations file as a family name. Then, from the grounding_map file, the program adds alternate lexical strings which map to AKT, such as 'Akt', 'AKT-Ser473', 'p-AKT', 'PKB', etc. The addition of the lexical synonyms to our lookup tables is crucial, since we will likely encounter these in reading.
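The two steps described above can be sketched roughly as follows. This is not the actual program; it assumes that each relations row has five columns (subject namespace, subject ID, relation, target namespace, target ID), that "isa" targets are families and "partof" targets are complexes, and that each grounding_map row is a text string followed by alternating namespace/ID columns. The function names and sample rows are illustrative.

```python
import csv
import io


def extract_targets(relations_handle):
    """Step 1: collect BE target entities from the relations file."""
    families, complexes = set(), set()
    for row in csv.reader(relations_handle):
        if len(row) != 5:
            continue
        _, _, relation, target_ns, target_id = (c.strip() for c in row)
        if target_ns != "BE":
            continue
        if relation == "isa":        # assumption: isa targets are families
            families.add(target_id)
        elif relation == "partof":   # assumption: partof targets are complexes
            complexes.add(target_id)
    return families, complexes


def build_lookups(grounding_handle, families, complexes):
    """Step 2: map surface strings to canonical BE names."""
    # Seed each table with the canonical names themselves.
    family_lookup = {name: name for name in families}
    complex_lookup = {name: name for name in complexes}
    for row in csv.reader(grounding_handle):
        if not row:
            continue
        text, refs = row[0], row[1:]
        # refs is assumed to alternate: namespace, ID, namespace, ID, ...
        for ns, ident in zip(refs[0::2], refs[1::2]):
            if ns.strip() != "BE":
                continue
            ident = ident.strip()
            if ident in families:
                family_lookup[text] = ident
            if ident in complexes:
                complex_lookup[text] = ident
    return family_lookup, complex_lookup


# Illustrative sample rows (assumed layout, not copied from the real files).
relations = "HGNC,AKT1,isa,BE,AKT\nHGNC,AKT2,isa,BE,AKT\n"
grounding = "Akt,BE,AKT\np-AKT,BE,AKT\nPKB,BE,AKT\nERK1,HGNC,MAPK3\n"

families, complexes = extract_targets(io.StringIO(relations))
family_lookup, complex_lookup = build_lookups(
    io.StringIO(grounding), families, complexes
)
print(sorted(family_lookup))
```

Entries grounded to a namespace other than BE (the ERK1 row in the sample) are left out of both tables.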
In #21 I added a new file that establishes the equivalence between BE entries and the PF/IP entries that @hickst sent. So with this in place, if REACH maps to, for instance, PF:PF00071, that will be equivalent to BE:RAS.
Good idea, but please see my comment on your PR #21 about namespace strings.
Okay, I merged #21 as is for now. So REACH can keep grounding to Pfam when the correct entry is available there and equivalences.csv can be used to map into Bioentities to find the unpacking of the family members.
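As a rough illustration of that lookup, the sketch below resolves a Pfam grounding to its BE entry and then lists the direct family members. It assumes equivalences.csv rows have three columns (namespace, ID, BE ID) and that relations rows have five columns with "isa" linking a member to its family; both layouts, the function names, and the sample rows are assumptions made for this example.

```python
import csv
import io


def load_equivalences(handle):
    """Map (namespace, ID) pairs to their equivalent BE ID."""
    equiv = {}
    for row in csv.reader(handle):
        if len(row) != 3:
            continue
        ns, ident, be_id = (c.strip() for c in row)
        equiv[(ns, ident)] = be_id
    return equiv


def family_members(relations_handle, be_id):
    """Return direct 'isa' children of a BE family (no recursion)."""
    members = []
    for row in csv.reader(relations_handle):
        if len(row) != 5:
            continue
        ns, ident, rel, target_ns, target_id = (c.strip() for c in row)
        if rel == "isa" and target_ns == "BE" and target_id == be_id:
            members.append((ns, ident))
    return members


# Illustrative sample rows (assumed layout).
equivalences = "PF,PF00071,RAS\n"
relations = "HGNC,KRAS,isa,BE,RAS\nHGNC,NRAS,isa,BE,RAS\nHGNC,HRAS,isa,BE,RAS\n"

equiv = load_equivalences(io.StringIO(equivalences))
be_id = equiv[("PF", "PF00071")]
members = family_members(io.StringIO(relations), be_id)
print(be_id, members)
```

Nested families would need a recursive walk over the relations; the sketch only returns one level.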
Hi Guys,
This is not really an issue but... I've created two KBs for Reach from the Bioentities files: a protein complexes KB and a protein families KB. I'm currently in the process of updating our Override KB, which has the highest priority. My first step is to remove any entries which can now be found in the two new BE KBs, as the existing override entries will "block" resolution to the BE KB entries.
But, I'm somewhat reluctant to remove Override family entries which already map to PFAM IDs, unless you (in your superior Biological wisdom) think they are incorrectly mapped, so I'd greatly appreciate your advice on the "clashing" entries below:
ACOX PF01756 http://pfam.xfam.org/family/ACOX
BMP PF02608 http://pfam.xfam.org/family/BMP
Cadherin PF00028 http://pfam.xfam.org/family/Cadherin
COX4 PF02936 http://pfam.xfam.org/family/COX4
COX6A PF02046 http://pfam.xfam.org/family/COX6a
COX6B PF02297 http://pfam.xfam.org/family/COX6b
COX7A PF02238 http://pfam.xfam.org/family/COX7a
COX7B PF05392 http://pfam.xfam.org/family/COX7b
COX8 PF02285 http://pfam.xfam.org/family/COX8
CRISP PF08562 http://pfam.xfam.org/family/CRISP
DDR PF08841 http://pfam.xfam.org/family/DDR
DVL PF08137 http://pfam.xfam.org/family/DVL
ETS PF00178 http://pfam.xfam.org/family/ETS
FGF PF00167 http://pfam.xfam.org/family/FGF
FLOT PF15975 http://pfam.xfam.org/family/FLOT
GATA PF00320 http://pfam.xfam.org/family/GATA
HSP90 PF00183 http://pfam.xfam.org/family/HSP90
IGFBP PF00219 http://pfam.xfam.org/family/IGFBP
IL1 PF00340 http://pfam.xfam.org/family/IL1
IRS PF02174 http://pfam.xfam.org/family/IRS
MAF PF02545 http://pfam.xfam.org/family/MAF
NOTCH PF00066 http://pfam.xfam.org/family/NOTCH
PKI PF02827 http://pfam.xfam.org/family/PKI
RAS PF00071 http://pfam.xfam.org/family/RAS
SAA PF00277 http://pfam.xfam.org/family/SAA
TGFB IPR015615 http://www.ebi.ac.uk/interpro/entry/IPR015615
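The pruning step described in this comment, with the "clashing" entries held back for review instead of being deleted, might look something like the sketch below. It assumes the Override KB can be treated as a dict from text string to grounding ID and that a clash is an override entry whose text is also a BE KB key and whose ID looks like a Pfam (PF plus five digits) or InterPro (IPR plus six digits) accession. The function name and the two EXAMPLE entries are hypothetical.

```python
import re

# Pfam accessions look like PF00071; InterPro accessions like IPR015615.
PFAM_OR_INTERPRO = re.compile(r"PF\d{5}|IPR\d{6}")


def prune_override(override, be_keys):
    """Split override entries into keep / removed / review buckets."""
    keep, removed, review = {}, {}, {}
    for text, grounding in override.items():
        if text not in be_keys:
            keep[text] = grounding      # no BE KB entry, so nothing is blocked
        elif PFAM_OR_INTERPRO.fullmatch(grounding):
            review[text] = grounding    # clash: needs a human decision
        else:
            removed[text] = grounding   # BE KB entry should win
    return keep, removed, review


# RAS and TGFB come from the list above; the EXAMPLE entries are made up.
override = {
    "RAS": "PF00071",
    "TGFB": "IPR015615",
    "EXAMPLE_A": "hypothetical-id-1",
    "EXAMPLE_B": "hypothetical-id-2",
}
be_keys = {"RAS", "TGFB", "EXAMPLE_A"}

keep, removed, review = prune_override(override, be_keys)
print(sorted(review))
```

Keeping the review bucket separate means the entries listed above stay in the Override KB until someone confirms whether the Pfam mapping or the BE entry is the better grounding.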