gyorilab / indra_db

A Database-based knowledge back-end built on and for INDRA. The INDRA Database is a service that can be set up by any user with their own content and knowledge access. Our implementation of the database is the back-end to many of our projects, providing a vast and detailed knowledge base derived from many resources.
GNU General Public License v3.0

Normalize names in sif dump #181

Closed: kkaris closed this 3 years ago

kkaris commented 3 years ago

This PR adds a check in the sif dump for entities that have the same grounding but different names. After the check, an attempt is made to normalize them using indra.ontology.bio.BioOntology.get_name().
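The check and normalization described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the rows are plain (namespace, id, name) tuples rather than the actual SIF dataframe, and `lookup_name` is a hypothetical stand-in for an ontology lookup such as get_name.

```python
from collections import defaultdict


def find_inconsistent_names(rows):
    """Return groundings (namespace, id) that occur with more than one name."""
    names = defaultdict(set)
    for ns, id_, name in rows:
        names[(ns, id_)].add(name)
    return {key: vals for key, vals in names.items() if len(vals) > 1}


def normalize_names(rows, lookup_name):
    """Rename entries whose grounding has inconsistent names.

    lookup_name is a hypothetical callable (ns, id) -> name or None; when it
    returns None, the original name is kept.
    """
    inconsistent = find_inconsistent_names(rows)
    out = []
    for ns, id_, name in rows:
        if (ns, id_) in inconsistent:
            name = lookup_name(ns, id_) or name
        out.append((ns, id_, name))
    return out


# Illustrative rows only, not taken from an actual dump
rows = [
    ('HGNC', '1097', 'BRAF'),
    ('HGNC', '1097', 'B-Raf'),
    ('HGNC', '6871', 'MAPK1'),
]
standard = {('HGNC', '1097'): 'BRAF'}
normalized = normalize_names(rows, lambda ns, id_: standard.get((ns, id_)))
```

Falling back to the original name when the lookup returns None keeps the step from silently dropping names for groundings the ontology does not know.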

pagreene commented 3 years ago

I will take a look at this tomorrow. :)

bgyori commented 3 years ago

Actually, I think a one-time renaming step upon SIF dump could be applied across the board, not just to collections of entries with inconsistent names. This would simplify the logic and ensure that the naming is up to date and coherent with respect to a particular ontology version.

kkaris commented 3 years ago

I'm gonna make another update @pagreene, so hold off on merging until the commit(s) is/are in.

kkaris commented 3 years ago

@pagreene, @bgyori I think we're good to go with the latest commits

bgyori commented 3 years ago

@kkaris I realized an important issue related to this, namely that the SIF dump doesn't re-canonicalize some IDs that are either (1) stored in a non-canonical form in the DB for practical purposes, or (2) invalid due to old outputs from some sources, produced before the corresponding input processors were fixed. This is a good example of a problematic row:

agA_ns                                    CHEBI
agA_id                                    17997
agA_name                             dinitrogen
agB_ns                                       GO
agB_id                                     8283
agB_name          cell population proliferation

Here, the canonical forms would be CHEBI:17997 for agA_id and GO:0008283 for agB_id. This re-canonicalization is important for downstream use in general, and it also matters for name normalization: get_name is currently called without canonicalization (see https://github.com/indralab/indra_db/pull/181/files#diff-1edba72f1d9103b3be40014bfe586dfc1197fa493422168ba98a1d587e0e771dR219-R229), so it will often return None.
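As a rough sketch of what re-canonicalizing these two cases could look like: the rule table and the `canonicalize` helper below are hypothetical, cover only the CHEBI and GO examples from the row above, and are not INDRA's own API.

```python
# Hypothetical rules for illustration: namespace -> (embedded prefix, zero-padded width)
EMBEDDED_PREFIX = {
    'CHEBI': ('CHEBI:', 0),  # no padding, prefix only
    'GO': ('GO:', 7),        # GO IDs are zero-padded to 7 digits
}


def canonicalize(db_ns, db_id):
    """Return db_id in canonical form for the namespaces listed above.

    IDs in other namespaces are returned unchanged (as strings).
    """
    db_id = str(db_id)
    rule = EMBEDDED_PREFIX.get(db_ns)
    if rule is None:
        return db_id
    prefix, width = rule
    # Strip an existing prefix so already-canonical IDs are left intact
    if db_id.startswith(prefix):
        db_id = db_id[len(prefix):]
    return prefix + db_id.zfill(width)


print(canonicalize('CHEBI', 17997))  # CHEBI:17997
print(canonicalize('GO', 8283))      # GO:0008283
```

Running this on each row's IDs before the name lookup would mean the lookup sees IDs in the form the ontology expects.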

To get a list of validity issues in the dataframe, we can do something like

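from indra.statements.validate import assert_valid_db_refs

# df is assumed to be the SIF dump dataframe; check both agents of each row
for idx, row in df.iterrows():
    for ag in ('agA', 'agB'):
        try:
            assert_valid_db_refs({row[f'{ag}_ns']: row[f'{ag}_id']})
        except Exception as e:
            # Report the row index, the agent and the validation error
            print(idx, ag, e)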

I'll start a branch with some code from elsewhere where I fixed some of these validity issues, then you could tie that into a follow-up SIF dump.