Closed kkaris closed 3 years ago
I will take a look at this tomorrow. :)
Actually, I think a one-time renaming step upon SIF dump could be done across the board, not just for collections of entries with inconsistent names. This would simplify the logic and ensure that the naming is up to date and coherent with respect to a particular ontology version.
I'm gonna make another update @pagreene, so hold off on merging until the commit(s) is/are in.
@pagreene, @bgyori I think we're good to go with the latest commits
@kkaris I realized an important issue related to this, namely that the SIF dump doesn't re-canonicalize IDs that are (1) stored in a non-canonical form in the DB for practical purposes, or (2) invalid due to old outputs from some sources, produced before the corresponding input processors were fixed. This is a good example of a problematic row:
agA_ns CHEBI
agA_id 17997
agA_name dinitrogen
agB_ns GO
agB_id 8283
agB_name cell population proliferation
Here, agA_id would have to be CHEBI:17997 and agB_id would have to be GO:0008283 in canonical form. This re-canonicalization is important for downstream use in general, and it also matters for name normalization: get_name is currently called without canonicalization (see https://github.com/indralab/indra_db/pull/181/files#diff-1edba72f1d9103b3be40014bfe586dfc1197fa493422168ba98a1d587e0e771dR219-R229), so it will often return None.
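To make the two cases above concrete, here is a minimal sketch of what re-canonicalization could look like for just CHEBI and GO. The function name `canonicalize_id` is hypothetical and not part of INDRA or indra_db; it only assumes that CHEBI IDs carry a `CHEBI:` prefix and that GO IDs carry a `GO:` prefix followed by seven zero-padded digits. A real fix would presumably reuse whatever INDRA already provides for this.

```python
def canonicalize_id(db_ns, db_id):
    """Return db_id in canonical form for a couple of namespaces.

    Hypothetical helper for illustration only; it covers just the two
    namespaces from the example row and passes everything else through.
    """
    db_id = str(db_id)
    if db_ns == 'CHEBI':
        # Canonical CHEBI IDs embed the namespace prefix
        return db_id if db_id.startswith('CHEBI:') else f'CHEBI:{db_id}'
    if db_ns == 'GO':
        # Canonical GO IDs are GO: followed by seven zero-padded digits
        local_id = db_id[3:] if db_id.startswith('GO:') else db_id
        return f'GO:{local_id.zfill(7)}'
    # Other namespaces are returned unchanged in this sketch
    return db_id


print(canonicalize_id('CHEBI', '17997'))
print(canonicalize_id('GO', '8283'))
```

The function is idempotent, so it would be safe to apply to rows that are already in canonical form before calling get_name.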
To get a list of validity issues in the dataframe, we can do something like
from indra.statements.validate import assert_valid_db_refs
for idx, row in df.iterrows():
    try:
        assert_valid_db_refs({row['agA_ns']: row['agA_id']})
    except Exception as e:
        print(idx, e)
I'll start a branch with some code from elsewhere where I fixed some of these validity issues, then you could tie that into a follow-up SIF dump.
This PR adds a check in the SIF dump for entities that have the same grounding but different names. After the check, an attempt is made to normalize them using indra.ontology.bio.BioOntology.get_name().
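For readers following along, the check and the normalization step can be sketched roughly as below. This is not the code in the PR: `find_inconsistent_names`, `normalize_names` and the `lookup_name` callable are hypothetical names, rows are plain dicts with the agA_*/agB_* columns shown earlier in the thread, and `lookup_name(ns, id)` is a stand-in for an ontology name lookup that may return None. The sample rows are made up for illustration.

```python
from collections import defaultdict


def find_inconsistent_names(rows):
    """Return {(ns, id): set of names} for groundings with more than one name."""
    names = defaultdict(set)
    for row in rows:
        for side in ('agA', 'agB'):
            key = (row[f'{side}_ns'], row[f'{side}_id'])
            names[key].add(row[f'{side}_name'])
    return {key: vals for key, vals in names.items() if len(vals) > 1}


def normalize_names(rows, lookup_name):
    """Rename agents in place; keep the old name if the lookup returns None."""
    for row in rows:
        for side in ('agA', 'agB'):
            standard = lookup_name(row[f'{side}_ns'], row[f'{side}_id'])
            if standard is not None:
                row[f'{side}_name'] = standard
    return rows


# Made-up rows: the same GO grounding appears under two different names
rows = [
    {'agA_ns': 'CHEBI', 'agA_id': 'CHEBI:17997', 'agA_name': 'dinitrogen',
     'agB_ns': 'GO', 'agB_id': 'GO:0008283',
     'agB_name': 'cell population proliferation'},
    {'agA_ns': 'CHEBI', 'agA_id': 'CHEBI:17997', 'agA_name': 'dinitrogen',
     'agB_ns': 'GO', 'agB_id': 'GO:0008283', 'agB_name': 'cell proliferation'},
]

# Stand-in name table playing the role of the ontology lookup
standard_names = {('GO', 'GO:0008283'): 'cell population proliferation'}
print(find_inconsistent_names(rows))
normalize_names(rows, lambda ns, id_: standard_names.get((ns, id_)))
print(find_inconsistent_names(rows))
```

Calling `normalize_names` on every row, rather than only on the groundings reported by `find_inconsistent_names`, corresponds to the across-the-board renaming suggested earlier in the thread.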