mapping-commons / sssom-py

Python toolkit for SSSOM mapping format
https://mapping-commons.github.io/sssom-py/index.html#
MIT License

Mapping set inversion fails if the set contains a `subject_*` column but no corresponding `object_*` column #554

Open gouttegd opened 3 hours ago

gouttegd commented 3 hours ago

Trying to sssom invert a mapping set that contains a subject_label column but no object_label column yields an error, because sssom-py expects to find a subject_label column in the inverted set, even though the inverted set will (logically) only contain an object_label column.

Example: given the following minimal set:

#curie_map:
#  COMENT: https://example.com/entities/
#  ORGENT: https://example.org/entities/
#mapping_set_id: https://example.org/sets/exo2c
subject_id      subject_label  predicate_id    object_id       mapping_justification
ORGENT:0001     alice      skos:closeMatch COMENT:0011     semapv:ManualMappingCuration

the following command:

sssom-dev invert sample.sssom.tsv -o inverted.sssom.tsv

yields the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/damien/.cache/pypoetry/virtualenvs/sssom-tFe7wmy3-py3.9/lib64/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/damien/.cache/pypoetry/virtualenvs/sssom-tFe7wmy3-py3.9/lib64/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/damien/.cache/pypoetry/virtualenvs/sssom-tFe7wmy3-py3.9/lib64/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/damien/.cache/pypoetry/virtualenvs/sssom-tFe7wmy3-py3.9/lib64/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/damien/.cache/pypoetry/virtualenvs/sssom-tFe7wmy3-py3.9/lib64/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/damien/dvlpt/semweb/python/sssom-py/src/sssom/cli.py", line 769, in invert
    msdf.df = invert_mappings(
  File "/home/damien/dvlpt/semweb/python/sssom-py/src/sssom/util.py", line 1467, in invert_mappings
    inverted_df = inverted_df[df.columns]
  File "/home/damien/.cache/pypoetry/virtualenvs/sssom-tFe7wmy3-py3.9/lib64/python3.9/site-packages/pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/damien/.cache/pypoetry/virtualenvs/sssom-tFe7wmy3-py3.9/lib64/python3.9/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/damien/.cache/pypoetry/virtualenvs/sssom-tFe7wmy3-py3.9/lib64/python3.9/site-packages/pandas/core/indexes/base.py", line 5941, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: "['subject_label'] not in index"

More generally, it seems that the error is triggered by any subject_* column that does not have its object_* counterpart in the set. For example, replacing subject_label by subject_source in the example above will yield exactly the same error trace, with a KeyError: "['subject_source'] not in index" message.

This issue affects the conversion to OWL as well (sssom convert -O owl), because that conversion involves at some point the inversion of the mapping set to convert.

Issue originally found in https://github.com/monarch-initiative/omim/issues/114. Reproduced with the latest code from the master branch of sssom-py.

gouttegd commented 2 hours ago

I believe the problem lies here:

inverted_df = df_to_invert.rename(
    columns=_invert_column_names(list_of_subject_object_columns, columns_invert_map)
)
inverted_df = inverted_df[df.columns]

As I understand it, the last line in that fragment is intended to reorder the columns in the inverted_df, so that they are in the same order as in the original df_to_invert despite their renaming. That is, the renaming turned, for example, subject_id into object_id and the other way around, but the columns are still in their original positions, so the reordering is necessary to ensure the renamed columns are at their expected positions (e.g., the new subject_id should be the first column).

But that reordering assumes that the inverted data frame will always contain the same columns as the original data frame. This is an unwarranted assumption. It won’t be the case if the set contains a subject_* column that does not have an object_* counterpart (which is perfectly valid in SSSOM, except for subject_id and object_id, which must both always be present).
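The failure can be reproduced with pandas alone, without going through sssom-py. In this minimal sketch, swap_subject_object is a hypothetical stand-in for the renaming that _invert_column_names produces (the predicate inversion is left out, since it plays no role in the error):

```python
import pandas as pd

# Same columns as the sample set: subject_label is present, object_label is not.
df = pd.DataFrame(
    {
        "subject_id": ["ORGENT:0001"],
        "subject_label": ["alice"],
        "predicate_id": ["skos:closeMatch"],
        "object_id": ["COMENT:0011"],
        "mapping_justification": ["semapv:ManualMappingCuration"],
    }
)


def swap_subject_object(name: str) -> str:
    """Hypothetical helper: swap the subject_/object_ prefix of a column name."""
    if name.startswith("subject_"):
        return "object_" + name[len("subject_"):]
    if name.startswith("object_"):
        return "subject_" + name[len("object_"):]
    return name


# After renaming, the frame has object_label but no subject_label.
inverted_df = df.rename(columns=swap_subject_object)

error = None
try:
    # Same reordering step as in invert_mappings: select the original column names.
    inverted_df = inverted_df[df.columns]
except KeyError as exc:
    error = exc

print(list(inverted_df.columns))
print(error)
```

The selection fails because subject_label is in df.columns but no longer exists in the renamed frame.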

Suggested fix:

-    inverted_df = inverted_df[df.columns]
+    inverted_df = sort_df_rows_columns(inverted_df, by_rows=False)

so that reordering is performed by the appropriate function.
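To illustrate why ordering against a fixed column list avoids the problem, here is a pandas-only sketch. It is not the implementation of sort_df_rows_columns; ILLUSTRATIVE_ORDER and reorder_columns are names made up for this example, and the order list is only a small, assumed excerpt of what a schema-defined order could look like:

```python
import pandas as pd

# Assumed, illustrative column order (not taken from the SSSOM schema itself).
ILLUSTRATIVE_ORDER = [
    "subject_id",
    "subject_label",
    "predicate_id",
    "object_id",
    "object_label",
    "mapping_justification",
]


def reorder_columns(df: pd.DataFrame, order=ILLUSTRATIVE_ORDER) -> pd.DataFrame:
    """Hypothetical helper: order columns by a fixed list, skipping absent ones."""
    known = [col for col in order if col in df.columns]
    # Columns not mentioned in the list keep their relative order at the end.
    unknown = [col for col in df.columns if col not in order]
    return df[known + unknown]


# Column layout of the sample set after the subject/object renaming.
inverted_df = pd.DataFrame(
    {
        "object_id": ["ORGENT:0001"],
        "object_label": ["alice"],
        "predicate_id": ["skos:closeMatch"],
        "subject_id": ["COMENT:0011"],
        "mapping_justification": ["semapv:ManualMappingCuration"],
    }
)

reordered = reorder_columns(inverted_df)
print(list(reordered.columns))
```

Because only columns actually present in the frame are selected, a missing subject_label cannot raise a KeyError, and the new subject_id still ends up first.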