mapping-commons / sssom-py

Python toolkit for SSSOM mapping format
https://mapping-commons.github.io/sssom-py/index.html#
MIT License

Make `filter_redundant_rows` resistant to the case that there is an empty `confidence` column #546

Closed matentzn closed 1 month ago

matentzn commented 1 month ago

Overview

https://github.com/mapping-commons/sssom-py/blob/550206721911f711ee678eb1a8da50591649bd04/src/sssom/util.py#L429

We had the problem that this was failing:

https://github.com/mapping-commons/sssom-py/blob/550206721911f711ee678eb1a8da50591649bd04/src/sssom/util.py#L449

with AttributeError: 'Series' object has no attribute 'iterrows'

Log / traceback

```
python ../scripts/lexmatch-sssom-compare.py extract_unmapped_matches doid gard icd10cm icd10who icd11foundation ncit omim ordo \
--matches ../mappings/mondo-sources-all-lexical.sssom.tsv \
--output-dir lexmatch \
--summary lexmatch/README.md \
--exclusion reports/doid_term_exclusions.txt --exclusion reports/gard_term_exclusions.txt --exclusion reports/icd10cm_term_exclusions.txt --exclusion reports/icd10who_term_exclusions.txt --exclusion reports/icd11foundation_term_exclusions.txt --exclusion reports/ncit_term_exclusions.txt --exclusion reports/omim_term_exclusions.txt --exclusion reports/ordo_term_exclusions.txt
/usr/local/lib/python3.10/dist-packages/sssom/parsers.py:428: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' instead, to perform the operation inplace on the original object.
  df2[CONFIDENCE].replace(r"^\s*$", np.NaN, regex=True, inplace=True)
/usr/local/lib/python3.10/dist-packages/sssom/util.py:168: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df.replace("", np.nan, inplace=True)
/usr/local/lib/python3.10/dist-packages/sssom/util.py:168: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df.replace("", np.nan, inplace=True)
/usr/local/lib/python3.10/dist-packages/sssom/util.py:447: FutureWarning: The provided callable is currently using np.maximum.reduce. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string np.maximum.reduce instead.
  dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()
/usr/local/lib/python3.10/dist-packages/sssom/util.py:447: FutureWarning: The provided callable is currently using np.maximum.reduce. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string np.maximum.reduce instead.
  dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()
/usr/local/lib/python3.10/dist-packages/sssom/util.py:447: FutureWarning: The provided callable is currently using np.maximum.reduce. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string np.maximum.reduce instead.
  dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()
/usr/local/lib/python3.10/dist-packages/sssom/util.py:447: FutureWarning: The provided callable is currently using np.maximum.reduce. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string np.maximum.reduce instead.
  dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()
/usr/local/lib/python3.10/dist-packages/sssom/util.py:447: FutureWarning: The provided callable is currently using np.maximum.reduce. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string np.maximum.reduce instead.
  dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()
/usr/local/lib/python3.10/dist-packages/sssom/util.py:447: FutureWarning: The provided callable is currently using np.maximum.reduce. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string np.maximum.reduce instead.
  dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()
Traceback (most recent call last):
  File "/work/src/ontology/../scripts/lexmatch-sssom-compare.py", line 403, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/work/src/ontology/../scripts/lexmatch-sssom-compare.py", line 190, in extract_unmapped_matches
    unmapped_ont_df = get_unmapped_df(
  File "/work/src/ontology/../scripts/lexmatch-sssom-compare.py", line 299, in get_unmapped_df
    filtered_new_df = filter_redundant_rows(new_df)
  File "/usr/local/lib/python3.10/dist-packages/sssom/util.py", line 449, in filter_redundant_rows
    for _, row in dfmax.iterrows():
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py", line 6299, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'iterrows'
make[1]: *** [mondo-ingest.Makefile:447: lexmatch/README.md] Error 1
rm imports/ro_terms_combined.txt
make[1]: Leaving directory '/work/src/ontology'
make: *** [mondo-ingest.Makefile:333: build-mondo-ingest] Error 2
Command exited with non-zero status 2
```

I patched this case in https://github.com/monarch-initiative/mondo-ingest/pull/581, basically by adding some dummy confidence values to the data frame.

To restate the failing case: the data frame had a `confidence` column, but it contained no "legal" float values.
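For illustration only, here is a minimal sketch of that input shape. It assumes blank confidence strings are coerced to NaN and that rows without a usable confidence are set aside before grouping; the helper name `rows_with_valid_confidence` and the sample values are hypothetical and not part of sssom-py.

```python
import pandas as pd


def rows_with_valid_confidence(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only rows whose confidence parses as a number."""
    # Blank or non-numeric entries become NaN instead of raising.
    confidence = pd.to_numeric(df["confidence"], errors="coerce")
    return df[confidence.notna()]


# A frame that has a confidence column, but no usable values in it.
mappings = pd.DataFrame(
    {
        "subject_id": ["A:1", "A:2"],
        "predicate_id": ["skos:exactMatch", "skos:exactMatch"],
        "object_id": ["B:1", "B:2"],
        "confidence": ["", ""],
    }
)

scored = rows_with_valid_confidence(mappings)
print(len(mappings), len(scored))
```

Under these assumptions every row is dropped, so whatever groups by key afterwards receives an empty frame even though the column itself exists.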

Action items

joeflack4 commented 1 month ago

Possible solutions

https://github.com/mapping-commons/sssom-py/blob/550206721911f711ee678eb1a8da50591649bd04/src/sssom/util.py#L441

After the above line, we could do one of the following.

a. Raise an error

if len(df) == 0:
    raise RuntimeError('No confidence values were found in dataframe. Cannot process.')

b. Return an empty dataframe

if len(df) == 0:
    return pd.DataFrame()

This is what I often do in my own code, but it requires downstream code to check for and handle the empty result, and I doubt that all 9 usages of `filter_redundant_rows()` already happen to do so.
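Options (a) and (b) can be sketched side by side in one hypothetical helper. This is not the sssom-py implementation: the function name `max_confidence_per_key`, the `on_empty` parameter, and the column names are assumptions for illustration. For option (b) the sketch returns an empty frame that keeps the expected columns, which is a variation on the bare `pd.DataFrame()` above.

```python
import pandas as pd

# Assumed column names, for illustration only.
KEY = ["subject_id", "predicate_id", "object_id"]


def max_confidence_per_key(df: pd.DataFrame, on_empty: str = "raise") -> pd.DataFrame:
    """Hypothetical helper: highest confidence per key, with an explicit policy
    for the case that no row carries a confidence value."""
    scored = df.dropna(subset=["confidence"])
    if scored.empty:
        if on_empty == "raise":
            # Option (a): fail loudly so the caller notices.
            raise RuntimeError(
                "No confidence values were found in dataframe. Cannot process."
            )
        # Option (b): return an empty frame. Keeping the columns (a choice of
        # this sketch) lets callers still iterate or select columns.
        return pd.DataFrame(columns=[*KEY, "confidence"])
    # Selecting the column and calling .max() with as_index=False yields a
    # DataFrame with one row per key.
    return scored.groupby(KEY, as_index=False)["confidence"].max()
```

With `on_empty="empty"`, a loop such as `for _, row in result.iterrows()` simply runs zero times, but each caller still has to decide whether silently processing nothing is acceptable.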