gibsramen / qupid

Case-control matching for microbiome data
BSD 3-Clause "New" or "Revised" License

Improve debug messaging for errors caused by type mismatch #16

Closed lucaspatel closed 2 years ago

lucaspatel commented 2 years ago

I recently ran into an issue with Qupid that stumped me for half an hour. I ran Qupid on data from Qiita as follows:

example_data = prepare_data("data/analyses/wisc_meta/153241_metadata.tsv")

from qupid import match_by_multiple

nor_str = "Normal"
ad_str = "Dementia-AD"

# pairs for no rarefaction
background = example_data.query("diagnosis == @nor_str")
focus = example_data.query("diagnosis == @ad_str")

example_data = match_by_multiple(
    focus=focus,
    background=background,
    categories=["sex", "mars_age"],
    tolerance_map={"mars_age": 4.5}
)

Qupid raised the following error:

---------------------------------------------------------------------------
NoMatchesError                            Traceback (most recent call last)
/Users/lucas/Documents/knight-rotation/LNP_02_U19_Prepare_SPSS.ipynb Cell 12 in <cell line: 14>()
     11 background = example_data.query("diagnosis == @nor_str")
     12 focus = example_data.query("diagnosis == @ad_str")
---> 14 example_data = match_by_multiple(
     15     focus=focus,
     16     background=background,
     17     categories=["sex", "mars_age"],
     18     tolerance_map={"mars_age": 4.5}
     19 )

File ~/Downloads/miniconda3-intel/envs/qiime2-2022.2/lib/python3.8/site-packages/qupid/qupid.py:107, in match_by_multiple(focus, background, categories, tolerance_map, on_failure)
    105 for cat in categories:
    106     tol = tolerance_map.get(cat, 1e-08)
--> 107     observed = match_by_single(focus[cat], background[cat],
    108                                tol, on_failure).case_control_map
    109     for fidx, fhits in observed.items():
    110         # Reduce the matches with successive categories
    111         matches[fidx] = matches[fidx] & fhits

File ~/Downloads/miniconda3-intel/envs/qiime2-2022.2/lib/python3.8/site-packages/qupid/qupid.py:55, in match_by_single(focus, background, tolerance, on_failure)
     53 else:
     54     if on_failure == "raise":
---> 55         raise exc.NoMatchesError(f_idx)
     56     else:
     57         matches[f_idx] = set()

NoMatchesError: No valid matches found for sample 13663.mars00006.

After working at it for a bit, I realized the problem was that the input focus and background data frames contained a few non-numeric mars_age values. I fixed this by running the following line before the Qupid lines:

example_data.loc[(example_data["mars_age"] == '>90'),'mars_age'] = "90"

Even though this removes the non-numeric values, the error persists (with identical output) until the dtype of the dataframe column is converted from object to numeric:

example_data["mars_age"] = pd.to_numeric(example_data["mars_age"])

After including this line, Qupid works as expected.
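To illustrate why the string replacement alone was not enough, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the Qiita metadata. It assumes only pandas: the column stays non-numeric after the replacement and becomes numeric only after pd.to_numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative stand-in for the metadata column (values are hypothetical).
df = pd.DataFrame({"mars_age": pd.Series(["72", ">90", "85"], dtype=object)})

# Locate entries that cannot be parsed as numbers.
unparsable = pd.to_numeric(df["mars_age"], errors="coerce").isna()
bad_values = df.loc[unparsable, "mars_age"].tolist()

# Replacing the offending string does not change the column's dtype.
df.loc[df["mars_age"] == ">90", "mars_age"] = "90"
numeric_after_replace = is_numeric_dtype(df["mars_age"])

# Only an explicit conversion makes the column numeric.
df["mars_age"] = pd.to_numeric(df["mars_age"])
numeric_after_convert = is_numeric_dtype(df["mars_age"])
```

The errors="coerce" step is just a convenient way to list the problematic entries before deciding how to recode them.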

The error message does not reflect the root cause of the issue: the dtype of the dataframe column is invalid for tolerance-based matching. It would be preferable if Qupid could catch such type incompatibilities and report them to the user in a more meaningful way.
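A minimal sketch of such a check might look like the following. The function name check_tolerance_columns is hypothetical and is not part of Qupid's API. It assumes that every category listed in tolerance_map is expected to be numeric, and it could be run before calling match_by_multiple.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_tolerance_columns(focus, background, tolerance_map):
    """Hypothetical pre-check (not Qupid's API): every column that has a
    tolerance must have a numeric dtype in both data frames."""
    for name, frame in (("focus", focus), ("background", background)):
        for column in tolerance_map:
            if not is_numeric_dtype(frame[column]):
                raise TypeError(
                    f"Column '{column}' in {name} has dtype "
                    f"'{frame[column].dtype}' but a tolerance was given; "
                    "convert it to a numeric dtype first, "
                    "e.g. with pd.to_numeric."
                )


# Usage on toy data (values are illustrative only).
focus = pd.DataFrame({"mars_age": pd.Series(["90", "71"], dtype=object)})
background = pd.DataFrame({"mars_age": [88.0, 70.0]})

try:
    check_tolerance_columns(focus, background, {"mars_age": 4.5})
    message = None
except TypeError as err:
    message = str(err)
```

Raising before any matching is attempted would point directly at the column and its dtype, rather than surfacing later as a NoMatchesError for a single sample.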