hubverse-org / hubDocs

https://hubverse.io

Test and document how to produce typed NA/missing values in Python. #198

Open zkamvar opened 18 hours ago

zkamvar commented 18 hours ago

As mentioned in https://github.com/reichlab/variant-nowcast-hub/pull/116#issuecomment-2427130801:

> If we coerce to `pd.NA` (or `None`), the corresponding dtype that the validation tool receives is `vctrs_unspecified` rather than `chr`. Similarly, using `np.nan` to encode missing values gives a "double" (numeric) data type instead of character. Seems like an issue to fix on the backend.

This might be addressed partially in https://github.com/hubverse-org/schemas/issues/109, but I wonder if it's possible to catch vctrs_unspecified and convert them to characters since we know those are going to always be missing values.
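
For illustration, here is a minimal sketch (not hub-specific code) of how a fully missing column gets typed depending on which missing-value token is used; the pyarrow schema check is just one way to see the type that a written file, and hence the R-side validation, would carry:

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# Two fully missing columns, encoded two different ways.
df = pd.DataFrame({
    "with_none": [None, None],     # pandas infers "object"
    "with_nan": [np.nan, np.nan],  # pandas infers "float64"
})
print(df.dtypes)

# An all-None object column becomes an Arrow "null" column (which shows up
# on the R side as vctrs_unspecified), while np.nan becomes "double".
print(pa.Table.from_pandas(df).schema)
```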

annakrystalli commented 18 hours ago

Adding a relevant comment I made in response to a PR comment by @MKupperman:

> It seems casting to missing values in Python is changing or even removing the column data type. This also happens in R when setting a column's content to all `NA`s, which changes the data type to logical (Boolean) by default. In R we can set an all-`NA` column to a specific data type, though, by using a typed version of `NA`; e.g., to preserve a character column we could assign `NA_character_` instead of `NA`. Is there something equivalent in Python?

Originally posted by @annakrystalli in https://github.com/reichlab/variant-nowcast-hub/pull/117#issuecomment-2427457366

Overall, I think casting the column to the correct data type in Python before writing, if possible, would be preferable. I haven't tested it, but while handling this in hubValidations would be possible, it may well cause problems when opening, and especially when filtering on, such columns through hubData.

annakrystalli commented 18 hours ago

I think this is possible from my read of these docs, but I'm not versed enough in Python to be sure: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values
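
If those docs are right, the closest analogue to R's `NA_character_` would be `pd.NA` in a column with the nullable "string" dtype; a minimal, untested sketch:

```python
import pandas as pd

# An all-missing column that is still typed as string (nullable "string"
# dtype) -- roughly the pandas analogue of R's NA_character_.
s = pd.Series([pd.NA, pd.NA], dtype="string")
print(s.dtype)   # string
print(s.isna())  # both values are missing
```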

MKupperman commented 17 hours ago

That was my previous thought: using `pd.NA` tokens instead of the literal string "NA" (since the column schema suggests it should be "NA").

I worked on this for a bit and found that if you cast the `pd.NA` values to the string dtype, it correctly preserves the NAs that the R check is expecting. It's a one-liner:

df["output_type_id"] = df["output_type_id"].astype("string")

A note in the documentation would be helpful for future reference if the resolution on this issue is a won't fix.
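
For context, a sketch of how that cast might slot into a submission script before writing the file; the column values and file name below are made up for illustration:

```python
import pandas as pd

# Toy model-output frame: the point estimate row has no output_type_id.
df = pd.DataFrame({
    "output_type": ["mean", "quantile"],
    "output_type_id": [pd.NA, "0.5"],
    "value": [10.0, 9.5],
})

# Without the cast the column is inferred as "object"; casting to the
# nullable string dtype keeps it a string column with true missing values.
df["output_type_id"] = df["output_type_id"].astype("string")
print(df.dtypes)

# df.to_parquet("2024-10-23-team-model.parquet")  # requires pyarrow
```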

zkamvar commented 16 hours ago

Thank you for the investigation @MKupperman! I had tried something similar with astype("str") and got bupkis (but not an error 🫤). I'm glad to know this works!

> A note in the documentation would be helpful for future reference if the resolution on this issue is a won't fix.

That section of the documentation needs an overhaul, and the next iteration of the schema will likely address this by clarifying that the `output_type_id` for point estimates is a missing/None/null type: https://github.com/hubverse-org/schemas/issues/109

annakrystalli commented 5 hours ago

This is great! Thanks @MKupperman for the investigation!

@zkamvar we should also document the importance of retaining the required column data type, with tips on how to do so in different languages.

For completeness, I think we should explore all the options available to Python users for recording missing values (e.g. `NaN` and `None`) and also check what the output looks like in each case, for complete downstream sanity (a quick comparison is sketched below).
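
As a starting point, a quick comparison (untested against the hub tooling) of the dtype pandas infers for a fully missing column under each option:

```python
import numpy as np
import pandas as pd

# Dtype pandas infers for a fully missing column, per missing-value token.
options = {
    "np.nan": [np.nan, np.nan],
    "None": [None, None],
    "pd.NA": [pd.NA, pd.NA],
}
for name, values in options.items():
    print(f"{name:7s} -> {pd.Series(values).dtype}")
# np.nan  -> float64
# None    -> object
# pd.NA   -> object

# Only an explicit nullable dtype keeps the intended column type:
print(pd.Series([pd.NA, pd.NA], dtype="string").dtype)  # string
```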

It seems the experimental `pd.NA` is the closest to R's `NA` and likely what we want to promote, but it would be good to get our heads around what else works, what doesn't, and what folks might use, and to get ahead of the game in our documentation.

Overall, I'm going to rename this issue and move it to hubDocs, which seems the more appropriate location for it now.