Open zkamvar opened 18 hours ago
Adding a relevant comment I made in response to a PR comment by @MKupperman
It seems casting to missing values in Python is changing or even removing the column datatype. This also happens in R when setting a column's content to all `NA`s, which changes the data type to logical (Boolean) by default. In R we can set an all-`NA` column to a specific data type, though, by using a typed version of `NA`; e.g., to conserve a character column we could assign `NA_character_` instead of `NA`. Is there something equivalent in Python?
Originally posted by @annakrystalli in https://github.com/reichlab/variant-nowcast-hub/pull/117#issuecomment-2427457366
Overall I think trying to cast the column data type in Python before writing, if possible, would be preferable. I haven't tested it, but while handling it in `hubValidations` would be possible, it may well cause problems when opening and especially filtering on such columns through `hubData`.
I think this is possible from my read of these docs, but I'm not Python-versed enough to be sure: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values
That was my previous thought, using `pd.NA` tokens instead of "NA" (since the column schema suggests it should be "NA").
I worked on this for a bit, and found that if you cast the `pd.NA` values to a string type, it correctly preserves the NA characters that the R check is expecting. It's a 1-liner:
df["output_type_id"] = df["output_type_id"].astype("string")
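For future readers, here is a minimal sketch of that cast on a toy data frame. The `output_type` and `value` columns and their contents are made up for illustration; only `output_type_id` and the `astype("string")` call come from this thread. The CSV step uses pandas' `na_rep` argument and is an assumption about how one might want missing values spelled in text output, not something verified against the hub checks.

```python
import io

import pandas as pd

# Hypothetical model-output rows; only output_type_id comes from the thread.
df = pd.DataFrame(
    {
        "output_type": ["mean", "mean"],
        "output_type_id": [pd.NA, pd.NA],
        "value": [0.25, 0.75],
    }
)

dtype_before = df["output_type_id"].dtype

# The one-liner from the comment above: cast to pandas' nullable string dtype.
df["output_type_id"] = df["output_type_id"].astype("string")
dtype_after = df["output_type_id"].dtype

print(dtype_before, "->", dtype_after)

# The values are still missing; only the column's declared type changed.
all_missing = bool(df["output_type_id"].isna().all())

# When writing CSV, na_rep controls the text used for missing values.
buf = io.StringIO()
df.to_csv(buf, index=False, na_rep="NA")
csv_text = buf.getvalue()
print(csv_text)
```

The cast does not change any values; it only gives the all-missing column an explicit string type, which is the closest analogue here to using `NA_character_` in R.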
A note in the documentation would be helpful for future reference if the resolution on this issue is a won't fix.
Thank you for the investigation @MKupperman! I had tried something similar with `astype("str")` and got bupkis (but not an error 🫤). I'm glad to know this works!
> A note in the documentation would be helpful for future reference if the resolution on this issue is a won't fix.
The documentation needs an overhaul on that section and the next iteration of the schema will likely fix that by clarifying that the `output_type_id` for point estimates is a missing/None/null type: https://github.com/hubverse-org/schemas/issues/109
This is great! Thanks @MKupperman for the investigation!
@zkamvar we should also document the importance of retaining the required column data type too with tips on how to do so in different languages.
For completeness I think we should explore all the options available to Python users for recording missing values, e.g. `NaN` and `None`, and also check what the output looks like when writing:
- `pd.NA`s to csv from Python
- `NaN`s to parquet
- `NaN`s to csv
- `None`s to parquet
- `None`s to csv

For complete downstream sanity: `hubData::connect_hub()` or `hubData::connect_model_output()`

It seems the experimental `pd.NA` is the closest to R's `NA`s and is likely what we want to promote, but it would be good to get our heads round what else works/doesn't work/folks might use and get ahead of the game in our documentation.
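As a starting point for that exploration, the sketch below builds one all-missing column per marker and records the dtype pandas infers and what happens after casting to the nullable string dtype. It only covers the in-memory step; writing to parquet or csv and reading back through hubData is not shown and would still need to be checked. The labels and variable names are illustrative.

```python
import pandas as pd

# Three all-missing columns, one per candidate missing-value marker.
candidates = {
    "pd.NA": pd.Series([pd.NA, pd.NA]),
    "NaN": pd.Series([float("nan"), float("nan")]),
    "None": pd.Series([None, None]),
}

inferred = {}  # dtype pandas picks when nothing is specified
casted = {}    # the same column after an explicit cast to "string"

for label, col in candidates.items():
    inferred[label] = str(col.dtype)
    casted[label] = col.astype("string")
    print(
        label,
        "| inferred:", inferred[label],
        "| after cast:", casted[label].dtype,
        "| all missing:", bool(casted[label].isna().all()),
    )
```

None of the three markers carries a string type on its own, so an explicit cast before writing looks like the safer habit to document, whichever marker people use.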
Overall I'm going to rename and move this issue to `hubDocs`, which seems the more appropriate location for it now.
As mentioned in https://github.com/reichlab/variant-nowcast-hub/pull/116#issuecomment-2427130801:
This might be addressed partially in https://github.com/hubverse-org/schemas/issues/109, but I wonder if it's possible to catch `vctrs_unspecified` columns and convert them to characters, since we know those are always going to be missing values.