Better inferencing of missing metadata fields.

jmmitc06 commented 8 months ago

Describe the bug

Missing metadata fields are cast to np.nan (or whatever pandas infers) when the metadata CSV file contains missing values. This can be unexpected, since a field may be expected to be string-like rather than numerical. The fundamental difficulty is deciding what a missing value should be cast to: an empty string is appropriate in some cases, while NaN is more appropriate for numerical fields. When a field is missing for every sample, however, there is no way to infer its type, and so no way to decide whether '' or np.nan is appropriate.

Furthermore, are missing values in a metadata table an issue that warrants a warning? Literally every sequence file in the lab has missing fields, since those fields are not used while running samples.

To Reproduce

Steps to reproduce the behavior:

Provide a CSV file as input with at least one missing value for a field.
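A minimal sketch of the behavior, assuming the metadata is read with a plain `pd.read_csv` call (the column names and values here are made up for illustration and are not taken from the pipeline):

```python
import io

import pandas as pd

# Hypothetical metadata table: "batch" is missing for one sample,
# and "notes" is missing for every sample.
csv_text = """sample,batch,notes
s1,A,
s2,,
s3,B,
"""

df = pd.read_csv(io.StringIO(csv_text))

# "batch" stays string-like, but its missing entry is read as NaN, not ''.
print(df["batch"].tolist())

# "notes" has no values at all, so pandas falls back to a float column of NaN.
print(df["notes"].dtype)
```

The fully empty column is the hard case described above: nothing in the file indicates whether it was meant to hold strings or numbers.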

jmmitc06 commented 8 months ago

This has been addressed in the README as a stopgap solution.

jmmitc06 commented 8 months ago

@gmhhope How does this sound as a solution?

We keep the behavior of interpreting missing values as NaN since switching away from pandas for reading the CSV files will certainly introduce a new dependency or require writing our own solution.

However, any time we see a NaN value in a metadata CSV, we emit a warning to the user to the effect of: "missing field values detected in metadata file, will fill with NaN". Then, during figure generation, NaN becomes "Other" to make it clear that the missing values may or may not represent the same class.
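The proposal could be sketched roughly as follows. The function names `load_metadata` and `labels_for_figure` are hypothetical and not part of the pipeline; this only illustrates the warn-then-relabel idea under the assumption that pandas keeps reading the CSV:

```python
import io
import warnings

import pandas as pd


def load_metadata(source):
    """Read a metadata CSV, warning if any field value is missing."""
    df = pd.read_csv(source)
    if df.isna().any().any():
        warnings.warn(
            "missing field values detected in metadata file, will fill with NaN",
            UserWarning,
        )
    return df


def labels_for_figure(df, field):
    """Return plot labels for `field`, mapping missing values to 'Other'."""
    return ["Other" if pd.isna(value) else str(value) for value in df[field]]


# Usage with a hypothetical two-sample table where one group is missing.
metadata = load_metadata(io.StringIO("sample,group\ns1,A\ns2,\n"))
print(labels_for_figure(metadata, "group"))
```

Keeping NaN in the loaded table and relabeling only at plotting time means numerical fields are left untouched for any downstream computation.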

jmmitc06 commented 8 months ago

@gmhhope

Please let me know if you believe this is a satisfactory solution. I am waiting for your approval before implementing this since you reported the issue.

jmmitc06 commented 7 months ago

@gmhhope, I would have preferred to have your approval on this solution, but since I did not hear back, I'm going with the solution proposed two comments ago.

jmmitc06 commented 3 months ago

Closing this issue as the original person who raised the problem did not respond.

shuzhao-li-lab / PythonCentricPipelineForMetabolomics

Better inferencing of missing metadata fields. #54