Closed sbesson closed 2 years ago
Briefly mentioned as part of today's group meeting. @will-moore pointed out that if a column is truly numeric, it's certainly lossy to turn it back into a `StringColumn`.
This possibly raises the question of how NaN values appear in the UI, e.g. in the `omero_table` endpoint and/or the `Tables` menu.
Also, for background reading, see https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions regarding pandas' decision to use `NaN` as the representation of missing data.
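To make the type promotion described on that page concrete, here is a minimal sketch. The CSV content and column names are made up for illustration and are not taken from the IDR dataset.

```python
import io

import pandas as pd

# Hypothetical CSV: an integer column with one empty cell.
csv_text = "Well,Count\nA1,3\nA2,\nA3,7\n"

df = pd.read_csv(io.StringIO(csv_text))

# The empty cell is read as NaN, and because NaN is a float value
# the whole integer column is promoted to float64.
print(df.dtypes)
print(df["Count"].tolist())
```

The point is only that a single missing value is enough to change the inferred type of a column, which is what the header detection then sees.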
Discovered while testing the HEAD of `omero-metadata`, including the new header detection feature introduced in #67, against an IDR high-content screening dataset.

The annotation CSV contains a combination of biomolecular annotations (Organism, compound name, identifiers) and analytical metadata (features). The feature columns are densely populated but some of the biomolecular annotations are sparse, e.g. `Compound Concentration (microMolar)`. This is expected since several rows correspond to control wells where there is no compound and this metadata is irrelevant.

The current version of the header detection code leads to an issue in this case, as these columns are detected as `Double/Float` and the table population subsequently fails unless `--allow-nan` is passed. With the current code, the workarounds for completing the table population are:

- `--manual_header`
- `#header` row

Ideally, it would be great to allow the plugin to "do the right thing" and handle these scenarios while retaining the automatic header detection to map each column to the most appropriate type. This raises the question of whether there should be a single behavior or whether this would be another option left to the user.
In the IDR use case above, the expectation is that we want to preserve the sparsity rather than populating NaN values. Some downstream processes like the tables -> key/value conversion currently have logic that relies on the emptiness of the values in the table and I expect NaN might cause issues with the current implementation.
There are likely other use cases where the user would like empty values to be stored as `NaN`. And it should be possible to update the transformation of tables into maps to handle `NaN` in the same way we handle empty strings.

Code-wise, it should be possible to make use of the `pandas.read_csv` `keep_default_na` option to map such columns as `object`/`StringColumn` rather than `float`.

Possibly, this is something that could be coupled with the existing `--allow-nan` flag?

cc @muhanadz @pwalczysko @will-moore
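As a rough illustration of the `keep_default_na` idea, the sketch below reads the same sparse CSV twice. The data is invented for the example, and how the plugin would wire this into header detection or map the result to a `StringColumn` is an assumption, not something shown here.

```python
import io

import pandas as pd

# Hypothetical annotation CSV: one sparse column, one dense feature column.
csv_text = (
    "Well,Compound Concentration (microMolar),Feature\n"
    "A1,0.5,1.25\n"
    "A2,,2.50\n"
    "A3,1.0,3.75\n"
)
col = "Compound Concentration (microMolar)"

# Default behaviour: the empty cell becomes NaN and the column is float64.
default = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False (and no na_values), empty cells are kept as
# empty strings, so the sparse column is read as strings instead of floats.
sparse = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default[col].tolist())
print(sparse[col].tolist())
print(sparse.dtypes)
```

Note that the densely populated `Feature` column is still read as a float in both cases, so only the sparse columns would change type under this approach.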