Header detection: default behavior and handling sparse columns

sbesson commented 2 years ago

Discovered while testing the HEAD of omero-metadata, including the new header detection feature introduced in #67, against an IDR high-content screening dataset.

The annotation CSV contains a combination of biomolecular annotations (Organism, compound name, identifiers) and analytical metadata (features). The feature columns are densely populated but some of the biomolecular annotations are sparse, e.g. Compound Concentration (microMolar). This is expected since several rows correspond to control wells where there is no compound, so this metadata is irrelevant for them.

The current version of the header detection code leads to an issue in this case: these columns are detected as Double/Float and the table population subsequently fails unless --allow-nan is passed. With the current code, the workarounds for completing the table population are:

either to disable the automatic header detection with --manual_header
and/or to manually specify the type of the columns using the #header row

Ideally, it would be great to allow the plugin to "do the right thing" and handle these scenarios while retaining the automatic header detection to map each column to the most appropriate type. This raises the question of whether there should be a single behavior or whether this should be another option left to the user.

In the IDR use case above, the expectation is that we want to preserve the sparsity rather than populating NaN values. Some downstream processes like the tables -> key/value conversion currently have logic that relies on the emptiness of the values in the table and I expect NaN might cause issues with the current implementation.

There are likely other use cases where the user would like empty values to be stored as NaN. It should also be possible to update the transformation of tables into maps to handle NaN in the same way we handle empty strings.
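As a rough illustration of that last point, here is a minimal sketch of a conversion step that skips NaN in the same way as empty strings. The helper names `is_missing` and `row_to_key_values` and the sample row are hypothetical and are not taken from the existing tables -> key/value code:

```python
import math


def is_missing(value):
    """Return True for None, empty or whitespace-only strings, and float NaN."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    # numpy.float64 subclasses float, so values read from a table are covered too
    return isinstance(value, float) and math.isnan(value)


def row_to_key_values(row):
    """Turn one table row (a dict) into key/value pairs, dropping missing cells."""
    return [(key, str(value)) for key, value in row.items() if not is_missing(value)]


# Hypothetical control well: no compound, so both compound cells are "empty"
control_row = {
    "Well": "A1",
    "Compound Name": "",
    "Compound Concentration (microMolar)": float("nan"),
}
pairs = row_to_key_values(control_row)
```

With a check like this, a sparse cell would produce no key/value pair regardless of whether it was stored as an empty string or as NaN.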

Code-wise, it should be possible to make use of the pandas.read_csv keep_default_na option to map such columns as object/StringColumn rather than float:

(base) sbesson@Sebastiens-MacBook-Pro /tmp % cat test.csv 
Column1,Column2,Column3
A,1,2
B,,3
C,2,5
(base) sbesson@Sebastiens-MacBook-Pro /tmp % venv/bin/python
Python 3.8.11 (default, Jul 29 2021, 14:57:32) 
[Clang 12.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> df=pandas.read_csv('test.csv')
>>> df
  Column1  Column2  Column3
0       A      1.0        2
1       B      NaN        3
2       C      2.0        5
>>> df.dtypes
Column1     object
Column2    float64
Column3      int64
dtype: object
>>> df=pandas.read_csv('test.csv',keep_default_na=False)
>>> df
  Column1 Column2  Column3
0       A       1        2
1       B                3
2       C       2        5
>>> df.dtypes
Column1    object
Column2    object
Column3     int64
dtype: object

Possibly, this is something that could be coupled with the existing --allow-nan flag?
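A minimal sketch of what such a coupling could look like, assuming a hypothetical `detect_column_types` helper whose `allow_nan` argument stands in for the --allow-nan flag. The dtype-to-column mapping below is an assumption made for illustration and is not the plugin's actual detection logic:

```python
import io

import pandas as pd

# Assumed mapping from numpy dtype kinds to OMERO table column names;
# anything else (e.g. object/string columns) falls back to StringColumn.
KIND_TO_COLUMN = {"i": "LongColumn", "f": "DoubleColumn", "b": "BoolColumn"}


def detect_column_types(csv_source, allow_nan=False):
    """Guess a column type per CSV column.

    With allow_nan=False, empty cells are kept as empty strings, so a sparse
    numeric column is read as object and mapped to StringColumn.
    With allow_nan=True, empty cells become NaN and the column stays numeric.
    """
    df = pd.read_csv(csv_source, keep_default_na=allow_nan)
    return {
        name: KIND_TO_COLUMN.get(dtype.kind, "StringColumn")
        for name, dtype in df.dtypes.items()
    }


# Same content as test.csv above
CSV = "Column1,Column2,Column3\nA,1,2\nB,,3\nC,2,5\n"
sparse_types = detect_column_types(io.StringIO(CSV))
nan_types = detect_column_types(io.StringIO(CSV), allow_nan=True)
```

In this sketch the sparse Column2 would be typed as a string column by default and only as a double column when NaN is explicitly allowed, which keeps a single flag in charge of both the detection and the population behavior.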

cc @muhanadz @pwalczysko @will-moore

sbesson commented 2 years ago

Briefly mentioned as part of today's group meeting. @will-moore mentioned that if a column is truly numeric, it's certainly lossy to turn it back into a StringColumn. This possibly raises the question of how NaN values appear in the UI, e.g. in the omero_table endpoint and/or the Tables menu.

sbesson commented 2 years ago

Also for background reading, see https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions regarding the pandas decision to use NaN as the representation of missing data
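To make the type promotion concrete, here is a small sketch based on the same test CSV. It shows the default promotion of a sparse integer column to float64 and, as one possible alternative that has not been discussed above, the pandas nullable "Int64" extension dtype, which keeps the values as integers. Whether such a dtype could be mapped onto an OMERO table column is an open question and not something this sketch addresses:

```python
import io

import pandas as pd

CSV = "Column1,Column2,Column3\nA,1,2\nB,,3\nC,2,5\n"

# Default behavior: the empty cell becomes NaN and Column2 is promoted to float64
promoted = pd.read_csv(io.StringIO(CSV))

# Nullable integer dtype: the empty cell becomes pd.NA and the values stay integers
nullable = pd.read_csv(io.StringIO(CSV), dtype={"Column2": "Int64"})

promoted_dtype = str(promoted["Column2"].dtype)
nullable_dtype = str(nullable["Column2"].dtype)
missing_mask = nullable["Column2"].isna().tolist()
```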

ome / omero-metadata

Header detection: default behavior and handling sparse columns #76