ome / omero-metadata

OMERO plugin for metadata manipulation https://www.openmicroscopy.org/omero/
GNU General Public License v2.0

Fix header detection for tables with sparse numerical data #77

Closed sbesson closed 2 years ago

sbesson commented 2 years ago

Fixes #76

Reproducible scenario

First create a minimal dataset/image hierarchy e.g. as follows:

touch test1.fake test2.fake
dataset=$(omero obj new Dataset name=sparse_table)
omero import -T $dataset test1.fake test2.fake

CSV files with sparse string data such as the one below are correctly handled by the current HEAD of omero-metadata.

$ cat sparse_string_column.csv 
Image name,meas1,meas2,meas3,meas4
test1.fake,1.1,1,high,low
test2.fake,0.5,2,,low

The columns with missing values are mapped as s/StringColumn and the missing values are turned into empty strings when running omero metadata populate --file sparse_string_column.csv $dataset.

CSV files with sparse numerical columns such as the one below currently fail during the population command:

$ cat sparse_numeric_column.csv
Image name,meas1,meas2,meas3,meas4
test1.fake,1.1,1.2,high,low
test2.fake,,2.1,,low

Here, the meas1 column is currently mapped to a d header type/DoubleColumn by the pandas detection logic. With the default omero metadata populate --file command, the table population fails with `ValueError: Empty Double or Long value. Use --allow_nan to convert to NaN`.
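To illustrate why the sparse numeric column ends up typed as a double, here is a minimal sketch using plain pandas on the CSV above. It is not the plugin's own detection code; it only shows the dtype that `pandas.read_csv` infers for each column and which columns contain missing values.

```python
import io

import pandas as pd

# Same content as sparse_numeric_column.csv above.
CSV_TEXT = """Image name,meas1,meas2,meas3,meas4
test1.fake,1.1,1.2,high,low
test2.fake,,2.1,,low
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Print the inferred dtype of each column and whether it has missing cells.
for name in df.columns:
    print(name, df[name].dtype, bool(df[name].isna().any()))

# meas1 has an empty cell but is still inferred as a floating-point column,
# with the empty cell read as NaN.
print(pd.api.types.is_float_dtype(df["meas1"]))
```

The empty meas1 cell is read as NaN while the column stays floating point, which is consistent with the column being mapped to a DoubleColumn and the population step then rejecting the empty value.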

Proposed changes

Since the library already includes some logic allowing the user to control whether NaN values are allowed in the OMERO.table (introduced in #60), this PR proposes the following changes:

fe73a17d2a71fd7d220c48891a2364110b59f4f1 adds a cosmetic change defining GNU-style aliases of the command-line arguments (--manual-header, --allow-nan) using a hyphen as the separator. The existing underscore-separated flags are preserved.
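As a rough illustration of how hyphenated aliases can coexist with the underscore spellings, here is a sketch using the standard-library `argparse`. The parser name, help strings, and the assumption that both options are boolean switches are illustrative only and are not taken from the plugin's code.

```python
import argparse

# Hypothetical stand-alone parser; the plugin defines its own.
parser = argparse.ArgumentParser(prog="populate-sketch")

# Both spellings are registered for one option and write to the same dest,
# so existing scripts using underscores keep working.
parser.add_argument(
    "--allow-nan", "--allow_nan",
    dest="allow_nan", action="store_true",
    help="Convert empty numeric values to NaN (illustrative help text)")
parser.add_argument(
    "--manual-header", "--manual_header",
    dest="manual_header", action="store_true",
    help="Disable automatic header detection (illustrative help text)")

new_style = parser.parse_args(["--allow-nan"])
old_style = parser.parse_args(["--allow_nan"])
print(new_style.allow_nan, old_style.allow_nan)
```

Listing both option strings in a single `add_argument` call keeps one entry in the help output and avoids duplicating the option's definition.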

Testing

With these changes, annotation of sparse CSV tables using the default header detection should be functional in all cases.

  1. if the tabular data is dense or contains sparse string columns, the behavior of the command should be unchanged
  2. if the tabular data contains sparse numerical columns, the behavior of the command will depend on the --allow-nan flag

    omero metadata populate --file sparse_numeric_column.csv $dataset

    will detect the sparse numeric column as a StringColumn and store the missing values as empty strings

    omero metadata populate --file sparse_numeric_column.csv --allow-nan $dataset

    will detect the sparse numeric column as a DoubleColumn and store missing values as NaN
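The expected behavior in the two cases above can be summarised with a small stand-alone sketch. The function names `detect_column_type` and `convert` are hypothetical and are not the plugin's API; the sketch only distinguishes the d (DoubleColumn) and s (StringColumn) header types and works on raw CSV cell strings.

```python
import math


def is_number(text):
    """Return True if the cell text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def detect_column_type(values, allow_nan=False):
    """Return 'd' (DoubleColumn) or 's' (StringColumn) for raw CSV cells.

    Hypothetical helper: a numeric column with empty cells is only kept
    as a double column when allow_nan is set.
    """
    non_empty = [v for v in values if v != ""]
    numeric = bool(non_empty) and all(is_number(v) for v in non_empty)
    sparse = len(non_empty) < len(values)
    if numeric and (not sparse or allow_nan):
        return "d"
    return "s"


def convert(values, column_type):
    """Convert raw cells: NaN for empty doubles, unchanged strings otherwise."""
    if column_type == "d":
        return [float(v) if v != "" else math.nan for v in values]
    return list(values)


meas1 = ["1.1", ""]  # sparse numeric column from sparse_numeric_column.csv
print(detect_column_type(meas1))                  # default flags
print(detect_column_type(meas1, allow_nan=True))  # with --allow-nan
```

Dense numeric columns and sparse string columns take the same path regardless of the flag in this sketch, matching case 1 above.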

snoopycrimecop commented 2 years ago

Conflicting PR. Removed from build OMERO-plugins-push#1169. See the console output for more details. Possible conflicts:

--conflicts

snoopycrimecop commented 2 years ago

Conflicting PR. Removed from build OMERO-plugins-push#1176. See the console output for more details. Possible conflicts:

--conflicts

Conflict resolved in build OMERO-plugins-push#1179. See the console output for more details.

sbesson commented 2 years ago

Thanks @muhanadz for the review. 27050f2 should amend the README where most of the information about the library usage is captured at the moment.

sbesson commented 2 years ago

Thanks @muhanadz. The new column detection behavior is now released as omero-metadata 0.11.0 🎉