https://github.com/ome/omero-metadata/pull/67 introduces a new strategy based on the pandas library for parsing the columns of a CSV file and choosing the appropriate OMERO.table column types when running populate metadata with the default ParsingContext. The initial implementation was introduced at the MetadataControl level, where it generates a column_types list and passes it to the existing API of the HeaderResolver.
A downside of this approach is that any non CLI-based usage of the new functionality requires the omero_metadata.cli.MetadataControl class to be used, see https://github.com/ome/omero-metadata/pull/67#issuecomment-1082029510. A minimal fix would be to migrate the column type detection logic into the omero_metadata.library module.
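As a rough illustration of what a library-level detection helper might look like, here is a minimal sketch that needs no CLI class. The function name detect_column_types and the single-letter type codes are assumptions made for the example, not the actual code from the PR.

```python
import io

import pandas as pd
from pandas.api import types as pdt


def detect_column_types(df):
    """Return one type code per DataFrame column (hypothetical helper).

    The codes are illustrative placeholders: "b" boolean, "l" integer,
    "d" floating point, "s" anything else (treated as a string).
    """
    codes = []
    for name in df.columns:
        series = df[name]
        if pdt.is_bool_dtype(series):
            codes.append("b")
        elif pdt.is_integer_dtype(series):
            codes.append("l")
        elif pdt.is_float_dtype(series):
            codes.append("d")
        else:
            codes.append("s")
    return codes


# Small in-memory CSV standing in for a real file on disk.
csv_text = "Well,Count,Intensity,Valid\nA1,3,0.5,True\nA2,7,1.25,False\n"
df = pd.read_csv(io.StringIO(csv_text))
column_types = detect_column_types(df)
```

A function of this shape could be called both from the CLI and from any script importing the library, with the resulting list passed on as column_types.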
Capturing a few wider thoughts about the migration of this API down at the library level:
- Are we expecting to support the former column type detection strategy alongside the new approach? If not, the HeaderResolver logic could potentially be deprecated in favor of a new implementation, e.g. HeaderResolver2/PandasResolver/...
- At the moment, the metadata population code makes several full reads of the CSV file even after detecting the columns: first to perform the object resolution and then to populate each row of the table. For very large analytical tables, these multiple reads could become a bottleneck of the annotation workflow (I have no numbers, definitely something worth benchmarking). In this case, a secondary advantage of the pandas approach is that it creates an in-memory representation of the CSV file as a DataFrame, which could then be modified, e.g. by appending columns, and used for generating the table.
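To make the single-read idea concrete, here is a minimal sketch under stated assumptions: the well_ids mapping stands in for the real object resolution step, and the "Well ID" column name is hypothetical. It only shows that one parse of the CSV can serve both the resolution step and the table generation.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large analytical table.
csv_text = "Well,Count\nA1,3\nA2,7\n"

# The CSV is parsed exactly once; later steps reuse the DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical name -> ID lookup standing in for object resolution
# against the server.
well_ids = {"A1": 101, "A2": 102}

# Append the resolved IDs as a new column instead of re-reading the file.
df["Well ID"] = df["Well"].map(well_ids)

# Column-wise values that could then be used to fill the table columns.
column_data = {name: df[name].tolist() for name in df.columns}
```

Whether this actually beats the current multiple-read approach on large files would still need benchmarking, since the whole table is held in memory.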