ome / omero-metadata

OMERO plugin for metadata manipulation https://www.openmicroscopy.org/omero/
GNU General Public License v2.0

populate metadata should strip CSV values #24

Closed will-moore closed 2 years ago

will-moore commented 5 years ago

CSV files may contain values, with, whitespace, which should be removed using value.strip(). This can happen, for example, if a CSV file has been edited in a text editor.

sbesson commented 5 years ago

If it's only the initial whitespace after the delimiter, I would think setting skipinitialspace to True in all csv.reader() invocations might suffice.

I am unsure whether there are concrete use cases where we want to preserve leading whitespace in string values. If there are, this behavior should be made configurable.
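To illustrate the difference between the two suggestions so far, here is a minimal sketch using only the standard library `csv` module. The sample data and the `read_rows` helper are hypothetical and are not taken from the omero-metadata code. It shows that `skipinitialspace=True` only drops spaces immediately after the delimiter, whereas a per-value `strip()` also removes trailing whitespace.

```python
import csv
import io

# Hypothetical sample data: spaces after each delimiter and one trailing space.
SAMPLE = "Well, Gene, Score\nA1, abc , 1.5\n"


def read_rows(text, skipinitialspace=False, strip_values=False):
    """Parse CSV text, optionally skipping initial spaces or stripping values.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=skipinitialspace)
    rows = []
    for row in reader:
        if strip_values:
            # Remove both leading and trailing whitespace from every value.
            row = [value.strip() for value in row]
        rows.append(row)
    return rows


raw = read_rows(SAMPLE)                             # whitespace preserved
initial = read_rows(SAMPLE, skipinitialspace=True)  # leading spaces dropped
stripped = read_rows(SAMPLE, strip_values=True)     # both sides stripped

print(raw[1])
print(initial[1])
print(stripped[1])
```

In this sketch the value `" abc "` keeps its trailing space under `skipinitialspace=True`, so that option alone would only suffice if the stray whitespace is always on the leading side.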

manics commented 5 years ago

I'm not sure about this. It's definitely convenient in this case, but it changes the script from one for transforming CSVs to one that actively does data cleaning and wrangling, and means the raw data may not exactly match that in the bulk annotations. For example, in the IDR, the CSVs in https://github.com/IDR/idr-metadata are the original data, and part of the import process is to clean them up if necessary.

How about an optional validate/lint command/flag that warns or errors if common errors are detected?
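As a rough sketch of what such a lint step could look like, the following reports values with leading or trailing whitespace without modifying them. The `lint_whitespace` function and the sample data are hypothetical; this is not an existing omero-metadata command or flag.

```python
import csv
import io


def lint_whitespace(text):
    """Return (line_number, column_index, value) for each padded value.

    Hypothetical helper for illustration only. The input is not modified.
    """
    problems = []
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        for col, value in enumerate(row):
            if value != value.strip():
                # reader.line_num is the number of source lines read so far.
                problems.append((reader.line_num, col, value))
    return problems


# Hypothetical sample data: one leading space and one trailing space.
SAMPLE = "Well,Gene\nA1, abc\nA2,def \nA3,ghi\n"

problems = lint_whitespace(SAMPLE)
for line_num, col, value in problems:
    print(f"line {line_num}, column {col}: {value!r} has surrounding whitespace")
```

A caller could treat a non-empty result as a warning or as an error, depending on how strict the check should be, which leaves the raw CSV untouched either way.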

will-moore commented 5 years ago

@sbesson Thanks for the skipinitialspace pointer. I thought there should be something like that available, but I didn't spot it. I agree, it seems reasonable to add configuration for this once we know of any concrete use cases.

@manics I wouldn't see this as "cleaning"; it's really just allowing your CSV file to have spaces, which some do. But I'm not sure we need this for any particular Training workflows (cc @pwalczysko) other than the ones I was trying?

pwalczysko commented 5 years ago

I think the main method for creating CSV files for our "basic" users would be to go via Excel. These CSVs are created "correctly", with no whitespace.

Maybe this is not worth the effort then?

manics commented 5 years ago

@will-moore Anything that involves manipulation of the raw data into a form in which it can be parsed counts as cleaning. In this example it's fairly innocuous, but in other situations people may be using a CSV as input to multiple pipelines, not just OMERO. Applications which clean data in different ways will lead to results which don't match as expected, so we need to be clear if we're automatically making corrections to data.