will-moore closed this 2 years ago
If it's only the initial whitespace after the delimiter, I would think setting skipinitialspace to True in all csv.reader() invocations might suffice.
I am unsure whether there are concrete use cases where we want to preserve leading whitespace in string values. If there are, this behavior should be made configurable.
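As a rough illustration of the suggestion above, here is a minimal sketch of what skipinitialspace changes when passed to csv.reader(). The sample data is made up for this example and is not taken from the script under discussion.

```python
import csv
import io

# Made-up sample: a space follows each delimiter, and the last value
# also has a trailing space.
raw = "name, count, note\nimage1, 3, ok \n"

# Default behaviour: whitespace after the delimiter is kept in the value.
default_rows = list(csv.reader(io.StringIO(raw)))

# skipinitialspace=True: spaces immediately after the delimiter are dropped.
skipped_rows = list(csv.reader(io.StringIO(raw), skipinitialspace=True))

print(default_rows)
print(skipped_rows)
```

Note that only the space after the delimiter is affected; the trailing space in the last value is kept in both cases.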
I'm not sure about this. It's definitely convenient in this case, but it changes the script from one for transforming CSVs to one that actively does data cleaning and wrangling, and means the raw data may not exactly match that in the bulk annotations. For example in the IDR the CSVs in https://github.com/IDR/idr-metadata are the original data, and part of the import process is to clean them up if necessary.
How about an optional validate/lint command/flag that warns or errors if common errors are detected?
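One possible shape for such a check, as a sketch only: the function name find_padded_cells and its return format are hypothetical and not part of any existing script. It reports cells with leading or trailing whitespace without modifying the data.

```python
import csv
import io


def find_padded_cells(csv_text):
    """Return (line_number, column_index, value) for each cell that has
    leading or trailing whitespace. Hypothetical helper for illustration."""
    issues = []
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        for col, value in enumerate(row):
            if value != value.strip():
                # reader.line_num is the number of source lines read so far.
                issues.append((reader.line_num, col, value))
    return issues


# Made-up sample input.
sample = "name, count\nimage1,3\nimage2, 4 \n"
issues = find_padded_cells(sample)
for line_num, col, value in issues:
    print(f"line {line_num}, column {col}: {value!r} has surrounding whitespace")
```

A caller could then decide whether to treat a non-empty result as a warning or as an error, which leaves the raw CSV untouched.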
@sbesson Thanks for the skipinitialspace pointer. I thought there should be something like that available but I didn't spot it.
I agree, it seems reasonable to add configuration for this once we know of any concrete use cases.
@manics I wouldn't see this as "cleaning", it's really just allowing your csv file to have spaces, which some do.
But I'm not sure we need this for any particular Training workflows, other than the ones I was trying? cc @pwalczysko
I think the main method for creating CSV files for our "basic" users would be to go via Excel. These CSVs are created "correctly", with no extra whitespace.
Maybe this is not worth the effort then?
@will-moore Anything that involves manipulation of the raw data into a form in which it can be parsed counts as cleaning. In this example it's fairly innocuous, but in other situations people may be using a CSV as input to multiple pipelines, not just OMERO. Applications which clean data in different ways will lead to results which don't match as expected, so we need to be clear if we're automatically making corrections to data.
CSV files may contain rows such as "values, with, whitespace", where the whitespace should be ignored by calling value.strip(), e.g. if a CSV file has been edited in a text editor.
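For reference, a minimal sketch of the value.strip() approach described here. The helper name read_stripped_rows and the sample row are hypothetical. Unlike skipinitialspace, this removes trailing whitespace as well, and it also strips whitespace inside quoted values, which may or may not be wanted.

```python
import csv
import io


def read_stripped_rows(csv_text):
    """Parse CSV text and strip leading/trailing whitespace from every value.
    Hypothetical helper for illustration."""
    reader = csv.reader(io.StringIO(csv_text))
    return [[value.strip() for value in row] for row in reader]


# Made-up sample row with whitespace before and after values.
rows = read_stripped_rows("values , with,  whitespace \n")
print(rows)
```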