lisad / phaser

The missing layer for complex data batch integration pipelines
MIT License

Default to dropping rows with only whitespace #118

Closed lisad closed 5 months ago

lisad commented 6 months ago

My first obstacle in getting the boston pipeline working again is the two lines at the end of the file that contain only whitespace. They cause a column error because those rows have no value for COUNT_TYPE from the list of allowed values.

Making this column drop rows with a COUNT_TYPE of None might sometimes work, but it might not be appropriate: it might be that if a row that has OTHER values has a COUNT_TYPE of None or another disallowed value, that should raise an error and stop the pipeline. So it would be more careful to drop rows that are entirely blank, and keep the strict validation of allowed values for this one column.
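
To make the distinction concrete, here is a minimal sketch in plain Python. It is not phaser's API: `is_blank`, `clean`, the `SITE` column, and the values in `ALLOWED_COUNT_TYPES` are all made up for illustration, since the real allowed list isn't given here. Rows are assumed to be dicts, as `csv.DictReader` would produce.

```python
# Hypothetical allowed values; the real list for COUNT_TYPE is not shown in this issue.
ALLOWED_COUNT_TYPES = {"A", "B"}


def is_blank(row):
    """True when every value in the row is None, empty, or only whitespace."""
    return all(value is None or str(value).strip() == "" for value in row.values())


def clean(rows):
    """Drop entirely blank rows, then validate COUNT_TYPE on everything that remains."""
    kept = []
    for number, row in enumerate(rows, start=1):
        if is_blank(row):
            continue  # no data at all: drop without raising
        count_type = row.get("COUNT_TYPE")
        if count_type not in ALLOWED_COUNT_TYPES:
            # The row has other values, so a bad COUNT_TYPE is a real error.
            raise ValueError(f"row {number}: COUNT_TYPE {count_type!r} is not allowed")
        kept.append(row)
    return kept
```

With this ordering, a trailing whitespace-only row disappears quietly, while a row like `{"SITE": "Main St", "COUNT_TYPE": None}` still stops the run.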

While doing this, we should consider whether a row whose values are all empty (i.e. a row of just commas) should also be dropped, or whether that's a different case. I think the library could default to dropping rows in both cases.
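
For reference, this is how the two cases look when parsed with Python's standard `csv` module (a sketch only; the sample data and the `is_blank` helper are invented for illustration and say nothing about how phaser reads files). A whitespace-only line comes back as a single field containing the whitespace, while a line of just commas comes back as several empty fields, so one predicate can cover both.

```python
import csv
import io


def is_blank(fields):
    """True for [], for fields that are all empty, and for whitespace-only fields."""
    return all(field.strip() == "" for field in fields)


# Invented sample: a header, one data row, a whitespace-only line, a commas-only line.
text = "SITE,COUNT_TYPE\nMain St,B\n   \n,\n"

parsed = list(csv.reader(io.StringIO(text)))
# parsed == [['SITE', 'COUNT_TYPE'], ['Main St', 'B'], ['   '], ['', '']]

kept = [fields for fields in parsed if not is_blank(fields)]
# kept == [['SITE', 'COUNT_TYPE'], ['Main St', 'B']]
```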

lisad commented 5 months ago

Dropping fully empty rows works now, but a row of just commas does not work yet. There's a test for it (test_csv.py::test_empty_line_only_commas), but it's skipped for now.