Support converting data with duplicate headers values into datasets

techascent / tech.ml.dataset

A Clojure high performance data processing system

Eclipse Public License 1.0

680 stars 35 forks source link

Support converting data with duplicate headers values into datasets #386

Closed ezand closed 11 months ago

ezand commented 11 months ago

Issue: If the source data contains duplicate header values, an exception will be thrown when converting it to a dataset.

Proposed solution:

Add new options :ensure-unique-column-names? and :unique-column-name-fn.
- If :ensure-unique-column-names?=true, make sure all headers values are unique before further dataset processing. unique-column-name-fn will be applied to all non-unique headers. In many cases this might be better than failing with an exception, making it impossible to read for example an Excel file without modifying it.
Also bump fastexcel version (this can be extracted to it's own PR if preferred).

harold commented 11 months ago

I didn't test it, but this looks like an improvement to me upon skimming it.

harold commented 11 months ago

I vaguely recall there being similar issues with other formats (csv/tsv, parquet?). I wonder if something like this should be handled at a higher level?

ezand commented 11 months ago

I wonder if something like this should be handled at a higher level?

Thanks for the tip @harold. I'll look into that and see if we can lift this functionality a bit further up

harold commented 11 months ago

For sure! No worries. It might turn out to be a lot more complicated. Landing this at this level could totally make sense, and then shifting it up later if there's demand.

Cool contribution, thank you!

ezand commented 11 months ago

@harold I moved the uniquess-logic to tech.v3.dataset.io.context/options->col-idx-parse-context instead. I think actually it made the solution even cleaner this way 🤷 What do you think?

I would appreciate if someone else tested a bit before merging since we're touching upon some "core" functionality. I believe the automated tests are passing, but wouldn't hurt with some manual sanity checks.

harold commented 11 months ago

@cnuernber - now that this got a bit abstract, I wonder how it looks to you.

cnuernber commented 11 months ago

I think it looks find - it didn't change existing systems and fixes an issue with spreadsheets. People may want control over the exact algorithm used to keep columns unique so for instance guaranteeing that rand can't return a already existing number. I think they can file issues for that if it comes to it.

ezand commented 11 months ago

Thanks for reviewing this guys!

People may want control over the exact algorithm used to keep columns unique so for instance guaranteeing that rand can't return a already existing number. I think they can file issues for that if it comes to it.

@cnuernber In the second commit in the PR I actually added support for using a custom fn for keeping columns unique :unique-column-name-fn: https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L90

If that was what you referred to 😊

cnuernber commented 11 months ago

That is what I meant - nice work :-).

harold commented 11 months ago

This stuff has come up in the past:

https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/Version.207.2E000/near/370146747

https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/Version.207.2E000/near/367243189

This change will hopefully ameliorate some suffering in the future. :bow: