Closed ezand closed 11 months ago
I didn't test it, but this looks like an improvement to me upon skimming it.
I vaguely recall there being similar issues with other formats (csv/tsv, parquet?). I wonder if something like this should be handled at a higher level?
> I wonder if something like this should be handled at a higher level?
Thanks for the tip @harold. I'll look into that and see if we can lift this functionality a bit further up
For sure! No worries. It might turn out to be a lot more complicated. Landing this at this level could totally make sense, and then shifting it up later if there's demand.
Cool contribution, thank you!
@harold I moved the uniqueness logic to tech.v3.dataset.io.context/options->col-idx-parse-context instead. I think it actually made the solution even cleaner this way 🤷 What do you think?
I would appreciate it if someone else tested this a bit before merging, since we're touching some "core" functionality. I believe the automated tests are passing, but a few manual sanity checks wouldn't hurt.
@cnuernber - now that this got a bit abstract, I wonder how it looks to you.
I think it looks fine - it didn't change existing systems and fixes an issue with spreadsheets. People may want control over the exact algorithm used to keep columns unique, for instance to guarantee that rand can't return an already existing number. I think they can file issues for that if it comes to it.
Thanks for reviewing this guys!
> People may want control over the exact algorithm used to keep columns unique, for instance to guarantee that rand can't return an already existing number. I think they can file issues for that if it comes to it.
@cnuernber In the second commit in the PR I actually added support for using a custom fn for keeping columns unique, :unique-column-name-fn:
https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset.clj#L90
If that was what you referred to 😊
That is what I meant - nice work :-).
This stuff has come up in the past:
This change will hopefully ameliorate some suffering in the future. :bow:
Issue: If the source data contains duplicate header values, an exception will be thrown when converting it to a dataset.
Proposed solution:
- Add the options :ensure-unique-column-names? and :unique-column-name-fn.
- If :ensure-unique-column-names? = true, make sure all header values are unique before further dataset processing. unique-column-name-fn will be applied to all non-unique headers. In many cases this might be better than failing with an exception, which makes it impossible to read, for example, an Excel file without modifying it.
- Update the fastexcel version (this can be extracted to its own PR if preferred).
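
For readers skimming the thread, here is a minimal standalone sketch of the kind of header de-duplication described above. It is not the code from this PR and does not call tech.ml.dataset: the names `default-unique-name` and `ensure-unique-headers` are hypothetical, the `[col-idx colname]` argument order of the renaming fn is an assumption, and headers are assumed to be strings. It also re-checks the renamed value against the names already used, which is one way to address the collision concern raised in the review.

```clojure
(defn default-unique-name
  "Hypothetical default renaming fn: appends the column index to the name."
  [col-idx colname]
  (str colname "-" col-idx))

(defn ensure-unique-headers
  "Returns a vector of headers in which every name is distinct.
  The first occurrence of a name is kept as-is. A later duplicate is passed
  to `rename-fn` (called with the column index and the original name).
  If the renamed value is itself already taken, a counter suffix is tried
  until a free name is found, so a collision cannot slip through."
  ([headers] (ensure-unique-headers headers default-unique-name))
  ([headers rename-fn]
   (loop [idx 0
          remaining (seq headers)
          seen #{}
          out []]
     (if-let [[h & more] remaining]
       (let [;; only rename when the header has been seen before
             candidate (if (contains? seen h) (rename-fn idx h) h)
             ;; guard against rename-fn returning a name that is in use
             final (if (contains? seen candidate)
                     (first (remove seen
                                    (map #(str candidate "-" %)
                                         (iterate inc 1))))
                     candidate)]
         (recur (inc idx) more (conj seen final) (conj out final)))
       out))))

(ensure-unique-headers ["id" "name" "id" "name"])
;; => ["id" "name" "id-2" "name-3"]

;; A custom renaming fn that always returns the same suffix still
;; yields distinct names because of the collision guard.
(ensure-unique-headers ["a" "a" "a"] (fn [_ n] (str n "_dup")))
;; => ["a" "a_dup" "a_dup-1"]
```

Keeping the renaming fn pluggable while doing the "is this name free?" check in one place means a user-supplied fn, even a random one, cannot reintroduce a duplicate.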