scicloj / tablecloth

Dataset manipulation library built on top of tech.ml.dataset
https://scicloj.github.io/tablecloth
MIT License

Unexpected results when instantiating a dataset from array of 2-element arrays where first element is a string or keyword #142

Closed: kirahowe closed this issue 8 months ago

kirahowe commented 8 months ago

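One would expect the following to create a dataset with 2 columns and 3 rows, populated with the data from each tuple. Instead you get an unexpected result: a dataset with 3 columns and 1 row, where the first element of each tuple is used as a column name and `:column-names` has no visible effect.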

(tc/dataset [["a" 2] ["b" 3] ["c" 4]] {:column-names ["Col A" "Col B"]})
;; >>>
| a | b | c |
|--:|--:|--:|
| 2 | 3 | 4 |

(as opposed to)

| Col A | Col B |
|------:|------:|
|     a |     2 |
|     b |     3 |
|     c |     4 |

The offending lines of code are here: https://github.com/scicloj/tablecloth/blob/master/src/tablecloth/api/dataset.clj#L75-L79

One possible solution is to remove this special handling of 2-element iterables and issue a breaking release. Another is to handle only the case where a map is given (which seems to be what this code is trying to catch, I think?) rather than applying the same logic to all 2-element seqs; that would still be a breaking change.

I'd be happy to contribute a PR once a decision is made about whether and how to address this issue.
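
In the meantime, one way to sidestep the pair handling might be to turn each tuple into a map keyed by column name before building the dataset, so the input is a seq of maps rather than a seq of 2-element vectors. The sketch below uses only core Clojure; the names `column-names`, `rows` and `row-maps` are illustrative, and the final `tc/dataset` call is left as a comment because it assumes tablecloth is required as `tc`.

```clojure
;; Sketch of a workaround: label each row tuple with the column names.
(def column-names ["Col A" "Col B"])
(def rows [["a" 2] ["b" 3] ["c" 4]])

;; zipmap pairs each column name with the value at the same position in the row.
(def row-maps
  (mapv #(zipmap column-names %) rows))

;; row-maps is now a vector of three maps, one per intended row.
;; Assumption: with tablecloth required as tc, (tc/dataset row-maps)
;; would then build one dataset row per map.
```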

genmeblog commented 8 months ago

Yes... As we've discussed on Zulip, the solution proposed by jsa is probably the best here.

If :column-names is defined, treat a seq of pairs as rows; otherwise, treat it as it is treated now.
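
To make the proposed rule concrete, here is a minimal sketch in plain Clojure. It is not tablecloth's code: `interpret-pairs` is a hypothetical helper that returns a map of column name to column values, and the "current" branch only models the behaviour visible in the output reported above, where each pair acts as a column name and a single value.

```clojure
(defn interpret-pairs
  "Hypothetical helper modelling the proposal, not tablecloth's implementation.
  Takes a seq of 2-element vectors and an options map, and returns a map of
  column name -> vector of column values."
  [data {:keys [column-names]}]
  (if column-names
    ;; :column-names given: each pair is a row, so transpose the rows into
    ;; columns and label them with the supplied names.
    (zipmap column-names (apply mapv vector data))
    ;; No :column-names: keep the behaviour from the report, where each pair
    ;; is read as [column-name value] and yields a one-row column.
    (into {} (map (fn [[k v]] [k [v]])) data)))

(def pairs [["a" 2] ["b" 3] ["c" 4]])

(interpret-pairs pairs {:column-names ["Col A" "Col B"]})
;; => {"Col A" ["a" "b" "c"], "Col B" [2 3 4]}

(interpret-pairs pairs {})
;; => {"a" [2], "b" [3], "c" [4]}
```

Under this rule, existing calls that omit :column-names keep their current result, so only callers who pass :column-names together with a seq of pairs would see a change.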

What do you think?

genmeblog commented 8 months ago

Zulip discussion: https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/tablecloth/near/427820581