scicloj / tablecloth

Dataset manipulation library built on top of tech.ml.dataset
https://scicloj.github.io/tablecloth
MIT License

new feature for split: split into more than 2 #28

Closed behrica closed 3 years ago

behrica commented 3 years ago

We could add another split type which works like :holdout but splits into 3 (or X) parts. Maybe it could just be of type :split, taking :num-splits 3 :ratios [0.2 0.2 0.6]

and the returned sets would simply be called split-0, split-1, ...

This would complete the functionality for the case of splitting into train, test and validation sets.

X > 3 is possibly not needed, but X = 3 is rather common.
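
To make the proposal concrete, here is a minimal sketch in plain Clojure of cutting a collection into parts whose sizes follow a vector of ratios. The name split-by-ratios is hypothetical and this is not tablecloth's implementation; it works on any sequence (for example row indices) rather than on a dataset, and the number of splits is simply taken from the length of the ratios vector.

```clojure
(defn split-by-ratios
  "Splits coll into (count ratios) consecutive parts whose sizes follow ratios.
  Hypothetical helper for illustration, not part of tablecloth."
  [ratios coll]
  (let [v     (vec coll)
        n     (count v)
        total (reduce + ratios)
        ;; cumulative cut points, e.g. [0.2 0.2 0.6] on 10 items -> [0 2 4 10]
        cuts  (mapv #(Math/round (double (* n (/ % total))))
                    (reductions + 0 ratios))]
    (into {}
          (map-indexed (fn [i [from to]]
                         [(keyword (str "split-" i)) (subvec v from to)])
                       (partition 2 1 cuts)))))

;; shuffle first so that the parts are random rather than positional
(def splits (split-by-ratios [0.2 0.2 0.6] (shuffle (range 10))))
```

Because the cut points are computed from cumulative ratios, the parts always cover the whole input exactly once, even when rounding shifts a boundary by one element.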

genmeblog commented 3 years ago

Omg, I was just thinking that someone will come and say: give me more splits! :)

genmeblog commented 3 years ago

It's loosely related to #27

behrica commented 3 years ago

I agree. This could be solved by having "partition-by" and an easy way to generate a column from a distribution of values:

:train 0.5 :test 0.4 :val 0.1

I like this.
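
A rough sketch of that idea in plain Clojure, assuming nothing about tablecloth: draw one label per row according to the map of probabilities, then group the rows by the drawn label. Both function names are hypothetical, and rows is any sequence rather than a dataset. Note that, unlike a ratio-based split, the part sizes only approximately follow the probabilities.

```clojure
(defn weighted-label
  "Picks one key of dist (a map label -> probability) at random.
  rnd is a java.util.Random so that results can be made reproducible."
  [^java.util.Random rnd dist]
  (let [total (reduce + (vals dist))
        r     (* total (.nextDouble rnd))]
    ;; walk the entries, accumulating probability until r is covered
    (loop [[[label p] & more] (seq dist)
           acc 0.0]
      (let [acc (+ acc p)]
        (if (or (< r acc) (empty? more))
          label
          (recur more acc))))))

(defn partition-by-distribution
  "Groups rows into a map label -> rows, drawing one label per row.
  Hypothetical helper for illustration, not a tablecloth function."
  [rnd dist rows]
  (group-by (fn [_] (weighted-label rnd dist)) rows))

(def parts
  (partition-by-distribution (java.util.Random. 42)
                             {:train 0.5 :test 0.4 :val 0.1}
                             (range 1000)))
```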

behrica commented 3 years ago

Do we have this somewhere? A way to create a column of values following a specific distribution (given as a map)?

(tc/create-column-from-distribution ds :type {:train 0.5 :test 0.4 :val 0.1})

behrica commented 3 years ago

Is there a more general pattern? Something like create-column-from-distribution?

behrica commented 3 years ago

Looking here: https://generateme.github.io/fastmath/fastmath.random.html#var-distribution

behrica commented 3 years ago

and seeing this: :integer-discrete-distribution (:data :probabilities)

behrica commented 3 years ago

Could we have a method which works with all (or some) distributions from fastmath? I did not fully understand the distribution feature in fastmath, but it seems to me that there may be an abstraction which could easily be used here to fill a column given a distribution (and distribution-specific parameters).

behrica commented 3 years ago

I found it:

(require '[fastmath.random :as rnd])
(take 20 (repeatedly #(rnd/sample (rnd/distribution :integer-discrete-distribution {:data [0 1 2] :probabilities [0.5 0.4 0.1]}))))

genmeblog commented 3 years ago

Yep, that's it. We can start to think about such columnar operations / transformations / fillers etc. My idea was to introduce tablecloth.transform and similar functions to make a bunch of such operations available.

genmeblog commented 3 years ago

Generally, every distribution in fastmath follows the same protocol, so we can provide more general "fillers" if necessary (this falls into the category of columnar functions).

behrica commented 3 years ago

Currently we can use this code:

(api/add-column ds :type (repeatedly #(rnd/sample (rnd/distribution :integer-discrete-distribution {:data [0 1 2] :probabilities [0.5 0.4 0.1]}))))

behrica commented 3 years ago

Maybe this is even good enough? Would a dedicated new function be much better?

behrica commented 3 years ago

something like:

tc/fill-from-distribution

It could save typing "repeatedly" and "sample".
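
As a rough illustration of what such a helper would hide, here is a sketch in plain Clojure where the "distribution" is just any zero-argument sampling function and the dataset is modelled as a vector of row maps. The name fill-from-sampler is hypothetical; neither it nor the row-map representation reflects tablecloth's or fastmath's actual API.

```clojure
(defn fill-from-sampler
  "Adds column col to rows (a vector of maps), calling sampler once per row.
  Hypothetical helper for illustration, not a tablecloth function."
  [rows col sampler]
  (mapv #(assoc % col (sampler)) rows))

;; seeded generator so that the example is reproducible
(def ^java.util.Random rnd (java.util.Random. 7))

(def rows [{:id 0} {:id 1} {:id 2}])

;; any sampler works the same way: a continuous one ...
(def with-noise (fill-from-sampler rows :noise #(.nextGaussian rnd)))

;; ... or a discrete one
(def with-fold (fill-from-sampler rows :fold #(.nextInt rnd 3)))
```

The caller no longer writes "repeatedly" or counts rows; swapping the sampler is the only change needed to use a different distribution.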

behrica commented 3 years ago

Yep, that's it. We can start to think about such columnar operations / transformations / fillers etc. My idea was to introduce tablecloth.transform and similar functions to make a bunch of such operations available.

What do you mean by

api/xxxx

vs

tablecloth.transform/xxx

?

(api/fill-from-distribution :type (rnd/distribution :integer-discrete-distribution {:data [0 1 2] :probabilities [0.5 0.4 0.1]}))

vs

(tablecloth.transform/distribution-transformer ..... )

Or is it only a question of "api ns" vs "other ns"?

genmeblog commented 3 years ago

Well... maybe put it this way: by API I mean public functions in TC (regardless of where they live; it can be tablecloth.api or tablecloth.pipeline, etc.). By api/... I mean functions exposed by tablecloth.api. If we introduce another namespace, say tablecloth.transformer or tablecloth.functions or whatever, we can decide whether to expose them in tablecloth.api or not. I don't actually know yet. Maybe it's a good idea to have everything in tablecloth.api, maybe not. I would postpone this decision for as long as possible.

genmeblog commented 3 years ago

It's done in 5.02; please refer to the holdout examples.