Closed: behrica closed this 3 years ago
Omg, I was just thinking that someone will come and say: give me more splits! :)
It's loosely related to #27
I agree. This could be solved by having "partition-by" and an easy way to generate a column from a distribution of values:
:train 0.5 :test 0.4 :val 0.1
I like this.
Do we have this somewhere? Create a column of values following a specific distribution (given as a map)?
(tc/create-column-from-distribution ds :type {:train 0.5 :test 0.4 :val 0.1} )
Is there a more general pattern ? create-column-from-distribution ?
and seeing this: :integer-discrete-distribution (:data :probabilities),
Could we have a method which works with all (or some) distributions from fastmath? I did not fully understand the distribution feature in fastmath, but it seems to me that there may be an abstraction which could easily be used here to fill a column given a distribution (and distribution-specific parameters).
I found it:
(require '[fastmath.random :as rnd])
(take 20 (repeatedly #(rnd/sample (rnd/distribution :integer-discrete-distribution {:data [0 1 2] :probabilities [0.5 0.4 0.1] }))))
Yep, that's that. We can start to think about such columnar operations / transformations / fillers etc... That was my idea behind bringing in tablecloth.transform and similar functions, to make a bunch of such operations available.
Generally, every distribution in fastmath follows the same protocol, so we can provide more general "fillers" if necessary (they fall into the category of columnar functions).
Currently we can use this code:
(api/add-column ds :type (repeatedly #(rnd/sample (rnd/distribution :integer-discrete-distribution {:data [0 1 2] :probabilities [0.5 0.4 0.1] })) ))
Maybe this is even good enough ? Could we get a lot better with a specific new method ?
Something like tc/fill-from-distribution could save typing "repeatedly" and "sample".
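To make the idea concrete, here is a minimal sketch of such a "filler" in plain Clojure. It takes the distribution as a map of label to probability, as in the {:train 0.5 :test 0.4 :val 0.1} example. The names sample-label and labels-from-distribution are hypothetical and not part of tablecloth or fastmath, and the sampling is a simple cumulative-probability walk rather than fastmath's distribution protocol.

```clojure
;; Hypothetical helpers, not part of tablecloth or fastmath.
;; Assumes the probabilities in `dist` sum to 1.

(defn sample-label
  "Draws one label from `dist`, a map of label -> probability.
  `r` is a number in [0, 1); it defaults to (rand)."
  ([dist] (sample-label dist (rand)))
  ([dist r]
   (loop [[[label p] & more] (seq dist)
          acc 0.0]
     (let [acc (+ acc p)]
       ;; Return the first label whose cumulative probability exceeds r.
       ;; Falling back to the last label guards against rounding error.
       (if (or (< r acc) (empty? more))
         label
         (recur more acc))))))

(defn labels-from-distribution
  "Returns a vector of `n` labels drawn independently from `dist`."
  [dist n]
  (vec (repeatedly n #(sample-label dist))))
```

With a dataset ds, the result could then be passed to the existing add-column, for example (api/add-column ds :type (labels-from-distribution {:train 0.5 :test 0.4 :val 0.1} (api/row-count ds))), assuming row-count is available in the version in use.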
Yep, that's that. We can start to think about such columnar operations / transformations / fillers etc... That was my idea behind bringing in tablecloth.transform and similar functions, to make a bunch of such operations available.
What do you mean by api/xxxx vs tablecloth.transform/xxx?
(api/fill-from-distribution :type (rnd/distribution :integer-discrete-distribution {:data [0 1 2] :probabilities [0.5 0.4 0.1] }))
vs
(tablecloth.transform/distribution-transformer ..... )
Or is it only a question of "api ns" vs "other ns"?
Well... maybe put it this way: by API I mean the public functions in TC (regardless of where they live; it can be tablecloth.api or tablecloth.pipeline, etc.). By api/... I mean functions exposed by tablecloth.api. If we introduce another namespace, say tablecloth.transformer or tablecloth.functions or whatever, we can decide whether or not to expose them in tablecloth.api. I don't know right now, actually. Maybe it's a good idea to have everything in tablecloth.api, maybe not. I would postpone this decision for as long as possible.
It's done in 5.02, please refer to the holdout examples.
We could add one other split which works like :holdout but splits in 3 (or X). It could just be of type :split, taking :num-splits 3 :ratios [0.2 0.2 0.6]
and the returned sets would just be called split-0, split-1, ...
That would complete the functionality for the case of splitting into train, test and validation.
X > 3 is probably not needed, but X = 3 is rather common.
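As a rough illustration of the proposed :ratios behaviour, the sketch below splits a plain collection of rows into consecutive parts after shuffling. It is not how tablecloth implements :holdout. The function name split-by-ratios is hypothetical, the split-0, split-1, ... keys follow the naming suggested above, and the ratios are assumed to sum to 1.

```clojure
;; Hypothetical helper, not part of tablecloth.
;; Assumes `ratios` sums to 1; the last split absorbs any rounding remainder.

(defn split-by-ratios
  "Shuffles `rows` and cuts them into (count ratios) consecutive parts,
  returned as a map {:split-0 [...], :split-1 [...], ...}."
  [ratios rows]
  (let [rows   (vec (shuffle rows))
        n      (count rows)
        ;; Cumulative ratios -> row indices where each split ends.
        bounds (->> (reductions + ratios)
                    (map #(Math/round (double (* n %))))
                    butlast)
        ;; Always start at 0 and end at n so every row lands in a split.
        cuts   (concat [0] bounds [n])]
    (into {}
          (map-indexed
           (fn [i [from to]]
             [(keyword (str "split-" i)) (subvec rows from to)])
           (partition 2 1 cuts)))))

;; Example: 10 rows with ratios [0.2 0.2 0.6] give splits of 2, 2 and 6 rows.
(def example-splits (split-by-ratios [0.2 0.2 0.6] (range 10)))
```

Because the boundaries come from cumulative ratios, the same function covers the X = 3 case (train, test, validation) and any larger X without special handling.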