tbreloff / OnlineAI.jl

Machine learning for sequential/streaming data

Splitting the data #1

Open Evizero opened 8 years ago

Evizero commented 8 years ago

We should probably exchange ideas on data sampling / splitting.

The approach I am currently using is very memory-friendly for huge datasets: shuffle the data matrix in-place and then make contiguous array views.

Let X be a 10x10000000 Array{Float64,2}

```julia
julia> @time shuffle!(X)
elapsed time: 1.202112921 seconds (80 bytes allocated)

julia> @time train = view(X, :, 1:7000000)
elapsed time: 1.7596e-5 seconds (192 bytes allocated)

julia> @time test = view(X, :, 7000001:10000000)
elapsed time: 1.3097e-5 seconds (192 bytes allocated)
```

I couldn't find a better way so far. It does have its limitations, though, when the sampling should be sensitive to the class distribution.
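One way around that limitation, sketched below under my own assumptions (labels in a vector `y` aligned with the columns of `X`; this is not code from either package): instead of shuffling `X` in-place, shuffle the *column indices* per class and split each class separately, so both splits preserve the class proportions. The views are then non-contiguous but still copy-free.

```julia
using Random

# Stratified train/test column indices: shuffle within each class,
# then take the first `train_frac` of each class for training.
function stratified_indices(y::AbstractVector, train_frac::Real)
    train_idx = Int[]
    test_idx  = Int[]
    for c in unique(y)
        idx = shuffle(findall(==(c), y))      # columns of this class, shuffled
        n_train = round(Int, train_frac * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx,  idx[n_train+1:end])
    end
    return train_idx, test_idx
end

X = rand(10, 1_000)
y = rand(1:3, 1_000)                          # hypothetical class labels
train_idx, test_idx = stratified_indices(y, 0.7)
train = view(X, :, train_idx)                 # no copy, class-balanced
test  = view(X, :, test_idx)
```

The trade-off versus the in-place shuffle is that the views index a permutation rather than a contiguous range, so iteration is slightly less cache-friendly.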

tbreloff commented 8 years ago

See the nnet/data.jl source file... I'm using the idea of fixed arrays of DataPoint objects, and wrappers that access the arrays in different ways. I haven't benchmarked it thoroughly, though.
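A minimal sketch of that "fixed array + wrappers" idea, with illustrative names only (the real types live in nnet/data.jl and may differ): the data points are stored once in a plain vector, and lightweight wrappers reorder or subset them by index without copying.

```julia
using Random

# Hypothetical data point: one input/target pair.
struct DataPoint
    x::Vector{Float64}   # input features
    y::Vector{Float64}   # target values
end

# Wrapper exposing a subset of a fixed data array in a given order,
# without copying the underlying DataPoints.
struct DataSubset
    data::Vector{DataPoint}
    indices::Vector{Int}
end
Base.length(s::DataSubset) = length(s.indices)
Base.getindex(s::DataSubset, i::Int) = s.data[s.indices[i]]

data = [DataPoint(rand(10), rand(2)) for _ in 1:100]
perm = shuffle(1:100)
train = DataSubset(data, perm[1:70])
test  = DataSubset(data, perm[71:100])
```

The appeal over raw matrix views is that different wrappers (shuffled, stratified, windowed for sequential data) can share the same fixed storage.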
