tbreloff / OnlineAI.jl

Machine learning for sequential/streaming data

Splitting the data #1

Open Evizero opened 8 years ago

Evizero commented 8 years ago

We should probably exchange ideas on data sampling / splitting.

The approach I am currently using is very memory-friendly for huge datasets: shuffle the data matrix in-place and then make contiguous array views.

Let X be a 10x10000000 Array{Float64,2}

```julia
julia> @time shuffle!(X)
elapsed time: 1.202112921 seconds (80 bytes allocated)

julia> @time train = view(X, :, 1:7000000)
elapsed time: 1.7596e-5 seconds (192 bytes allocated)

julia> @time test = view(X, :, 7000001:10000000)
elapsed time: 1.3097e-5 seconds (192 bytes allocated)
```

I couldn't find a better way so far. It does have its limitations, though, when the sampling should be sensitive to the class distribution.
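One way around that limitation, sketched below under my own assumptions (labels in a vector `y` aligned with the columns of `X`; this is not code from either package): instead of shuffling `X` in-place, shuffle the *column indices* per class and split each class separately, so both splits preserve the class proportions. The views are then non-contiguous but still copy-free.

```julia
using Random

# Stratified train/test column indices: shuffle within each class,
# then take the first `train_frac` of each class for training.
function stratified_indices(y::AbstractVector, train_frac::Real)
    train_idx = Int[]
    test_idx  = Int[]
    for c in unique(y)
        idx = shuffle(findall(==(c), y))      # columns of this class, shuffled
        n_train = round(Int, train_frac * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx,  idx[n_train+1:end])
    end
    return train_idx, test_idx
end

X = rand(10, 1_000)
y = rand(1:3, 1_000)                          # hypothetical class labels
train_idx, test_idx = stratified_indices(y, 0.7)
train = view(X, :, train_idx)                 # no copy, class-balanced
test  = view(X, :, test_idx)
```

The trade-off versus the in-place shuffle is that the views index a permutation rather than a contiguous range, so iteration is slightly less cache-friendly.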

tbreloff commented 8 years ago

See the nnet/data.jl source file... I'm using the idea of fixed arrays of DataPoint objects, and wrappers that access the arrays in different ways. I haven't benchmarked it thoroughly, though.
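A minimal sketch of that "fixed array + wrappers" idea, with illustrative names only (the real types live in nnet/data.jl and may differ): the data points are stored once in a plain vector, and lightweight wrappers reorder or subset them by index without copying.

```julia
using Random

# Hypothetical data point: one input/target pair.
struct DataPoint
    x::Vector{Float64}   # input features
    y::Vector{Float64}   # target values
end

# Wrapper exposing a subset of a fixed data array in a given order,
# without copying the underlying DataPoints.
struct DataSubset
    data::Vector{DataPoint}
    indices::Vector{Int}
end
Base.length(s::DataSubset) = length(s.indices)
Base.getindex(s::DataSubset, i::Int) = s.data[s.indices[i]]

data = [DataPoint(rand(10), rand(2)) for _ in 1:100]
perm = shuffle(1:100)
train = DataSubset(data, perm[1:70])
test  = DataSubset(data, perm[71:100])
```

The appeal over raw matrix views is that different wrappers (shuffled, stratified, windowed for sequential data) can share the same fixed storage.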
