Open Evizero opened 8 years ago
See the nnet/data.jl source file... I'm using the idea of fixed arrays of DataPoint objects, and wrappers which access the arrays in different ways. I haven't benchmarked completely, though.
On Sep 4, 2015, at 3:56 PM, Christof Stocker notifications@github.com wrote:
We should probably exchange ideas on data sampling / splitting.
The approach I am currently using is very memory-friendly for huge datasets: it shuffles the data matrix in place and then takes contiguous array views.
Let X be a 10x10000000 Array{Float64,2}
julia> @time shuffle!(X)
elapsed time: 1.202112921 seconds (80 bytes allocated)
julia> @time train = view(X, :, 1:7000000)
elapsed time: 1.7596e-5 seconds (192 bytes allocated)
julia> @time test = view(X, :, 7000001:10000000)
elapsed time: 1.3097e-5 seconds (192 bytes allocated)

I couldn't find a better way so far. It does have its limitations when the sampling should be sensitive to the class distribution.
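For reference, here is a minimal sketch of the shuffle-then-view split described in the thread, plus one possible way around the class-distribution limitation. It is written against Julia 1.x, where `view` lives in Base, and is not taken from either participant's code. The names `shufflecols!`, `splitcols`, and `stratifiedindices` are hypothetical helpers made up for this illustration. One assumption worth flagging: in current Julia, as far as I know, `shuffle!` applied to a matrix permutes individual elements rather than whole columns, so the sketch shuffles columns explicitly, on the assumption that each column is one observation.

```julia
using Random

# Hypothetical helper: shuffle the columns (observations) of X in place with
# a Fisher-Yates pass. Only one column-sized buffer is allocated.
function shufflecols!(rng::AbstractRNG, X::AbstractMatrix)
    n = size(X, 2)
    tmp = similar(X, size(X, 1))
    for i in n:-1:2
        j = rand(rng, 1:i)
        if i != j
            tmp .= view(X, :, i)       # save column i
            X[:, i] .= view(X, :, j)   # overwrite column i with column j
            X[:, j] .= tmp             # put the saved column into slot j
        end
    end
    return X
end

# Hypothetical helper: split the columns of X into two contiguous views.
# No data is copied; both views share memory with X.
function splitcols(X::AbstractMatrix, at::Real)
    n = size(X, 2)
    ntrain = round(Int, at * n)
    return view(X, :, 1:ntrain), view(X, :, ntrain+1:n)
end

# Hypothetical helper for the class-sensitive case: shuffle the indices
# within each class and send the first `at` fraction of each to the training
# set. The index vectors are not contiguous, so views built from them are
# index-vector views rather than range views.
function stratifiedindices(rng::AbstractRNG, y::AbstractVector, at::Real)
    train = Int[]
    test = Int[]
    for label in unique(y)
        idx = shuffle!(rng, findall(isequal(label), y))
        k = round(Int, at * length(idx))
        append!(train, idx[1:k])
        append!(test, idx[k+1:end])
    end
    return train, test
end

# Usage on a small matrix (sizes chosen only for illustration)
rng = MersenneTwister(42)
X = rand(rng, 10, 1000)
shufflecols!(rng, X)
train, test = splitcols(X, 0.7)

y = rand(rng, ["a", "b"], 1000)
trainidx, testidx = stratifiedindices(rng, y, 0.7)
Xtrain = view(X, :, trainidx)
```

The plain split keeps the contiguous-range views from the original approach. The stratified variant gives that up in exchange for roughly preserved class proportions, and I haven't benchmarked what the non-contiguous views cost on a matrix of the size discussed above.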