rafaqz / DimensionalData.jl

Named dimensions and indexing for Julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License

Cannot write CSV from diskbased DimArray #788

Open felixcremer opened 1 month ago

felixcremer commented 1 month ago

I can't write a DiskArray-backed DimArray to CSV, but it works if I load it into memory beforehand. I am not sure whether this should rather be a DiskArrays or YAXArrays issue; I can try to reduce the example to an MWE at the end of the week. jena is a selection of a disk-based YAXArray for one single pixel. I can write the data if I first use readcubedata from YAXArrays to load the cube into memory. If I use DimTable explicitly, the writing seems to work, but it is very slow while the data is still a DiskArray.
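The two workarounds mentioned above could look roughly like the following sketch. It uses a small in-memory DimArray as a stand-in, because the original jena data is not available here; the array `A` and the temporary output path are assumptions made for illustration only, and the readcubedata lines are left commented out since they need the original disk-backed array.

```julia
using CSV
using DimensionalData

# Stand-in for the disk-backed `jena` array: a small in-memory DimArray.
A = DimArray(rand(3, 2), (X(1:3), Y(1:2)))

# Workaround 1 from the post (not run here, it needs the original data):
# jena_mem = readcubedata(jena)   # YAXArrays: load the cube into memory first
# CSV.write("examples/data/jena.csv", jena_mem)

# Workaround 2: build the table explicitly and write that.
# With a disk-backed array this reportedly works but is slow.
path = joinpath(mktempdir(), "example.csv")
CSV.write(path, DimTable(A))
```
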


julia> CSV.write("examples/data/jena.csv", jena)
ERROR: BoundsError: attempt to access Tuple{Colon} at index [2]
Stacktrace:
  [1] getindex
    @ ./tuple.jl:31 [inlined]
  [2] #79
    @ ./range.jl:429 [inlined]
  [3] ntuple
    @ ./ntuple.jl:19 [inlined]
  [4] getindex
    @ ./range.jl:429 [inlined]
  [5] view(a::DiskArrayTools.DiskArrayStack{Union{…}, 2, DiskArrays.SubDiskArray{…}, 1}, i::Function)
    @ DiskArrayTools ~/.julia/packages/DiskArrayTools/141OI/src/DiskArrayTools.jl:67
  [6] vec(a::DiskArrayTools.DiskArrayStack{Union{…}, 2, DiskArrays.SubDiskArray{…}, 1})
    @ DiskArrays ~/.julia/packages/DiskArrays/6JA8Z/src/subarray.jl:52
  [7] vec(A::DimMatrix{Union{…}, Tuple{…}, Tuple{}, DiskArrayTools.DiskArrayStack{…}, Symbol, Dict{…}})
    @ DimensionalData ~/.julia/packages/DimensionalData/RxCda/src/array/array.jl:110
  [8] map
    @ ./tuple.jl:291 [inlined]
  [9] map(::Function, ::@NamedTuple{value::DimMatrix{…}})
    @ Base ./namedtuple.jl:266
 [10] DimTable(s::DimStack{…}; mergedims::Nothing)
    @ DimensionalData ~/.julia/packages/DimensionalData/RxCda/src/tables.jl:106
 [11] DimTable(x::YAXArray{…}; layersfrom::Nothing, mergedims::Nothing)
    @ DimensionalData ~/.julia/packages/DimensionalData/RxCda/src/tables.jl:144
 [12] DimTable(x::YAXArray{Union{…}, 2, DiskArrayTools.DiskArrayStack{…}, Tuple{…}, Dict{…}})
    @ DimensionalData ~/.julia/packages/DimensionalData/RxCda/src/tables.jl:131
 [13] columns(x::YAXArray{Union{…}, 2, DiskArrayTools.DiskArrayStack{…}, Tuple{…}, Dict{…}})
    @ DimensionalData ~/.julia/packages/DimensionalData/RxCda/src/tables.jl:14
 [14] _rows(x::YAXArray{Union{…}, 2, DiskArrayTools.DiskArrayStack{…}, Tuple{…}, Dict{…}})
    @ Tables ~/.julia/packages/Tables/8p03y/src/fallbacks.jl:93
 [15] rows(m::YAXArray{Union{…}, 2, DiskArrayTools.DiskArrayStack{…}, Tuple{…}, Dict{…}})
    @ Tables ~/.julia/packages/Tables/8p03y/src/matrix.jl:5
 [16] write(file::String, itr::YAXArray{…}; append::Bool, compress::Bool, writeheader::Nothing, partition::Bool, kw::@Kwargs{})
    @ CSV ~/.julia/packages/CSV/cwX2w/src/write.jl:197
 [17] write(file::String, itr::YAXArray{Union{…}, 2, DiskArrayTools.DiskArrayStack{…}, Tuple{…}, Dict{…}})
    @ CSV ~/.julia/packages/CSV/cwX2w/src/write.jl:162
 [18] top-level scope
    @ REPL[361]:1
Some type information was truncated. Use `show(err)` to see complete types.

rafaqz commented 2 weeks ago

You probably need to call modify(cache, dimarray) first so that the chunk caching in DiskArrays is used. That should help a bit. Beyond that, I'm not sure how we can get Tables.jl sources to read in chunk order, since a CSV has to be written sequentially. You could also use RechunkedDiskArray to force rows to be contiguous. We should put this in the DimensionalDataDiskArraysExt whenever we add it.

Tables.jl doesn't use iteration, it uses indexing, so the iterate optimisations don't help.
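
The point about indexing order versus chunk order can be illustrated with a toy model in plain Julia. This is not how DiskArrays is implemented; `chunk_id` and `count_loads` are hypothetical helpers that only count how many times a chunk would have to be loaded for a given access order, with a small least-recently-used cache of `ncached` chunks.

```julia
# Toy model: a 4x4 array stored in chunks of size (1, 4), i.e. one chunk per row.
const CHUNK = (1, 4)

# Which chunk does element (i, j) live in?
chunk_id(i, j) = (cld(i, CHUNK[1]), cld(j, CHUNK[2]))

# Count chunk loads for an access order, keeping up to `ncached` chunks (LRU).
function count_loads(order; ncached=0)
    cache = Tuple{Int,Int}[]
    loads = 0
    for (i, j) in order
        id = chunk_id(i, j)
        pos = findfirst(==(id), cache)
        if pos === nothing
            loads += 1                      # chunk not cached: load it
            if ncached > 0
                length(cache) == ncached && popfirst!(cache)
                push!(cache, id)
            end
        else
            deleteat!(cache, pos)           # cache hit: mark as most recently used
            push!(cache, id)
        end
    end
    return loads
end

# Column-major element order, as a flattened table column would be indexed.
colmajor = [(i, j) for j in 1:4 for i in 1:4]
# Order that follows the chunks (row by row).
chunkorder = [(i, j) for i in 1:4 for j in 1:4]

no_cache    = count_loads(colmajor)                # every access loads a chunk
small_cache = count_loads(colmajor; ncached=1)     # cache too small to help
full_cache  = count_loads(colmajor; ncached=4)     # each chunk loaded once
aligned     = count_loads(chunkorder; ncached=1)   # access order matches chunks
```

In this model a cache only helps once it can hold every chunk touched by one pass over a column, whereas making the access order match the chunk layout (the idea behind rechunking so that rows are contiguous) needs only a single cached chunk.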