JoshuaBillson opened this issue 1 year ago
Can you just use a `Raster`? Then you are indexing a single array.

Stacks with that many layers don't make sense; the tuple is too big for the compiler. I am contemplating adding a different kind of stack that uses a vector, but it won't be type stable for mixed layer types, and will still be much slower than a `Raster`.

You really only need a `RasterStack` when you have bands of different types, like in a NetCDF, or with multiple separate tif files.
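For context, a multiband file can be read as a single `Raster`, so per-pixel access is plain array indexing (the filename here is a placeholder):

```julia
using Rasters

# A multiband file loads as one 3-d array with a Band dimension
# (the filename is a placeholder)
r = Raster("hyperspectral.tif")

# The spectral signature at one pixel is a 1-d slice over Band
sig = r[X(100), Y(100)]
```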
I cannot use a `Raster` because I need to extract the spectral signatures and corresponding land cover classes into a `DataFrame`. That is, the `RasterStack` contains both remotely sensed imagery and land cover data, which needs to be harmonized for further analysis.

Interestingly, I've found that sinking the `RasterStack` directly into a `DataFrame` via the Tables.jl interface seems to solve the problem (it was taking around 30 seconds and 14 GB of allocations to iterate over the data manually).
```julia
julia> @btime desis.stack |> replace_missing |> DataFrame |> dropmissing
  721.353 ms (13647 allocations: 3.03 GiB)
1086341×237 DataFrame
     Row │ X Y Band_1 Band_2 Band_3 Band_4 Band_5 Band_6 Band_7 Band_8 Band_9 Band_10 Band_11 Band_12 Band_13 Band_14 ⋯
         │ Float64 Float64 Int16 Int16 Int16 Int16 Int16 Int16 Int16 Int16 Int16 Int16 Int16 Int16 Int16 Int16 ⋯
─────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       1 │ 705630.0 5.75541e6 0 0 0 0 0 0 8 68 151 104 171 212 224 219 ⋯
       2 │ 705660.0 5.75541e6 0 0 0 0 0 0 12 43 131 175 191 222 265 272
       3 │ 705690.0 5.75541e6 0 0 0 0 0 0 23 66 166 210 190 209 288 254
       4 │ 705720.0 5.75541e6 0 0 0 0 0 0 17 136 180 210 205 237 247 254
       5 │ 705750.0 5.75541e6 0 0 0 0 0 30 39 128 189 198 196 197 305 256 ⋯
       6 │ 705780.0 5.75541e6 0 0 0 0 0 0 52 87 154 201 195 225 280 282
       7 │ 705810.0 5.75541e6 0 0 0 0 0 12 61 131 166 184 200 228 285 276
       8 │ 705840.0 5.75541e6 0 0 0 0 0 0 33 79 143 212 196 226 303 256
       9 │ 705870.0 5.75541e6 0 0 0 0 0 0 15 91 174 185 182 201 291 277 ⋯
    ⋮    │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
 1086334 │ 679890.0 5.72187e6 0 0 0 0 0 38 240 274 297 275 271 274 249 286
 1086335 │ 679920.0 5.72187e6 0 0 0 0 0 74 252 249 269 286 284 254 268 254
 1086336 │ 679950.0 5.72187e6 0 0 0 0 0 159 247 239 249 272 267 237 282 270
 1086337 │ 679980.0 5.72187e6 0 0 0 0 0 117 170 220 279 266 244 262 296 276 ⋯
 1086338 │ 680010.0 5.72187e6 0 0 0 0 0 111 139 195 260 211 201 219 269 215
 1086339 │ 680040.0 5.72187e6 0 0 0 0 0 43 149 216 234 204 174 196 263 199
 1086340 │ 679320.0 5.72184e6 0 0 0 0 0 81 129 171 246 265 259 200 233 225
 1086341 │ 679350.0 5.72184e6 0 0 0 0 0 153 228 241 238 246 213 234 259 252 ⋯
                                                                                   221 columns and 1086324 rows omitted
```
However, this behaviour does not seem to be documented anywhere in the official docs. Indeed, the only reason I know about this functionality is because I happened to stumble across this comment on the Julia Discourse. I feel that this feature is extremely useful for conducting further analysis of rasterized data and that it would be a good idea to advertise it more publicly. Perhaps we could add something to the docs?
Ok, that makes sense. I guess I need to handle these many-layered stack use cases. The problem is the current design is extremely fast for small stacks, and generalising to large stacks will break that. So we will need a different stack type in DimensionalData.jl that you can choose to use for these workloads.

The Tables.jl interface is just kind of assumed; it could be better described.

If you would like to add documentation that would be very helpful; I don't have much time, and what I have mostly goes to bug fixes and feature requests.
Yes, I agree that breaking optimization for small stacks, which likely represent the majority of use cases, would be a bad idea.
I can think of two ways forward for this issue:

1. A new kind of `RasterStack` designed for many layers (`FatStack`?).
2. A wrapper around a standard `RasterStack`.

One trick that I found was to break large stacks up into smaller chunks, process each chunk separately, then merge the results. I'm guessing that this avoids the issue with large `NamedTuple`s.
Here's an example of a naive implementation that runs in 143.159 ms and produces 490833 allocations at 658.48 MiB.

```julia
# Extract the spectral signatures intersecting one geometry from the full stack
function _extract_signatures(stack, shp, row)
    return extract(stack, shp[row, :geometry]) |> collect
end
```
And this is an optimized example, which runs in 19.426 ms and produces 13608 allocations at 73.26 MiB.

```julia
function _extract_signatures(stack, shp, row)
    # Large stacks
    if length(names(stack)) > 25
        # Partition stack layers into chunks of 25
        stacks = [RasterStack([stack[c] for c in part]...) for part in Iterators.partition(names(stack), 25)]
        # Extract signatures from each chunk
        ops = [_extract_signatures(stack, shp, row) for stack in stacks]
        # Merge results
        return map(x -> reduce(merge, x), zip(ops...))
    # Small stacks
    else
        return extract(stack, shp[row, :geometry]) |> collect
    end
end
```
The following method generalizes the above logic:

```julia
split_op_merge(stack, x -> map(a -> reduce(merge, a), zip(x...))) do x
    return extract(x, shp[row, :geometry]) |> collect
end
```
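For reference, a minimal sketch of one way `split_op_merge` could be implemented (the `chunksize` keyword and the implementation details are my own assumptions, not an existing Rasters.jl API):

```julia
# Sketch: apply `op` to a stack in chunks of at most `chunksize` layers,
# then combine the per-chunk results with `combine`.
function split_op_merge(op, stack, combine; chunksize=25)
    # Small stacks keep the fast, type-stable path
    length(names(stack)) <= chunksize && return op(stack)
    # Split the layers into smaller sub-stacks
    parts = Iterators.partition(names(stack), chunksize)
    stacks = [RasterStack([stack[c] for c in part]...) for part in parts]
    # Apply `op` per chunk, then merge the per-chunk results
    return combine([op(s) for s in stacks])
end
```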
Yes totally, those are the two options.

I think there are more use cases for 2. in DimensionalData.jl, as it can also reduce compile time.

An idea I have is to make an `MDimStack` (where M stands for mutable), where you can modify both layers and dimensions, and indexing is moderately fast in all cases, but of course much slower for small stacks than `DimStack`/`RasterStack`. It's like DataFrames.jl vs TypedTables.jl, where currently `DimStack`/`RasterStack` are more like TypedTables.jl, and `MDimStack` would be like DataFrames.jl.

We would also make it easy to switch between them.
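A rough sketch of the idea (every name and field here is hypothetical; nothing like this exists in DimensionalData.jl yet):

```julia
# Hypothetical: layers stored in a Vector rather than a NamedTuple, so the
# compiler never sees a giant tuple type and layers can be added or removed
# in place. The cost is that layer access is no longer type stable.
struct MDimStack{D}
    keys::Vector{Symbol}
    layers::Vector{Any}
    dims::D
end

# Dynamic layer lookup: O(n) and type-unstable, but compile time stays flat
Base.getindex(s::MDimStack, k::Symbol) = s.layers[findfirst(==(k), s.keys)]

# Mutation is allowed, unlike a NamedTuple-backed DimStack
function Base.setindex!(s::MDimStack, layer, k::Symbol)
    i = findfirst(==(k), s.keys)
    i === nothing ? (push!(s.keys, k); push!(s.layers, layer)) : (s.layers[i] = layer)
    return s
end
```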
That sounds like the best idea. Do you think it's best to make users explicitly choose between the two, or are we planning on providing a common interface where one of `MDimStack` or `RasterStack` is automatically chosen behind the scenes? In the second case, I imagine something like a `RasterStack` constructor that uses an `MDimStack` for its underlying operations if the number of layers exceeds some threshold. Otherwise, it just uses the default `RasterStack` as its data representation.
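Building on the hypothetical `MDimStack` sketch above, the automatic choice could be as simple as a threshold check in a constructor function (again, purely illustrative):

```julia
# Purely illustrative: pick the backing store based on layer count
function build_stack(layers::NamedTuple; threshold=25)
    if length(layers) > threshold
        # Vector-backed: slower access, but no giant tuple for the compiler
        # (assumes all layers share the same dims)
        return MDimStack(collect(keys(layers)), collect(Any, values(layers)), dims(first(layers)))
    else
        # NamedTuple-backed: fast and type stable for small stacks
        return RasterStack(layers)
    end
end
```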
I'm not sure really, and "planning" might mean after my PhD in a year unfortunately, unless someone else writes it...
I noticed some performance issues while trying to iterate over a hyperspectral image consisting of 235 bands. Upon further analysis, it seems that `Base.getindex` is type unstable for `RasterStack`s with a large number of layers.

An example `RasterStack` with only 20 bands produces no warnings, while an example `RasterStack` with 100 bands produces a type warning; a sketch reproducing both cases follows below. In particular, the issue seems to be caused by the type of `%9`, which is missing the `NTuple{100, Int16}` at the end. The result is that the first example consists of only a single allocation of 48 bytes, while the second is 62 allocations at 3.77 KiB according to `@btime`.
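A minimal sketch that reproduces the comparison (the grid size, layer names, and random `Int16` data here are stand-ins chosen for illustration, not the original test case):

```julia
using Rasters

# Build a RasterStack of `n` random Int16 layers over the same grid
function make_stack(n)
    griddims = (X(1:100), Y(1:100))
    layernames = Tuple(Symbol.(:Band_, 1:n))
    layers = NamedTuple{layernames}(ntuple(_ -> Raster(rand(Int16, 100, 100), griddims), n))
    return RasterStack(layers)
end

small = make_stack(20)   # getindex is type stable, no warnings
large = make_stack(100)  # getindex shows a type warning

@code_warntype small[10, 10]
@code_warntype large[10, 10]
```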