meggart / DiskArrays.jl

Other
72 stars 13 forks source link

`zip` triggers lots of invalidations #175

Open charleskawczynski opened 2 months ago

charleskawczynski commented 2 months ago

We have a workload over at ClimaAtmos.jl, where we use SnoopCompile's invalidations script, and the result is:

[ Info: 4296 methods invalidated for 178 functions (showing 10 functions)
┌─────────────────────────────────────────────────────────┬───────────────┬───────────────┬─────────────────┐
│ <file name>:<line number>                               │ Function Name │ Invalidations │ Invalidations % │
│                                                         │               │               │     (xᵢ/∑x)     │
├─────────────────────────────────────────────────────────┼───────────────┼───────────────┼─────────────────┤
│ DiskArrays/bZBJE/src/zip.jl:83                          │      zip      │     3247      │       14        │
│ DiskArrays/bZBJE/src/zip.jl:70                          │      zip      │     3247      │       14        │
│ DiskArrays/bZBJE/src/zip.jl:83                          │      zip      │     3247      │       14        │
│ FillArrays/eOEVm/src/fillbroadcast.jl:78                │  broadcasted  │     1220      │        5        │
│ CommonDataModel/G3moc/src/groupby.jl:267                │  broadcasted  │     1220      │        5        │
│ ClimaCore/ANgUC/src/MatrixFields/field_name_dict.jl:331 │  broadcasted  │     1220      │        5        │
│ ClimaCore/ANgUC/src/MatrixFields/field_name_dict.jl:475 │  broadcasted  │     1220      │        5        │
│ ClimaCore/ANgUC/src/MatrixFields/lazy_operators.jl:44   │  broadcasted  │     1220      │        5        │
│ GeometryBasics/ebXl0/src/offsetintegers.jl:40           │    convert    │      834      │        4        │
│ DiskArrays/bZBJE/src/generator.jl:53                    │   Generator   │      439      │        2        │
└─────────────────────────────────────────────────────────┴───────────────┴───────────────┴─────────────────┘

So, DiskArrays' zip seems to invalidate 9741 methods.

rafaqz commented 2 months ago

Ugh damn. We have a plan to fix zip in a more fundamental way but it will be a while off.

Is there anything specific about our definition that is leading to invalidations?

(Generators are also not great it seems)

charleskawczynski commented 2 months ago

It kind of looks like https://github.com/meggart/DiskArrays.jl/blob/64f4de86ec3ab921759d7372c19766a463c58e61/src/zip.jl#L83 is just clashing with https://github.com/meggart/DiskArrays.jl/blob/64f4de86ec3ab921759d7372c19766a463c58e61/src/zip.jl#L70. Shouldn't you just be able to delete https://github.com/meggart/DiskArrays.jl/blob/64f4de86ec3ab921759d7372c19766a463c58e61/src/zip.jl#L83?

rafaqz commented 2 months ago

It would be good to remove those, but NCDatasets.jl uses the macros to avoid inheriting AbstractVariable < : AbstractDiskArray

We (myself and @meggart) would very much like to remove those macros from the package but @Alexander-Barth may not.

charleskawczynski commented 2 months ago

Maybe this is a question for @Alexander-Barth, but why would a type that doesn't subtype AbstractDiskArray have zip return DiskZip?

rafaqz commented 2 months ago

DiskArrays provides a macro system to implement everything without explicit inheritance - the idea was that some packages wouldn't use the interface otherwise.

CommonDataModel.AbstractVariable does a lot of work in CommonDataModel.jl that doesn't always need disk array handling. So packages apply the macros to some concrete Variable types rather than making AbstractVariable <: AbstractDiskArray.

That causes a lot of issues, like this one. I'm hoping over time we can identify the reasons AbstractVariable doesn't need disk behaviors and add traits to DiskArrays to remove any performance loss for those types, and eventually switch everything over to just being <: AbstractDiskArray. But I'm not sure who could do that work when.

There is some discussion here: https://github.com/JuliaGeo/CommonDataModel.jl/issues/8

Alexander-Barth commented 2 months ago

I would not be opposed to remove the zip macro, but I am not sure to what extent changes are necessary on the CommonDataModel/NCDatasets side.

As we have seen Zarr and NetCDF require sometimes different trade-offs. Some dataset are also entirely in memory or "virtual" (in the sense that values are computed based on indices, such as longitude/latitude arrays based on the projection). As long as I would be able to provide more specific routines (without ambiguities) in case when an alternative implementations is more performant, then I can also subtype directly from AbstractDiskArray.

rafaqz commented 2 months ago

You can always provide more specific methods than AbstractDiskArray for specific variable types. There may be a few boilerplate methods needed to avoid ambiguity, yes.

What is required is a DiskArrays dependency in CommonDataModel.jl and defining AbstractVariable <: AbstrackDiskArray. Then probably default fallback methods to forward haschunks and eachunk to the parent array.

readblock! and writeblock! Would need to be defined for CFVariable and maybe some other wrapper arrays types.

Then you can delete the macros wherever they are applied.

(This will mean broadcasts, reductions and other nice things that are currently broken will work for CFDiskArray)