mthh / contour-rs

Contour polygon creation in Rust (using marching squares algorithm)
https://crates.io/crates/contour
Apache License 2.0
50 stars 10 forks source link

[FR] Sparse representation of input #17

Open netthier opened 2 months ago

netthier commented 2 months ago

I have a dataset where the vast majority of values are nodata values (https://github.com/mthh/contour-rs/issues/16). The bounding box is substantially larger than what would be required if the data were e.g. tiled or stored in some other compressed format. This prevents me from doing calculations at the precision I'd like to, as I'm running out of memory. I think it would be useful if contour-rs accepted grids of formats other than Vec. Maybe making the methods generic over some trait that provides fn get_value(x: usize, y: usize) -> Option<V> could work? Such a trait could also help solving https://github.com/mthh/contour-rs/issues/16, since an implementation could return None for such values.

Apparently there are also optimized variants of marching squares for sparse matrices, maybe it would be useful to choose a type of input that would allow to take advantage of them, e.g. https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1675180&dswid=-8922

mthh commented 2 months ago

It's a good idea, let me think about it for a moment and discuss the solutions available to us here.

netthier commented 3 weeks ago

I have started implementing this as I need this feature: https://github.com/SenseLabsDE/contour-rs/tree/grid What I did was add a trait like this to abstract over containers:

pub trait Grid<V> where V: GridValue {
    fn extents(&self) -> impl IntoIterator<Item = Extent>;
    fn size(&self) -> (usize, usize);
    fn get_point(&self, coord: Coord<i64>) -> Option<V>;
}

pub struct Extent {
    pub top_left: Coord<i64>,
    pub bottom_right: Coord<i64>,
}

and new container types implementing said trait. One of them is TiledBuffer which implements my use case, and using it speeds up the contour calculations for my test dataset (same one I sent you in the past I think) from 20s to 3s (for 6 contour levels and a tile size of 256).

I haven't really tested performance for regular rectangular datasets though. I've now fixed the benchmarks, and the performance is about the same or slightly worse as in https://github.com/mthh/contour-rs/pull/14

I've also done a lot of opinionated changes on that branch, which may need to be undone if you want to merge this, like removing the Float type. My reasoning there was that floating point precision errors are insidious and can lead to wrong results, and I'm not currently aware of a nice way of preventing users from accidentally introducing them (there probably is some magic maximum size where these errors start occurring with f32, but determining it with 100% certainty seems difficult). Always using the highest precision fixes that. I'm not sure what originally motivated f32 support, but if it was memory usage due to the lack of generic grid values, then that isn't an issue anymore anyways.

mthh commented 2 weeks ago

Thanks for proposing to contribute this new feature, it sounds like a very useful addition.

And I'm on your side regarding the deletion of the Float type to avoid the bugs encountered previously and given that in your contribution we can now use f32 type data as input since GridValue is generic on different numeric types.

I'm not sure what originally motivated f32 support, but if it was memory usage due to the lack of generic grid values, then that isn't an issue anymore anyways.

@hakolao Would such a generic implementation on the type of numerical values used in the input grid (thus allowing the use of grids with f64 or f32 values), but whose returned geometries have their coordinates in f64, correspond to your use case / to the use case for which you had contributed the f32 feature ?

I've now fixed the benchmarks, and the performance is about the same or slightly worse as in https://github.com/mthh/contour-rs/pull/14

I've also done a lot of opinionated changes on that branch, which may need to be undone if you want to merge this, like removing the Float type.

I'll take a closer look at the code and at the benchmarks (current implementation vs. that of #14 vs. that of your grid branch) by the end of the week but I'd be happy to see this proposal merged, as it seems to provide a significant gain for some use cases (the dataset you were working on seems to me to be a case that can be encountered relatively often). In addition, having better management of no data values is certainly a good thing.

netthier commented 2 weeks ago

My grid branch probably won't be ready by the end of the week, there are still some issues I'm figuring out

  1. I'm currently just skipping any square containing at least one nodata value, which is a bit hacky. I'm planning to implement GDAL's algorithm for extending lines properly to the edges. (If I can figure out how exactly it works that is...) For reference, edges currently look like this: image

  2. There is an off-by-one error with the way extents are currently handled. For the contour lines to extend all the way to the edge, the extent needs to add one row/column on each side so that those squares get computed as well. But in the case of the TiledBuffer, doing this without the extents overlapping is hard, and I'm still figuring it out. At the moment, calculating a square twice seems to result in incorrect output, so this should be fixed. My current plan of solving this is to not actually extend the extents, but to add extra extents for those border parts.

And additional benchmarks with criterion have shown a bit of a stronger performance decline compared to #14.

EDIT: I implemented my solution for 2., but haven't tested it for correctness yet. It looks correct on a surface level though.