goat-community / goat

This is the home of the Geo Open Accessibility Tool (GOAT)
GNU General Public License v3.0

Refactor read_heatmap to make it faster #1894

Closed · metemaddar closed this issue 1 year ago

metemaddar commented 1 year ago

At the moment, read_heatmap takes about 16 seconds for resolution 6 and about 21 seconds for resolution 9. We can make it faster by:

  1. Make read_opportunity_matrix() multithreaded. In this function we have about 18 read calls, and they can be done together in multiple threads (see the sketch after this list).
  2. Read bulk_ids from the cached H3 files of resolution 6. This lets us get rid of the additional resolutions, which also contain hexagons at resolution 10, and it reduces the calculation/sorting/masking time.
  3. Do the data matching (matching against the resolution the user requests) in Numba. At the moment this matching takes a lot of time.
  4. For data matching we also run h3_parent(), which cannot be converted to Numba. So first we need to vectorize this function and then match the results in Numba.
  5. Before passing grid_ids to h3_parent(), we need to convert the grids from integers to strings. This can be done in Numba using Python's int() function.
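
A minimal sketch of the multithreaded read from item 1, assuming the per-resolution matrices sit in cache files that NumPy can load; read_one_matrix and the path list are hypothetical stand-ins for the real GOAT readers.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def read_one_matrix(matrix_path: str) -> np.ndarray:
    # The reads are I/O bound, so threads can overlap the waits on disk.
    return np.load(matrix_path, allow_pickle=True)


def read_opportunity_matrices(matrix_paths: list[str]) -> list[np.ndarray]:
    # Issue the ~18 read calls together instead of one after another.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(read_one_matrix, matrix_paths))
```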

We could also refactor to avoid using dictionaries for categorizing data and instead use masking over the read data. But this doesn't make the program faster, as we don't have many dictionary keys.
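
For illustration, a toy comparison of the two styles with made-up arrays: grouping the read data into a dictionary keyed by category versus selecting each category with a boolean mask only when it is needed.

```python
import numpy as np

# Made-up example data: a category id and a travel cost per read row.
categories = np.array([1, 2, 1, 3, 2])
travel_costs = np.array([5, 7, 3, 9, 4])

# Dictionary-based categorization (current approach).
by_category = {c: travel_costs[categories == c] for c in np.unique(categories)}

# Mask-based alternative: no dictionary, select a category on demand.
costs_for_category_2 = travel_costs[categories == 2]
```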

metemaddar commented 1 year ago
  1. For resolution 10, we should skip h3_parent, as the read data is already at resolution 10.
  2. After running h3_parent, the results are strings (hexadecimal); we need to convert them back to integers to match the final array. This step should run in Numba using the int() function.
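
A rough sketch of that round trip, assuming the h3-py v3 string API (geo_to_h3, h3_to_parent); the array names are illustrative only.

```python
import h3  # h3-py v3 string API assumed
import numpy as np

# A valid resolution-10 cell, generated from a lat/lng and stored as uint64.
cell = h3.geo_to_h3(48.13, 11.58, 10)
grid_ids = np.array([int(cell, 16)], dtype=np.uint64)

# Integer -> hexadecimal string, which the string API expects.
grid_strings = [hex(g)[2:] for g in grid_ids]

# Parent at the requested resolution, still a hexadecimal string.
parent_strings = [h3.h3_to_parent(s, 6) for s in grid_strings]

# Hexadecimal string -> integer again, to match the final uint64 arrays.
parent_ids = np.array([int(s, 16) for s in parent_strings], dtype=np.uint64)
```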
metemaddar commented 1 year ago

For 3, 4, 5 and 7, we could use h3._cy.parent to convert to the integer parent directly (diagram: Untitled Diagram.drawio). However, we still had Python for loops, which were slow. Using Cython, we could also speed up those loops.
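
A sketch of that integer-only path, assuming h3._cy.parent(h, res) accepts and returns uint64 indexes as described above (it is a private h3-py binding, so the exact signature may differ between versions). The plain Python loop is the part that stayed slow; compiling it is what the Cython step addresses.

```python
import numpy as np
from h3 import _cy as h3_cy  # private Cython bindings of h3-py (assumption, see above)


def tag_parents(grid_ids: np.ndarray, res: int) -> np.ndarray:
    # Convert each uint64 grid id to its integer parent without going through strings.
    parents = np.empty(grid_ids.shape[0], dtype=np.uint64)
    for i in range(grid_ids.shape[0]):
        parents[i] = h3_cy.parent(grid_ids[i], res)
    return parents
```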

metemaddar commented 1 year ago

By reading the bulk_ids from the hexagon files, we could drop the data outside of the study area, which reduced the time of sort and unique (and improved other functions too). Also, using Cython we got a large improvement in reordering the data (a rough sketch of the filtering follows the table):

| function | before Cython | after Cython |
| --- | --- | --- |
| Reading matrices | 751 ms | 751 ms |
| sort_and_unique | 1.78 s | 1.55 s |
| do_calculations | 60 ms | 40 ms |
| read_hexagons | ~0 | ~0 |
| tag_uniques_by_parent | 10 s | 0.6 s |
| create_grids_unordered_map | 0 | 0 |
| create_grid_pointers | 2.57 s | 0.4 s |
| create_calculation_arrays | 13 ms | 45 ms |
| create_quantile_arrays | 26 ms | 26 ms |
| generate_final_geojson | 644 ms | 860 ms |
| All | 16.18 s | 4.29 s |
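
As mentioned before the table, the study-area filtering can be sketched roughly like this; all array and argument names here are hypothetical.

```python
import numpy as np


def filter_to_study_area(grid_ids, travel_times, grid_parents, bulk_ids):
    # Keep only the rows whose resolution-6 parent is one of the study area's
    # bulk_ids; the smaller arrays make sort/unique and the later steps cheaper.
    inside = np.isin(grid_parents, bulk_ids)
    return grid_ids[inside], travel_times[inside]
```
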
metemaddar commented 1 year ago

We have a bottleneck at sort_and_unique, which it seems we cannot improve unless we sort the data before caching. For this approach, we need to sort and save the data per study area. This would also improve the Reading matrices step, which takes around 0.7 seconds. So we could reduce the overall time by ~1.5 seconds.
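
A minimal sketch of the pre-sorting idea, with hypothetical cache paths: sort once when the per-study-area cache is written, so read_heatmap can later load an already ordered matrix and skip sort_and_unique.

```python
import numpy as np


def cache_sorted_matrix(grid_ids: np.ndarray, travel_times: np.ndarray, out_path: str) -> None:
    # Sort by grid id once, at caching time, and store the ordered arrays together.
    order = np.argsort(grid_ids, kind="stable")
    np.savez(out_path, grid_ids=grid_ids[order], travel_times=travel_times[order])
```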

EPajares commented 1 year ago

As discussed, this is something we can refactor in the future.

p4b-bro[bot] commented 11 months ago

This task/issue was closed on Tue Jun 06 2023 ✅