goat-community / goat

This is the home of the Geo Open Accessibility Tool (GOAT)
GNU General Public License v3.0

Refactor read_heatmap to make it faster #1894

Closed · metemaddar closed this issue 1 year ago

metemaddar commented 1 year ago

At the moment, read_heatmap takes about 16 seconds for resolution 6 and about 21 seconds for resolution 9. We can make it faster by:

  1. Make read_opportunity_matrix() multithreaded. In this function we have about 18 read calls, and they can be done together in multiple threads (see the sketch after this list).
  2. Read bulk_ids from the cached H3 files of resolution 6. This lets us get rid of the additional resolutions, which also contain hexagons at resolution 10, and it reduces the calculation/sorting/masking time.
  3. Do the data matching (matching against the resolution the user requests) in Numba. At the moment this matching takes a lot of time.
  4. For data matching we also run h3_parent(), which cannot be converted to Numba. So first we need to vectorize this function and then match the results in Numba.
  5. Before passing grid_ids to h3_parent(), we need to convert the grids from integers to strings. This can be done in Numba using Python's int() function.
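
A minimal sketch of the multithreaded read from item 1, assuming the per-resolution matrices sit in cache files that NumPy can load; read_one_matrix and the path list are hypothetical stand-ins for the real GOAT readers.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def read_one_matrix(matrix_path: str) -> np.ndarray:
    # The reads are I/O bound, so threads can overlap the waits on disk.
    return np.load(matrix_path, allow_pickle=True)


def read_opportunity_matrices(matrix_paths: list[str]) -> list[np.ndarray]:
    # Issue the ~18 read calls together instead of one after another.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(read_one_matrix, matrix_paths))
```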

We could also refactor to avoid using dictionaries for categorizing data and instead use masking over the read data. But this doesn't make the program faster, as we don't have many dictionary keys.
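
For illustration, a toy comparison of the two styles with made-up arrays: grouping the read data into a dictionary keyed by category versus selecting each category with a boolean mask only when it is needed.

```python
import numpy as np

# Made-up example data: a category id and a travel cost per read row.
categories = np.array([1, 2, 1, 3, 2])
travel_costs = np.array([5, 7, 3, 9, 4])

# Dictionary-based categorization (current approach).
by_category = {c: travel_costs[categories == c] for c in np.unique(categories)}

# Mask-based alternative: no dictionary, select a category on demand.
costs_for_category_2 = travel_costs[categories == 2]
```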

metemaddar commented 1 year ago
  1. For resolution 10, we should skip h3_parent, as the read data is already at resolution 10.
  2. After running h3_parent, the results are strings (hexadecimal); we need to convert them back to integers to match the final array. This step should run in Numba using the int() function.
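
A rough sketch of that round trip, assuming the h3-py v3 string API (geo_to_h3, h3_to_parent); the array names are illustrative only.

```python
import h3  # h3-py v3 string API assumed
import numpy as np

# A valid resolution-10 cell, generated from a lat/lng and stored as uint64.
cell = h3.geo_to_h3(48.13, 11.58, 10)
grid_ids = np.array([int(cell, 16)], dtype=np.uint64)

# Integer -> hexadecimal string, which the string API expects.
grid_strings = [hex(g)[2:] for g in grid_ids]

# Parent at the requested resolution, still a hexadecimal string.
parent_strings = [h3.h3_to_parent(s, 6) for s in grid_strings]

# Hexadecimal string -> integer again, to match the final uint64 arrays.
parent_ids = np.array([int(s, 16) for s in parent_strings], dtype=np.uint64)
```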
metemaddar commented 1 year ago

For 3, 4, 5 and 7, we could use h3._cy.parent to convert to the integer parent directly (diagram: Untitled Diagram.drawio). However, we still had Python for loops, which were slow. Using Cython, we could also speed up those loops.
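
A sketch of that integer-only path, assuming h3._cy.parent(h, res) accepts and returns uint64 indexes as described above (it is a private h3-py binding, so the exact signature may differ between versions). The plain Python loop is the part that stayed slow; compiling it is what the Cython step addresses.

```python
import numpy as np
from h3 import _cy as h3_cy  # private Cython bindings of h3-py (assumption, see above)


def tag_parents(grid_ids: np.ndarray, res: int) -> np.ndarray:
    # Convert each uint64 grid id to its integer parent without going through strings.
    parents = np.empty(grid_ids.shape[0], dtype=np.uint64)
    for i in range(grid_ids.shape[0]):
        parents[i] = h3_cy.parent(grid_ids[i], res)
    return parents
```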

metemaddar commented 1 year ago

By reading the bulk_ids from the hexagon files, we could drop the data outside of the study area, which reduced the time of sort and unique (and improved other functions too). Also, using Cython we got a large improvement in reordering the data (a rough sketch of the filtering follows the table):

| function | before Cython | after Cython |
| --- | --- | --- |
| Reading matrices | 751 ms | 751 ms |
| sort_and_unique | 1.78 s | 1.55 s |
| do_calculations | 60 ms | 40 ms |
| read_hexagons | ~0 | ~0 |
| tag_uniques_by_parent | 10 s | 0.6 s |
| create_grids_unordered_map | 0 | 0 |
| create_grid_pointers | 2.57 s | 0.4 s |
| create_calculation_arrays | 13 ms | 45 ms |
| create_quantile_arrays | 26 ms | 26 ms |
| generate_final_geojson | 644 ms | 860 ms |
| All | 16.18 s | 4.29 s |
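
As mentioned before the table, the study-area filtering can be sketched roughly like this; all array and argument names here are hypothetical.

```python
import numpy as np


def filter_to_study_area(grid_ids, travel_times, grid_parents, bulk_ids):
    # Keep only the rows whose resolution-6 parent is one of the study area's
    # bulk_ids; the smaller arrays make sort/unique and the later steps cheaper.
    inside = np.isin(grid_parents, bulk_ids)
    return grid_ids[inside], travel_times[inside]
```
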
metemaddar commented 1 year ago

We have a bottleneck at sort_and_unique, which it seems we cannot improve unless we sort the data before caching. For this approach, we need to sort and save the data per study area. This would also improve the Reading matrices step, which takes around 0.7 seconds. So we could reduce the overall time by ~1.5 seconds.
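
A minimal sketch of the pre-sorting idea, with hypothetical cache paths: sort once when the per-study-area cache is written, so read_heatmap can later load an already ordered matrix and skip sort_and_unique.

```python
import numpy as np


def cache_sorted_matrix(grid_ids: np.ndarray, travel_times: np.ndarray, out_path: str) -> None:
    # Sort by grid id once, at caching time, and store the ordered arrays together.
    order = np.argsort(grid_ids, kind="stable")
    np.savez(out_path, grid_ids=grid_ids[order], travel_times=travel_times[order])
```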

EPajares commented 1 year ago

As discussed, this is something we can refactor in the future.

p4b-bro[bot] commented 11 months ago

This task/issue was closed on Tue Jun 06 2023 ✅