goat-community / goat

This is the home of Geo Open Accessibility Tool (GOAT)
GNU General Public License v3.0

Compute mathematical operations on two numpy arrays #1783

Closed EPajares closed 1 year ago

EPajares commented 1 year ago

Goal of this issue

We have two numpy arrays:

  1. Grid_ids: This array contains duplicates, which indicate that we have several calculations per grid_id. An example would be around 150k unique values in a grid with 1.5 million entries in total.

  2. Traveltimes: This array holds the travel times, e.g. from 0-60 minutes. It has the same size as grid_ids and both are in the same order, i.e. travel_times[0] is the value for grid_ids[0], and so on.

Both arrays can be considered integers, with the grid_ids being big integers. With this data structure we need to perform different operations, such as min(travel_time) for each grid_id or the average travel time per grid_id. But we also have more complex logic for the accessibility calculations. I know we can do this with pd.DataFrame, but that is slow because we would first need to load the data into the dataframe. Therefore I was thinking about a solution in numba. For all functions we need to loop through both arrays.
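
For illustration, a tiny dummy version of the two arrays could look like this (the values are made up):

import numpy as np

# several rows can share the same grid_id, one row per calculation
grid_ids = np.array([101, 101, 101, 102, 102, 103], dtype=np.uint64)
# travel times in minutes, aligned with grid_ids: travel_times[0] belongs to grid_ids[0]
travel_times = np.array([5, 12, 7, 30, 25, 60], dtype=np.uint8)

# e.g. the minimum travel time for grid_id 101 is 5 and for 102 it is 25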

I would try starting with the following:

The functions could look like this: min_per_grid(grid_ids, travel_times, max_travel_time)

The max_travel_time is a cutoff value. So if the user says 20, this means only values <= 20 should be considered for the operations. @metemaddar can you give it a try inside this file:

https://github.com/goat-community/goat/blob/feature/walking-matrix/app/api/src/core/heatmap.py

For now you can just work with some dummy arrays that you generate using numpy.
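
A minimal sketch of what such a function could look like, using numba and dummy arrays (the signature min_per_grid(grid_ids, travel_times, max_travel_time) is taken from above; the implementation details and return format are only an assumption):

import numpy as np
from numba import njit

@njit
def min_per_grid(grid_ids, travel_times, max_travel_time):
    # sort both arrays by grid_id so that equal ids are adjacent
    order = np.argsort(grid_ids)
    sorted_ids = grid_ids[order]
    sorted_times = travel_times[order]

    # worst case every grid_id is unique, so preallocate full size and trim at the end
    out_ids = np.empty_like(sorted_ids)
    out_mins = np.empty_like(sorted_times)

    n = 0
    i = 0
    while i < sorted_ids.shape[0]:
        gid = sorted_ids[i]
        best = max_travel_time + 1  # sentinel: nothing within the cutoff yet
        while i < sorted_ids.shape[0] and sorted_ids[i] == gid:
            t = sorted_times[i]
            if t <= max_travel_time and t < best:
                best = t
            i += 1
        if best <= max_travel_time:  # keep the grid only if some value was within the cutoff
            out_ids[n] = gid
            out_mins[n] = best
            n += 1
    return out_ids[:n], out_mins[:n]

# dummy data roughly matching the sizes mentioned above
grid_ids = np.random.randint(0, 150000, 1500000, np.uint32)
travel_times = np.random.randint(1, 120, 1500000, np.uint8)
ids, mins = min_per_grid(grid_ids, travel_times, 20)

Note that grids where no travel time is within the cutoff are dropped here; whether that is the desired behaviour (versus e.g. returning a fill value) is still open.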

Resources

We might have to use numba for this, as it is more than just math on numpy arrays; we also need some conditions.

Deliverables

Branch to derive

https://github.com/goat-community/goat/blob/feature/walking-matrix/

metemaddar commented 1 year ago

We can have it like this:

import numpy as np

small = 15  # number of distinct grid ids
big = 50    # total number of rows
print("run:")

# dummy data: grid ids with duplicates and one travel time per row
grid_ids = np.random.randint(0, small, big, np.uint32)
print(grid_ids.shape)
travel_time = np.random.randint(1, 120, big, np.uint8)
# print(travel_time.shape)

# stack into one table with columns (grid_id, travel_time)
table_ = np.vstack((grid_ids, travel_time))
table = table_.transpose()
print(table.shape)
# print(table)

# sort by grid_id so equal ids are adjacent, then find the start index of each group
sorted_table = table[table[:, 0].argsort()]
unique = np.unique(sorted_table[:, 0], return_index=True)
print(unique[0])

# split the travel-time column into one sub-array per grid_id
table_split = np.split(sorted_table[:, 1].transpose(), unique[1][1:])
print(table_split)

# aggregate each group separately
averages = np.empty(unique[1].shape[0], np.float32)
medians = np.empty(unique[1].shape[0], np.float32)
mins = np.empty(unique[1].shape[0], np.uint32)
counts = np.empty(unique[1].shape[0], np.uint8)
for i, travel_time in enumerate(table_split):
    averages[i] = np.average(travel_time)
    mins[i] = np.min(travel_time)
    medians[i] = np.median(travel_time)
    counts[i] = travel_time.shape[0]
print(averages)
print(mins)
print(medians)
print(counts)
metemaddar commented 1 year ago

Instead of using np.split, we can loop over the unique indexes directly; this decreases memory consumption and increases speed.

import numpy as np
from time import time
from numba import njit

small = 150000  # number of distinct grid ids
big = 1500000   # total number of rows
print("run:")
grid_ids = np.random.randint(0, small, big, np.uint32)
print(grid_ids.shape)
travel_time = np.random.randint(1, 120, big, np.uint8)

# table with columns (grid_id, travel_time), sorted by grid_id,
# plus the start index of every grid_id group
table_ = np.vstack((grid_ids, travel_time))
table = table_.transpose()
sorted_table = table[table[:, 0].argsort()]
unique = np.unique(sorted_table[:, 0], return_index=True)

@njit
def calc(sorted_table, unique):
    # append the table length so the last group also has an end index
    unique2 = np.concatenate((unique[1], np.array([sorted_table.shape[0]])))
    averages = np.empty(unique[1].shape[0], np.float32)
    medians = np.empty(unique[1].shape[0], np.float32)
    mins = np.empty(unique[1].shape[0], np.uint8)
    counts = np.empty(unique[1].shape[0], np.uint8)
    for i in range(unique2.shape[0] - 1):
        # travel times of one grid_id: rows unique2[i]..unique2[i+1], column 1
        travel_time = sorted_table[unique2[i]:unique2[i + 1], 1]
        averages[i] = np.average(travel_time)
        mins[i] = np.min(travel_time)
        medians[i] = np.median(travel_time)
        counts[i] = travel_time.shape[0]

# first call compiles the numba function, so only time the second run
calc(sorted_table, unique)

start_time = time()
table_ = np.vstack((grid_ids, travel_time))
table = table_.transpose()
sorted_table = table[table[:, 0].argsort()]
unique = np.unique(sorted_table[:, 0], return_index=True)
calc(sorted_table, unique)
end_time = time()
print(f"time: {end_time - start_time} s")

For 1.5M rows:

run:
(1500000,)
time: 0.20011115074157715 s
EPajares commented 1 year ago

Thanks @metemaddar, this looks good. Can we make this faster? Also, we actually need to split each operation into a separate function, so they operate independently from each other.

Each individual function should take approx. 0.01 s. Can we achieve this?

metemaddar commented 1 year ago

If we separate the functions, most of them take no more than 0.01 s. But the preparation (mostly the sorting) takes more time, about 0.14 s. Here is another test:

import numpy as np
from time import time
from numba import njit, prange
small = 150000
big = 1500000
print("run:")
grid_ids = np.random.randint(0,small,big,np.uint32)
print(grid_ids.shape)
travel_time = np.random.randint(1,120,big,np.uint8)
table_ = np.vstack((grid_ids,travel_time))
table = table_.transpose()
sorted_table = table[table[:,0].argsort()]
unique = np.unique(sorted_table[:,0],return_index=True)

@njit()
def medians(sorted_table, unique):
    travel_times = sorted_table.transpose()[1]
    medians = np.empty(unique[1].shape[0], np.float32)
    for i in range(unique[1].shape[0] - 1):
        # travel times of one grid_id: from its start index to the next group's start index
        travel_time = travel_times[unique[1][i]:unique[1][i + 1]]
        medians[i] = np.median(travel_time)
    else:
        # the else branch runs once after the loop and handles the last group,
        # which has no following start index
        travel_time = travel_times[unique[1][i + 1]:]
        medians[i + 1] = np.median(travel_time)
    return medians

@njit()
def mins(sorted_table, unique):
    travel_times = sorted_table.transpose()[1]
    mins = np.empty(unique[1].shape[0],np.float32)
    for i in range(unique[1].shape[0]-1):
        travel_time = travel_times[unique[1][i]:unique[1][i+1]]
        mins[i] = np.min(travel_time)
    else:
        travel_time = travel_times[unique[1][i+1]:]
        mins[i+1] = np.min(travel_time)
    return mins

@njit()
def counts(sorted_table, unique):
    travel_times = sorted_table.transpose()[1]
    counts = np.empty(unique[1].shape[0],np.float32)
    for i in range(unique[1].shape[0]-1):
        travel_time = travel_times[unique[1][i]:unique[1][i+1]]
        counts[i] = travel_time.shape[0]
    else:
        travel_time = travel_times[unique[1][i+1]:]
        counts[i+1] = travel_time.shape[0]
    return counts

@njit()
def averages(sorted_table, unique):
    travel_times = sorted_table.transpose()[1]
    averages = np.empty(unique[1].shape[0],np.float32)
    for i in range(unique[1].shape[0]-1):
        travel_time = travel_times[unique[1][i]:unique[1][i+1]]
        averages[i] = np.average(travel_time)
    else:
        travel_time = travel_times[unique[1][i+1]:]
        averages[i+1] = np.average(travel_time)
    return averages

# call each function once so numba compiles it before the timed runs
medians(sorted_table, unique)
mins(sorted_table, unique)
counts(sorted_table, unique)
averages(sorted_table, unique)

start_time_0 = time()
table_ = np.vstack((grid_ids,travel_time))
table = table_.transpose()
start_time = time()
sorted_table = table[table[:,0].argsort()]
end_time=time()
print(f"sorting time: {end_time-start_time} s")
start_time = time()
unique = np.unique(sorted_table[:,0],return_index=True)
end_time=time()
print(f"finding uniques time: {end_time-start_time} s")
start_time = time()
medians(sorted_table, unique)
end_time=time()
print(f"median time: {end_time-start_time} s")
start_time = time()
mins(sorted_table, unique)
end_time=time()
print(f"mins time: {end_time-start_time} s")
start_time = time()
counts(sorted_table, unique)
end_time=time()
print(f"counts time: {end_time-start_time} s")
start_time = time()
averages(sorted_table, unique)
end_time=time()
print(f"averages time: {end_time-start_time} s")
print(f"full time: {end_time-start_time_0} s")

Results:

sorting:          0.1496894359588623 s
finding uniques:  0.012337207794189453 s
median:           0.04173874855041504 s
mins:             0.0034613609313964844 s
counts:           0.00020384788513183594 s
averages:         0.0033130645751953125 s
full:             0.21166563034057617 s
metemaddar commented 1 year ago

For 15 M records with 1.5 M distinct grid_ids, the results are as follows:

sorting time: 2.386554718017578 s
finding uniques time: 0.2212386131286621 s
median time: 6.947429180145264 s
mins time: 0.03871321678161621 s
counts time: 0.002310514450073242 s
averages time: 0.04162120819091797 s
full time: 9.65637469291687 s

I need to profile the memory consumption as well.
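
One way to do that (just a suggestion, not something we discussed yet) would be Python's built-in tracemalloc, which also tracks numpy allocations:

import tracemalloc
import numpy as np

tracemalloc.start()

# build the same structures as above and check how much memory they take
grid_ids = np.random.randint(0, 150000, 1500000, np.uint32)
travel_time = np.random.randint(1, 120, 1500000, np.uint8)
table = np.vstack((grid_ids, travel_time)).transpose()
sorted_table = table[table[:, 0].argsort()]

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1024 ** 2:.1f} MiB, peak: {peak / 1024 ** 2:.1f} MiB")
tracemalloc.stop()

Allocations made inside the njit functions themselves go through numba's own runtime and may not be visible to tracemalloc, so this mainly covers the preparation steps.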

EPajares commented 1 year ago

This is great. We now need some more complex functions. Some of them are specific to accessibility.

Modified Gaussian function:

f(t_{ij}) = e^{-t_{ij}^{2} / \beta}

where t_{ij} is the travel time and \beta is the sensitivity parameter.

This function should accept different sensitivity parameters beta. Furthermore, we would like to apply a cutoff value, i.e. a value that excludes travel times higher than a certain threshold.

The function could look like this: modified_gaussian_per_grid(sorted_table, unique, sensitivity, cutoff)

metemaddar commented 1 year ago

And which travel time should we use? Average, min, median, or all of them?

And if a value is higher than the cutoff, should we use the cutoff as t?

EPajares commented 1 year ago

Sorry, I forgot to mention that this is done for each travel time and needs to be summed up per grid at the end. If the value is above the cutoff, we should just return 0.
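
A minimal sketch of what modified_gaussian_per_grid could look like under these assumptions (apply the Gaussian to every travel time, let values above the cutoff contribute 0, and sum the contributions per grid); the signature is from the comment above, the implementation details are only a suggestion:

import numpy as np
from numba import njit

@njit
def modified_gaussian_per_grid(sorted_table, unique, sensitivity, cutoff):
    travel_times = sorted_table[:, 1]
    # start index of every group plus the table length as the final end index
    starts = np.concatenate((unique[1], np.array([sorted_table.shape[0]])))
    accessibility = np.zeros(unique[1].shape[0], np.float64)
    for i in range(unique[1].shape[0]):
        for j in range(starts[i], starts[i + 1]):
            t = travel_times[j] * 1.0  # promote to float so squaring cannot overflow
            if t <= cutoff:
                # modified Gaussian: f(t) = exp(-t^2 / beta)
                accessibility[i] += np.exp(-(t * t) / sensitivity)
            # values above the cutoff simply contribute 0
    return accessibility

It would be called with the same sorted_table and unique as the other functions, e.g. modified_gaussian_per_grid(sorted_table, unique, 300.0, 20); the sensitivity and cutoff values here are only placeholders.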