EPajares closed this issue 1 year ago
We can have it like this:

```python
import numpy as np

small = 15
big = 50
print("run:")
grid_ids = np.random.randint(0, small, big, np.uint32)
print(grid_ids.shape)
travel_time = np.random.randint(1, 120, big, np.uint8)
# print(travel_time.shape)
table_ = np.vstack((grid_ids, travel_time))
table = table_.transpose()
print(table.shape)
# print(table)
sorted_table = table[table[:, 0].argsort()]
unique = np.unique(sorted_table[:, 0], return_index=True)
print(unique[0])
# split the travel-time column at the start index of every grid id group
table_split = np.split(sorted_table[:, 1], unique[1][1:])
print(table_split)
averages = np.empty(unique[1].shape[0], np.float32)
medians = np.empty(unique[1].shape[0], np.float32)
mins = np.empty(unique[1].shape[0], np.uint32)
counts = np.empty(unique[1].shape[0], np.uint8)
for i, travel_time in enumerate(table_split):
    averages[i] = np.average(travel_time)
    mins[i] = np.min(travel_time)
    medians[i] = np.median(travel_time)
    counts[i] = travel_time.shape[0]
print(averages)
print(mins)
print(medians)
print(counts)
```
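On a small deterministic input the intermediate steps are easier to inspect. This is just an illustration (the values below are made up), showing what `np.unique(..., return_index=True)` plus `np.split` produce:

```python
import numpy as np

# toy input: one grid id and one travel time per row
grid_ids = np.array([2, 0, 1, 0, 2, 0], np.uint32)
travel_time = np.array([30, 10, 20, 50, 40, 60], np.uint8)

table = np.vstack((grid_ids, travel_time)).transpose()
sorted_table = table[table[:, 0].argsort()]  # rows grouped by grid id

# first_index[k] is the row where grid id values[k] starts
values, first_index = np.unique(sorted_table[:, 0], return_index=True)
groups = np.split(sorted_table[:, 1], first_index[1:])

print(values)  # the distinct grid ids: [0 1 2]
print(groups)  # one array of travel times per grid id
```

Each entry of `groups` can then be fed to `np.min`, `np.median`, etc.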
Instead of using `np.split`, we can loop over the unique start indexes directly; this decreases memory consumption and increases speed.
```python
import numpy as np
from time import time
from numba import njit

small = 150000
big = 1500000
print("run:")
grid_ids = np.random.randint(0, small, big, np.uint32)
print(grid_ids.shape)
travel_time = np.random.randint(1, 120, big, np.uint8)
table_ = np.vstack((grid_ids, travel_time))
table = table_.transpose()
sorted_table = table[table[:, 0].argsort()]
unique = np.unique(sorted_table[:, 0], return_index=True)

@njit
def calc(sorted_table, unique):
    # append the end-of-array index so the last group gets a closing boundary
    unique2 = np.concatenate((unique[1], np.array([sorted_table.shape[0]])))
    averages = np.empty(unique[1].shape[0], np.float32)
    medians = np.empty(unique[1].shape[0], np.float32)
    mins = np.empty(unique[1].shape[0], np.uint8)
    counts = np.empty(unique[1].shape[0], np.uint32)  # uint8 would overflow for grids with >255 rows
    for i in range(unique2.shape[0] - 1):
        # take only the travel-time column, not the whole two-column row
        travel_time = sorted_table[unique2[i]:unique2[i + 1], 1]
        averages[i] = np.average(travel_time)
        mins[i] = np.min(travel_time)
        medians[i] = np.median(travel_time)
        counts[i] = travel_time.shape[0]
    return averages, medians, mins, counts

calc(sorted_table, unique)  # warm-up call to trigger the numba compilation
start_time = time()
table_ = np.vstack((grid_ids, travel_time))
table = table_.transpose()
sorted_table = table[table[:, 0].argsort()]
unique = np.unique(sorted_table[:, 0], return_index=True)
calc(sorted_table, unique)
end_time = time()
print(f"time: {end_time-start_time} s")
```
For 1.5 M rows:

```
run:
(1500000,)
time: 0.20011115074157715 s
```
Thanks @metemaddar, this looks good. Can we make this faster? We also need to split each operation into a separate function, so they operate independently from each other. Each individual function should take approx. 0.01 s; can we achieve this?
If we separate the functions, most of them take no more than 0.01 s, but the preparation (mostly the sorting) takes more time, about 0.14 s. Here is another test:
```python
import numpy as np
from time import time
from numba import njit

small = 150000
big = 1500000
print("run:")
grid_ids = np.random.randint(0, small, big, np.uint32)
print(grid_ids.shape)
travel_time = np.random.randint(1, 120, big, np.uint8)
table_ = np.vstack((grid_ids, travel_time))
table = table_.transpose()
sorted_table = table[table[:, 0].argsort()]
unique = np.unique(sorted_table[:, 0], return_index=True)

@njit
def medians(sorted_table, unique):
    travel_times = sorted_table.transpose()[1]
    starts = unique[1]
    medians = np.empty(starts.shape[0], np.float32)
    for i in range(starts.shape[0] - 1):
        medians[i] = np.median(travel_times[starts[i]:starts[i + 1]])
    medians[-1] = np.median(travel_times[starts[-1]:])  # last group runs to the end
    return medians

@njit
def mins(sorted_table, unique):
    travel_times = sorted_table.transpose()[1]
    starts = unique[1]
    mins = np.empty(starts.shape[0], np.uint8)  # travel times fit in uint8
    for i in range(starts.shape[0] - 1):
        mins[i] = np.min(travel_times[starts[i]:starts[i + 1]])
    mins[-1] = np.min(travel_times[starts[-1]:])
    return mins

@njit
def counts(sorted_table, unique):
    starts = unique[1]
    counts = np.empty(starts.shape[0], np.uint32)  # uint8 would overflow for grids with >255 rows
    for i in range(starts.shape[0] - 1):
        counts[i] = starts[i + 1] - starts[i]
    counts[-1] = sorted_table.shape[0] - starts[-1]
    return counts

@njit
def averages(sorted_table, unique):
    travel_times = sorted_table.transpose()[1]
    starts = unique[1]
    averages = np.empty(starts.shape[0], np.float32)
    for i in range(starts.shape[0] - 1):
        averages[i] = np.average(travel_times[starts[i]:starts[i + 1]])
    averages[-1] = np.average(travel_times[starts[-1]:])
    return averages

# warm-up calls to trigger the numba compilation
medians(sorted_table, unique)
mins(sorted_table, unique)
counts(sorted_table, unique)
averages(sorted_table, unique)

start_time_0 = time()
table_ = np.vstack((grid_ids, travel_time))
table = table_.transpose()
start_time = time()
sorted_table = table[table[:, 0].argsort()]
end_time = time()
print(f"sorting time: {end_time-start_time} s")
start_time = time()
unique = np.unique(sorted_table[:, 0], return_index=True)
end_time = time()
print(f"finding uniques time: {end_time-start_time} s")
start_time = time()
medians(sorted_table, unique)
end_time = time()
print(f"median time: {end_time-start_time} s")
start_time = time()
mins(sorted_table, unique)
end_time = time()
print(f"mins time: {end_time-start_time} s")
start_time = time()
counts(sorted_table, unique)
end_time = time()
print(f"counts time: {end_time-start_time} s")
start_time = time()
averages(sorted_table, unique)
end_time = time()
print(f"averages time: {end_time-start_time} s")
print(f"full time: {end_time-start_time_0} s")
```
Results:

```
sorting: 0.1496894359588623 s
finding uniques: 0.012337207794189453 s
median: 0.04173874855041504 s
mins: 0.0034613609313964844 s
counts: 0.00020384788513183594 s
averages: 0.0033130645751953125 s
full: 0.21166563034057617 s
```
For 15 M records with 1.5 M distinct grid_ids, the results are as follows:

```
sorting time: 2.386554718017578 s
finding uniques time: 0.2212386131286621 s
median time: 6.947429180145264 s
mins time: 0.03871321678161621 s
counts time: 0.002310514450073242 s
averages time: 0.04162120819091797 s
full time: 9.65637469291687 s
```
I need to profile the memory consumption as well.
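One lightweight option for this (an assumption on my side, not yet tested against the real pipeline) would be the stdlib `tracemalloc`, which also tracks NumPy's data allocations in recent NumPy versions:

```python
import tracemalloc
import numpy as np

tracemalloc.start()

# same preparation steps as in the benchmark above
big = 1_500_000
grid_ids = np.random.randint(0, 150_000, big, np.uint32)
travel_time = np.random.randint(1, 120, big, np.uint8)
table = np.vstack((grid_ids, travel_time)).transpose()
sorted_table = table[table[:, 0].argsort()]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")
```

`memory_profiler` would be an alternative if per-line numbers are needed.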
This is great. We now need some more complex functions; some of them are specific to accessibility.
Modified Gaussian function:

$$f(t_{ij}) = e^{-t_{ij}^{2}/\beta}$$

where $t_{ij}$ is the travel time and $\beta$ is the sensitivity.

This function should receive different sensitivity parameters $\beta$. Furthermore, we would like to apply a cutoff value, i.e. a threshold above which travel times are excluded.

The function could look like this: `modified_gaussian_per_grid(sorted_table, unique, sensitivity, cutoff)`
Which travel time should we use: the average, min, median, or all of them? And if a value is higher than the cutoff, should we use the cutoff as $t$?

Sorry, I forgot to mention: this is done for each travel time and needs to be summed up per grid at the end. If the value is above the cutoff we should just return 0.
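A minimal sketch of that behavior, following the signature suggested above (plain Python/NumPy here for readability; the loop would get numba's `@njit` in practice, and the dtypes are my assumptions):

```python
import numpy as np

def modified_gaussian_per_grid(sorted_table, unique, sensitivity, cutoff):
    """Sum f(t) = e^(-t^2 / sensitivity) over the travel times of each grid;
    travel times above the cutoff contribute 0 to the sum."""
    travel_times = sorted_table[:, 1]
    starts = unique[1]
    sums = np.zeros(starts.shape[0], np.float32)
    for i in range(starts.shape[0]):
        # each group ends where the next one starts; the last runs to the end
        end = starts[i + 1] if i + 1 < starts.shape[0] else travel_times.shape[0]
        for t in travel_times[starts[i]:end]:
            if t <= cutoff:
                sums[i] += np.exp(-float(t) ** 2 / sensitivity)
    return sums
```

The `float(t)` cast avoids squaring inside a small unsigned integer type, where `t * t` could overflow.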
### Goal of this issue

We have two numpy arrays:

- `grid_ids`: this array contains duplicates, which indicate that we have several calculations per grid_id. An example would be 150 k unique values in a grid with 1.5 million entries in total.
- `travel_times`: travel times of e.g. 0-60 minutes. This array has the same size as `grid_ids` and the two are in the same order, so `travel_times[0]` is the value for `grid_ids[0]`, and so on.

Both arrays can be considered integers, with the grid_ids being big integers. With this data structure we need to perform different operations, such as min(travel_time) per grid_id or the average travel time per grid_id, but we also have more complex logic for the accessibility calculations. I know we can do this with pd.DataFrame, but it is slow because we first need to load the data into the dataframe. Therefore I was thinking about a solution in numba. For all functions we need to loop through both arrays.
I would try starting with the following. The function could look like this: `min_per_grid(grid_ids, travel_times, max_travel_time)`

The `max_travel_time` is a cutoff value: if the user passes 20, then all values <= 20 should be considered for the operation. @metemaddar can you give it a try inside this file:
https://github.com/goat-community/goat/blob/feature/walking-matrix/app/api/src/core/heatmap.py
For now you can just work with some dummy arrays that you generate using numpy.
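A dummy-array sketch of `min_per_grid` could look like this (the dense result array and the 255 sentinel are my assumptions, not part of the spec; in the real code the loop would be wrapped in numba's `@njit`):

```python
import numpy as np

def min_per_grid(grid_ids, travel_times, max_travel_time):
    """Single pass over both arrays; travel times above max_travel_time are ignored.
    The result is indexed by grid id; 255 marks grids without a surviving value
    (assumes travel times fit in uint8 and grid ids are reasonably dense)."""
    out = np.full(grid_ids.max() + 1, 255, np.uint8)
    for i in range(grid_ids.shape[0]):
        t = travel_times[i]
        if t <= max_travel_time and t < out[grid_ids[i]]:
            out[grid_ids[i]] = t
    return out

# dummy data, kept small so the pure-Python loop stays fast
grid_ids = np.array([0, 0, 2, 2], np.uint32)
travel_times = np.array([30, 10, 50, 15], np.uint8)
print(min_per_grid(grid_ids, travel_times, 20))  # grid 0 -> 10, grid 2 -> 15
```

The single pass avoids the sort entirely, which is the expensive preparation step in the benchmarks above; whether a dense output array is acceptable depends on how sparse the grid ids really are.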
### Resources

We might have to use numba for this, as it is more than just maths in numpy: we also need some conditions.
### Deliverables

Branch to derive from:
https://github.com/goat-community/goat/blob/feature/walking-matrix/