dimpase opened this issue 7 years ago
I want to calculate the average (mean) distance (Euclidean or another metric) between all pairs of points.
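For context, the obvious serial baseline (a minimal sketch on random stand-in data, not this repo's code) is a single pdist call, but it materialises all N*(N-1)/2 distances at once, which is exactly what becomes infeasible at this scale:

import numpy as np
from scipy.spatial.distance import pdist

X = np.random.rand(1000, 2)  # small random stand-in for the real data
d = pdist(X)                 # condensed vector of all N*(N-1)/2 pairwise distances
print(d.mean())              # the mean pairwise (Euclidean) distance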
I'm still working on it, so I can't be sure my hunch is right, but I suspect there must be a mathematical (and thus algorithmic) solution to generating a 1D-iterable of indices that can be passed to a Process, which then calls a function like scipy.spatial.distance.pdist on a sub-array of the original created from those indices. I'm hoping such a formula will make it easy to produce native slice objects for easy indexing. Once it's confirmed to work on the small simulated data in the tests/ directory, it'll be easier to figure out how to get it to work without loading a whole data file in at once.
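To make that idea a bit more concrete, here is a hedged sketch (the names are mine, not from the repo) of a worker that takes a contiguous slice of row indices and sums distances from each of its rows to all later rows, so every pair is counted exactly once; handing each (start, stop) range to its own Process would be the parallel step:

import numpy as np
from scipy.spatial.distance import cdist

def block_sum(X, start, stop):
    # sum and count of distances from rows [start, stop) to all later rows
    s, c = 0.0, 0
    for i in range(start, stop):
        d = cdist(X[i:i + 1], X[i + 1:])  # distances from row i to rows i+1..N-1
        s += d.sum()
        c += d.size
    return s, c

X = np.random.rand(1000, 2)
s, c = block_sum(X, 0, len(X))
print(s / c)  # mean pairwise distance, without ever building the full matrix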
I thought about opening this issue separately, but it's closely related enough to this one that I figured I'd just post here.
It occurs to me that this would be a lot easier to plan conceptually if the format of the intended data is specified. Right now I'm working on an update to the tests that creates a tempfile of the simulated data, so the specifics of how the code will interact with a datafile can be better anticipated. I think I heard @oliviaguest mention on twitter that the file is a csv with 50 million points, but I don't know if I'm remembering correctly or what the format of the file is. I've come across several potential solutions to mapping the file in memory without blowing up RAM usage, but it'll help me keep my constraints reasonable if I know how the data is meant to be formatted and needs to be read.
Yes, it's a csv file with a bunch of columns. Most of them are completely irrelevant except three. The ones we need for this computation are the latitude and longitude and a weight or count.
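(If it helps, a hedged sketch of reading only those three columns; the column names below are assumptions, not the real header:)

import pandas as pd

cols = ['latitude', 'longitude', 'count']  # hypothetical column names
# usecols skips the irrelevant columns; chunksize=... could stream the file instead
df = pd.read_csv('data.csv', usecols=cols, dtype='float32')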
I also need the great_circle distance metric as I was wrong about Euclidean, so some flexibility is required with respect to the metric. I hacked some code to insert great_circle but I've not managed to run it on my data yet because it sucks up all the RAM. 😅
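For reference, assuming this is geopy's great_circle, it takes (lat, lon) tuples directly:

from geopy.distance import great_circle

# distance between two (lat, lon) points, in kilometres
print(great_circle((51.5074, -0.1278), (40.7128, -74.0060)).km)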
I must say that while I have experience with parallel coding, I have never done it in Python, and I've now come to realise that I was too optimistic, my apologies... In fact, Python (more precisely CPython, the default implementation) has a huge problem here called the GIL, which prevents it from using threading efficiently. The most obvious (to me) way out would be to write this part in Cython, which lets you release the GIL and use all the cores on shared memory, but then that code must not touch any Python objects, i.e. it must operate on "pure" C (or Cython) data (although it seems that NumPy arrays can be used, too).
I'll try this. Thanks for the pointer!
Apparently one can also combine Cython's nogil switch with Python's Thread from the threading module, which is probably easier than using Cython's OpenMP parallelism model.
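A minimal sketch of that combination, assuming a hypothetical pairsum.pyx module (the Euclidean formula inside is just a placeholder; the great-circle formula would go in the same spot). The hot loop touches only C-level data, so it can release the GIL, and ordinary threading.Thread workers can then run it concurrently:

# cython: boundscheck=False, wraparound=False
from libc.math cimport sqrt

def block_sum(double[:, ::1] X, Py_ssize_t start, Py_ssize_t stop):
    # sum of distances from rows [start, stop) to all later rows
    cdef double s = 0.0, dx, dy
    cdef Py_ssize_t i, j, n = X.shape[0]
    with nogil:  # no Python objects are touched inside this block
        for i in range(start, stop):
            for j in range(i + 1, n):
                dx = X[i, 0] - X[j, 0]
                dy = X[i, 1] - X[j, 1]
                s += sqrt(dx * dx + dy * dy)
    return s

Each Python thread then calls block_sum on its own (start, stop) range; because the GIL is released inside the loop, the threads genuinely run on separate cores.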
Aha, that's useful. So I've never used Cython properly. I'm not sure how successful I'll be — but I'll certainly try.
Thanks @oliviaguest, so would it be sufficient to simulate the data by simply having a csv (or npy, even) with dimensions (N, 3)? I'll assume each value is a 64-bit float, but it might save a bit of memory if a 32-bit float will suffice. Thanks for the note on the great circle distance as well. I should note that this is included in sklearn.metrics.DistanceMetric under the name "Haversine". I don't have anything against geopy, but since sklearn is already a dependency for generating the test data, it might let you avoid having another dependency for the project. I'll look into how scikit-learn's distance metric can be applied here, since it doesn't seem to be easily callable from the available pairwise-distance functions.
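In case it's useful, a hedged sketch of calling it directly (depending on the scikit-learn version, DistanceMetric lives in sklearn.neighbors or sklearn.metrics; inputs must be (lat, lon) in radians, and results are on the unit sphere):

import numpy as np
from sklearn.neighbors import DistanceMetric

haversine = DistanceMetric.get_metric('haversine')
points = np.radians([[51.5074, -0.1278],    # London
                     [40.7128, -74.0060]])  # New York
# pairwise() gives great-circle distances on the unit sphere;
# multiply by Earth's mean radius (~6371 km) to get kilometres
print(haversine.pairwise(points) * 6371.0)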
@dimpase: It's true that Python hasn't ever been terribly good at handling threads, since it's not really a scientific computing language, but as I understand it the GIL problem you're mentioning only applies to Python 3, while this project is being written in Python 2, so multi-threading should still have an effect. The GIL is only a problem in Python 2 if your project uses a lot of native Python objects, but we're mostly dealing with NumPy datatypes, which are compiled from C and Fortran and thus get around problems with the GIL most of the time, as you noted.
@GrantRVD no, the GIL behaviour is common to Pythons 2 and 3; see e.g. the official Python 2 docs, which say: "CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once."
It's more about Python being designed in pre-multicore times than anything else, and it is an implementation detail of CPython (there exist Python implementations which don't have a GIL at all, or which allow it to be switched on and off).
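If you want to convince yourself on either interpreter, a toy benchmark (nothing to do with this repo) shows CPU-bound pure-Python work getting no speedup from threads:

import threading
import time

def busy(n=5 * 10**6):
    # pure-Python, CPU-bound work: the GIL serialises this across threads
    s = 0
    for i in range(n):
        s += i

start = time.time()
busy(); busy()
print('serial: %.2fs' % (time.time() - start))

start = time.time()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print('two threads: %.2fs' % (time.time() - start))  # typically about the same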
@GrantRVD I'm not sure why you think the csv is that important to simulate, but yes, feel free to simulate it as a three-column one. 32-bit floats are fine; precision of that type is not required.
My understanding is that GIL is the same for all Python implementations.
I'm only loading sklearn for the test data, not in my project with the real data.
Thinking about this more and more. I'm going to try a different tack, getting this to work in series without running out of memory. I have also looked into postgres which apparently can do this calculation in parallel (?) in a couple of days.
I also have to move to another project for next week. So I will be less able to check stuff out, but obviously around.
Here is the way to partition pairs between k workers (almost) perfectly, and without the need to create pairs at all (see the line computing r): the breakpoints N(1 - sqrt((k-i)/k)) are chosen so that each worker gets an (almost) equal share of the N(N-1)/2 pairs, even though earlier rows have more partners than later ones.
from __future__ import division, print_function

def pdistsum(X, dist, k=1):
    """
    compute the sum of distances, given by dist(,), between elements of X
    using k threads
    """
    import threading
    from math import sqrt, ceil
    N = len(X)
    results = [None]*k

    def rsum(r0, r1, num):
        s = 0.0
        c = 0
        print("rsum instance "+str(num)+" from "+str(r0)+" to "+str(r1))
        for i in xrange(r0, r1):
            for j in xrange(i+1, N):
                c += 1
                s += dist(X[i], X[j])
        results[num] = (s, c)
        return

    # k+1 ranges: 0, N(1-sqrt((k-1)/k)), N(1-sqrt((k-2)/k)), ..., N
    r = [0] + map(lambda i: int(ceil(N*(1-sqrt((k-i)/k)))), xrange(1, k)) + [N]
    threads = []
    for i in range(k):
        t = threading.Thread(target=rsum, args=(r[i], r[i+1], i,))
        threads.append(t)
        t.start()
    for i in range(k):
        threads[i].join()
    return results  # we should actually return sum(results[i][0] for i in range(k))

### testing ####

def exdist(a, b):  # example of dist() between 2 numbers
    from math import sqrt
    return sqrt(abs(a-b))

pdistsum(range(10**3), exdist, k=4)  # example with 4 threads
So this would be parallel, if not for the GIL....
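For completeness, turning the per-thread (sum, count) pairs into a mean, as the comment on the return statement hints, would look something like:

results = pdistsum(range(10**3), exdist, k=4)
total = sum(s for s, c in results)   # grand total of all pairwise distances
count = sum(c for s, c in results)   # total number of pairs
print(total / count)                 # the mean pairwise distance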
OMG that's beautiful. Thank you so much. Playing with it now — likely to make it uglier as I need some small additions but thank you so so much. 🤗
Still need to get my head around your code 100% @dimpase but I think there is a bug (assertion fails when comparing to serial scipy)? Check out a version where I compare it to pdist here (with the great_circle metric and with some test data, although no weights yet — weights set to one): https://github.com/oliviaguest/pairwise_distance/blob/master/dima_version.py
I will keep at it though and figure out what I've likely misunderstood.
@oliviaguest a bug is introduced in lines 39-40 of https://github.com/oliviaguest/pairwise_distance/blob/master/dima_version.py

for result in results:
    total_sum += results[0][0]

you meant

for result in results:
    total_sum += result[0]
@dimpase are you sure? I tried it your way first. I get this error raised:
TypeError: unsupported operand type(s) for +=: 'int' and 'tuple'
Didn't you mix up result and results?
Haha 😜 yes, yes, I did! Thank you!
Computing pairwise distances (as a matrix, or otherwise) for more than (say) 500K points is not meaningful, as the difficulty of allocating and accessing that much space outweighs any speedup you might gain.
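For scale, the arithmetic behind that claim:

n = 500000
full = n * n * 8 / 1e12                  # full float64 matrix, in terabytes
condensed = n * (n - 1) / 2 * 8 / 1e12   # condensed pdist vector, in terabytes
print(full)       # ~2.0 TB
print(condensed)  # ~1.0 TB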
A better question is: what do you want to do with these huge matrices?