snakers4 opened this issue 6 years ago
Ran a benchmark with these params on 500k * 50 data:
umap = umap.UMAP(
n_neighbors=200,
min_dist=0.0,
n_components=2,
metric='euclidean'
)
umap = umap.UMAP(
n_neighbors=30,
min_dist=0.0,
n_components=2,
metric='euclidean'
)
Nothing much happened for hours.
Then I just tried:
umap = umap.UMAP(
n_components=2
)
And after an hour or so it started using all of the processors.
@snakers4 I don't think there is direct GPU support, but it is listed on the roadmap with no priority. The UMAP source code does make use of numba, which can be used to target the GPU, but you would have to make those changes yourself.
The current dataset I use is 120k * 15, but I am running a search algorithm to find the best possible combination of HDBSCAN and UMAP parameters using multiprocessing.
If you are using Linux, I would open a terminal and run top, which will show you the top running processes. This will let you know what Python is doing; I am willing to bet that Python will have one process running the CPU at 100% (or close to it) when you use that large of a dataset.
I hope this helps,
Ralph
There are a few things to deal with here.
First, for your dataset (500,000) you will need code that is currently in a branch not merged with master. There are some issues once you get beyond 100,000 points that I have finally resolved but not merged.
Second, it will use vectorized instructions as much as numba will let it -- numba is a library that can take numerical Python code and compile it to LLVM, so whatever LLVM and numba can do will be happening. Since I'm still working at the algorithmic level I have not concerned myself with that level of optimization.
Third, if it is using all CPUs and taking a long time ... that means that the spectral embedding is busy failing since there isn't a large enough eigengap. Try using init='random' to avoid this. I'm still trying to figure out the right way to handle that sort of issue.
Fourth, try to keep n_neighbors small; the larger it is, especially for large datasets, the worse the performance. Part of the efficiency of UMAP comes from its ability to make use of much smaller nearest neighbor sets than comparable algorithms.
Fifth, I do recommend trying on a sub-sample of your dataset first to make sure things are working before scaling up.
Finally, UMAP ultimately relies on SGD for a portion of the computation. Of course, as these things go it is a bit of a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization, but that is beyond my current knowledge (as it requires careful phrasing of the problem I suspect). If you would like to look into that we can discuss the possibilities further.
Hi, first of all many thanks for your replies)
For this dataset PCA (5000 => 2) + eyeballing the clusters worked fine for some reason. Data science is magic sometimes. I guess the clusters are linearly separable. But this is a good benchmark nevertheless (see my comments below).
Also it may be worth mentioning that some time ago my colleague reported running 2M * 10 datasets w/o problems, just using standard params w/o any tweaking.
First, for your dataset (500,000) you will need code that is currently in a branch not merged with master
I guess I will wait for the code to be merged.
Try using init='random'
try to keep n_neighbors small
trying on a sub-sample of your dataset
Tried this
import umap
umap = umap.UMAP(
n_neighbors=5,
n_components=2,
metric='euclidean',
init='random'
)
umap_vectors = umap.fit_transform(pca_vectors[0:50000])
Well, your advice helped:
As for the evaluation:
Just visually eyeballing the clusters worked - they are really similar.
I will report the visual composition of the above clusters a bit later.
Finally, UMAP ultimately relies on SGD for a portion of the computation. Of course, as these things go it is a bit of a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization, but that is beyond my current knowledge (as it requires careful phrasing of the problem I suspect). If you would like to look into that we can discuss the possibilities further.
Well, if you could share some minimum reading (i.e. one paper on the topic that you based this portion of the SGD pipeline on) and point to the part of the code with the SGD - if I am up to the task I will think about applying my skills to that.
In practice (work / competitions) I noticed that Adam is better than plain SGD in most cases. Modern high-level frameworks (PyTorch) have this implemented + they have GPU support, which may be a nice feature.
Visually evaluated the UMAP embeddings. They are really great. They look like the PCA embeddings, but further refined into a couple of sub-categories. But as you noticed, 100k+ fails.
This has been resolved (or should be resolved) as of v0.2.0 (i.e. the branch got merged). With regard to the SGD part of the code you would want the optimize_layout function. The most relevant paper would probably be the LargeVis paper, which this bears some similarity to.
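For orientation, here is a heavily simplified, illustrative sketch of the kind of negative-sampling SGD loop that optimize_layout performs. It is not the actual implementation: the real code is numba-compiled and uses the fitted a/b curve parameters, per-edge sampling rates and gradient clipping, and the gradient expressions below are toy stand-ins.

import numpy as np

def toy_optimize_layout(embedding, edges, n_epochs=200, lr=1.0, n_negatives=5, seed=0):
    # embedding: (n_points, dim) array updated in place; edges: list of (i, j) index pairs
    rng = np.random.default_rng(seed)
    n = embedding.shape[0]
    for epoch in range(n_epochs):
        alpha = lr * (1.0 - epoch / n_epochs)  # linearly decaying learning rate
        for i, j in edges:
            # attractive update: pull the endpoints of each edge together
            diff = embedding[i] - embedding[j]
            d2 = diff @ diff
            grad = -2.0 * diff / (1.0 + d2)  # toy attractive gradient
            embedding[i] += alpha * grad
            embedding[j] -= alpha * grad
            # repulsive updates: push i away from a handful of randomly sampled points
            for _ in range(n_negatives):
                k = rng.integers(n)
                diff = embedding[i] - embedding[k]
                d2 = diff @ diff
                grad = 2.0 * diff / ((0.001 + d2) * (1.0 + d2))  # toy repulsive gradient
                embedding[i] += alpha * grad
    return embedding

The negative sampling is what keeps the per-epoch cost proportional to the number of edges rather than to the number of point pairs.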
@snakers4
import pandas as pd  # import pandas
import numpy as np  # import numpy
from multiprocessing import Process  # from the standard python libs, for multi-core processing
import glob  # import to iterate through files in directories
import os.path  # import for creating os path objects
import matplotlib.pyplot as plt  # import used to create and save scatter plots
#from mpl_toolkits.mplot3d import Axes3D
from numpy import genfromtxt  # numpy import to read txt files as csv's
import umap  # the embedding algorithm, UMAP; this is the reduction method (or class) used
#from sklearn.decomposition import PCA
#from sklearn.preprocessing import MinMaxScaler

def data_grab(path, filename, fraction):  # this is the function that grabs the dataset
    df = pd.read_csv(str(path) + str(filename))  # read the dataset from a csv file into a pandas dataframe
    df = df.select_dtypes(include=[np.number])  # select only the columns that contain floats/ints; columns with NaNs, missing entries and strings are ignored
    df = df.sample(frac=fraction, random_state=32).reset_index(drop=True)  # take a sample of our dataset to use for embedding; random_state fixes the random seed
    df = df.as_matrix()  # convert the dataframe into matrix form, similar to a numpy ndarray (use .to_numpy() on newer pandas)
    return df  # return our 'fake' ndarray

def UMAP(n_neighbors, min_dist, metric, input_data):  # this is the embedding function
    global fraction, filename  # we need the filename of our dataset for folder creation, same for fraction
    try:
        data = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist, metric=metric, n_components=2).fit_transform(input_data)
    except Exception:
        print('Embedding_Failed')
        return None
    df_reduct = pd.DataFrame(data)  # our lower-dimensional embedding is saved as a pandas dataframe
    return df_reduct.to_csv('%s_embedding_%s/embedding,%s,%s,%s.csv' % (filename, fraction, n_neighbors, min_dist, metric))  # we are saving our embedding to a csv

def savefig(file):  # this function creates the scatter plots and saves them; currently the cmap/colouring does not work with my distro of anaconda (install problem), you can add it if it works for you
    global fraction, filename  # make sure we maintain our global variables for file saving
    fileparams = file.replace('%s_embedding_%s/embedding,' % (filename, fraction), '')  # take the embedding csv filename and strip the first part of the string
    # explanation
    # since multiprocessing does not share variables between instances, if we try to make a global results dataframe (or any structure) each process will create its own copy
    # of it and append to it; therefore, at the end we would have an empty dataframe because each process works on its own copy. Instead each process writes a file of the lower embedding,
    # and the parameters used to generate it are encoded in the filename. There is a way around this using multiprocessing queues; however, I do not know the lib well enough
    fileparams = fileparams.replace('.csv', '')
    fileparams = fileparams.strip()
    fileparams = fileparams.split(',')  # this leaves us with a list of 3 strings, the 3 UMAP arguments used
    if os.path.isfile('%s_embedding_%s/figures/Embedding-%s-%s-%s.pdf' % (filename, fraction, fileparams[0], fileparams[1], fileparams[2])):  # if the graph already exists, return nothing
        return None
    fig = plt.figure()  # load the embedding data from the file and save the graph
    plt.title('Embedding-%s-%s-%s' % (fileparams[0], fileparams[1], fileparams[2]))
    embedding = genfromtxt(file, delimiter=',', skip_header=1, usecols=(1, 2))
    plt.scatter(embedding[:, 0], embedding[:, 1])
    fig.savefig('%s_embedding_%s/figures/Embedding-%s-%s-%s.pdf' % (filename, fraction, fileparams[0], fileparams[1], fileparams[2]))
    plt.close()
    return None

if __name__ == '__main__':  # this is the main execution
    fraction = 0.01  # fraction of the original dataset used for the UMAP and HDBSCAN algorithms, must be in (0, 1.0]
    path = '/home/USER/anaconda3/envs/ML/scripts/ML_IDS_Dashboard/app1/static/input_files/'  # CHECK THIS PATH, THIS NEEDS TO BE THE PATH TO THE DATASET
    filename = 'Data.csv'  # CHECK THIS FILENAME, IT NEEDS TO MATCH THE DATASET CSV FILENAME; ALSO NOTE THAT NO LABELS ARE USED
    if not os.path.exists('%s_embedding_%s' % (filename, fraction)):  # checking and creating folders
        os.makedirs('%s_embedding_%s' % (filename, fraction))
    if not os.path.exists('%s_embedding_%s/log_files' % (filename, fraction)):
        os.makedirs('%s_embedding_%s/log_files' % (filename, fraction))
    if not os.path.exists('%s_embedding_%s/figures' % (filename, fraction)):
        os.makedirs('%s_embedding_%s/figures' % (filename, fraction))
    input_data = data_grab(path, filename, fraction)  # calling the data_grab function
    procs_UMAP = []
    n_neighbors = [15, 20, 40]  # these parameters will be looped through
    min_dist = [0.01, 0.05]  # these parameters will be looped through
    metric_iso = ['correlation', 'euclidean', 'manhattan']  # these parameters will be looped through; feel free to add more if you want to check more
    procs_savefig = []
    for a in n_neighbors:  # these loops create the 'processes' -- instance calls of the function (referred to as targets)
        for b in min_dist:
            for c in metric_iso:
                if os.path.isfile('%s_embedding_%s/embedding,%s,%s,%s.csv' % (filename, fraction, a, b, c)):
                    continue
                proc = Process(target=UMAP, args=(a, b, c, input_data))
                procs_UMAP.append(proc)  # appending all of our processes to the list
                proc.start()  # start the process
    for proc in procs_UMAP:
        proc.join()  # join all the processes currently in the list; this is looped over so that newly appended processes are joined with the ones already started
    filenames = glob.glob('%s_embedding_%s/*.csv' % (filename, fraction))  # looping through the files in the directory
    #data = np.random.rand(1260,4)
    for d in filenames:  # tried to use multiprocessing on the savefig function; however, matplotlib has some sort of caching that prevents saving multiple figures from multiple processes
        #strfile = str(d)
        #proc = Process(target=savefig, args=(strfile,))
        #procs_savefig.append(proc)
        #proc.start()
        savefig(d)
    #for proc in procs_savefig:
    #    proc.join()
I have been playing around with testing various combinations of UMAP with multiprocessing in Python; feel free to use this script if you want.
The code looks readable, except for the numba decorator, which is black magic for me.
I guess I will just end up comparing your latest release vs. just loading the data to GPU and optimizing using the same loss function.
This has been resolved (or should be resolved) as of v0.2.0 (i.e. the branch got merged)
Downloaded the latest release via
wget https://github.com/lmcinnes/umap/archive/master.zip
unzip master.zip
rm master.zip
cd umap-master
sudo pip install -r requirements.txt
python setup.py install
Used these settings
import umap
umap = umap.UMAP(
n_neighbors=5,
n_components=2,
metric='euclidean',
init='random'
)
umap_vectors = umap.fit_transform(pca_vectors_processed)
Ran these benchmarks
I guess there is some stochasticity in this process; I will try just looking at the blobs a bit later.
Great job guys! Looks like even this version of UMAP is really useful on top of PCA.
Note that those are the gradients of the loss encoded by hand in that SGD. James Melville has a nice piece on UMAP loss here.
As a side note I would highly recommend looking into datashader as a means to visualise the results rather than relying on KDEs.
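For reference, a minimal datashader sketch for a 2-d embedding could look like the following; the DataFrame construction and the "x"/"y" column names are assumptions, with umap_vectors standing in for the fit_transform output from earlier in the thread.

import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

df = pd.DataFrame(umap_vectors, columns=["x", "y"])  # 2-d embedding as a dataframe
canvas = ds.Canvas(plot_width=800, plot_height=800)  # raster canvas for the scatter
agg = canvas.points(df, "x", "y")                    # aggregate point counts per pixel
img = tf.shade(agg, how="log")                       # shade the aggregate; no KDE involved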
As a side note I would highly recommend looking into datashader as a means to visualise the results rather than relying on KDEs.
Yup, datashader turned out to be the most inspiring and underrated visualization tool I have ever seen. In the end I settled on this pipeline: 5000-dimensional CNN vectors => 15-dimensional PCA => 2-dimensional UMAP vectors => HDBSCAN for naive clustering.
The results are amazing:
LargeVis paper which this bears some similarity to. Note that those are the gradients of the loss encoded by hand in that SGD. James Melville has a nice piece on UMAP loss here.
Read the paper and the note; they look very interesting. Looks like I made a wrong assumption - since the task is unsupervised learning, autograd will not help, because we have no annotated data.
I guess, if so, then porting to GPU may give only a negligible performance boost. I hope to try porting your code to PyTorch directly when I find time.
Of note is the fact that you can embed into higher dimensional spaces than just 2 -- if your goal is clustering rather than visualization then using a higher dimensional embedding can give the process more degrees of freedom to more accurately represent the data. So, for example, embedding to 4 or 8 dimensions may allow for better clustering. It is, at least, worth experimenting with.
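If clustering is the goal, a minimal sketch of that idea might look like the following; the parameter values are illustrative rather than recommendations, and pca_vectors stands in for the input used earlier in the thread.

import umap
import hdbscan

# embed to 8 dimensions instead of 2 to give the optimisation more degrees of freedom
embedding = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=8).fit_transform(pca_vectors)
labels = hdbscan.HDBSCAN(min_cluster_size=100).fit_predict(embedding)  # cluster in the embedded space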
higher dimensional spaces than just 2
Of course; I just presented 2-dimensional results for ease of interpretation.
a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization
One nice thing about deep learning libraries is that they're declarative, and will do automatic differentiation. This only requires computing the loss for the input variables, then calling optimizer.step(). Surprisingly, PyTorch is faster than NumPy when hand-coding the gradients. I think this is because of the graph it defines, so it can be parallelized and the I/O is optimized. I walk through benchmarks/more motivation in a blog post.
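A minimal sketch of that pattern, with a toy quadratic loss standing in for the actual UMAP objective:

import torch

n_points, dim = 1000, 2
embedding = torch.randn(n_points, dim, requires_grad=True)  # coordinates to optimise
target_dist = torch.rand(n_points, n_points)                # placeholder pairwise targets

optimizer = torch.optim.Adam([embedding], lr=1e-2)
for step in range(100):
    optimizer.zero_grad()
    dist = torch.cdist(embedding, embedding)                # current pairwise distances
    loss = ((dist - target_dist) ** 2).mean()               # stand-in loss; swap in the real objective here
    loss.backward()                                         # autograd supplies the gradients
    optimizer.step()

Only the loss has to be written by hand; the gradients come from autograd, and running on a GPU amounts to creating the tensors on the 'cuda' device.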
Out-of-core computation (https://github.com/lmcinnes/umap/issues/62) + deep learning libraries would be easiest with a Dask parameter server (https://github.com/dask/dask-ml/issues/171).
@snakers4 Any luck porting to pytorch? Would be interested in this.
So the good news is that Nvidia will have a GPU accelerated implementation of UMAP in their Rapids platform soon, and I believe most of the theoretical hurdles to multi-core CPU implementation have been resolved, and I hope the 0.5 release of UMAP will include multi-core/threaded support. So while we might not quite be there yet, we are certainly getting closer.
So the good news is that Nvidia will have a GPU accelerated implementation of UMAP in their Rapids platform soon
Cool! Are you referencing the implementation at cuML/src/umap@993eb6? It looks like it got merged through https://github.com/rapidsai/cuml/pull/261.
will include multi-core/threaded support
This is through https://github.com/lmcinnes/pynndescent/pull/12, correct?
Yes, and yes. With the latter I have also come up with a decent multithreaded RP-tree implementation in numba (parallelising over trees, and also within trees, and ideally dask extensible as well). I thought more about the dask extension of lmcinnes/pynndescent#12 (https://github.com/lmcinnes/pynndescent/pull/12) and there are a few more hurdles to be overcome to make it support distributed arrays, but I believe Matt Rocklin suggested some potential solutions that can be put in place. The first thing I have to do is get 0.4 out the door so I can start working on these features for 0.5.
Hi, do you have any plan for a Barnes-Hut approximation using a quad/oct-tree for 2D/3D rather than an RP-tree? It would be great to know. Thank you for the awesome work!
@khaled-rahman: The RP-trees are in support of NN-descent for (approximate) nearest neighbor computation, for which Barnes-Hut is not applicable. Where t-SNE uses Barnes-Hut is in the optimization step; in contrast, UMAP uses a negative-sampling-based approximation there, giving an O(N) algorithm instead of the O(N log N) of Barnes-Hut.
@lmcinnes : Thank you for the quick response. I am studying your tool. What is the running time of the spectral embedding in your code? I see it has a dependency on sklearn.manifold, which says the following: https://scikit-learn.org/stable/modules/manifold.html
Though it is done only once, doesn't it depend on the number of data points? More specifically, if we exclude the OptimizeEmbedding step from Algorithm 1 (i.e. the UMAP algorithm), what is the asymptotic running time of Algorithm 1? Isn't it greater than O(n^2) in that case?
For Spark (PySpark) environments, maybe we could rely on the LSH-based Approximate Nearest Neighbor Search (https://spark.apache.org/docs/latest/ml-features#approximate-nearest-neighbor-search); however, it only supports Euclidean and Jaccard distances at the moment.
@lmcinnes Is there an easy way to swap the NN search algorithm (so we can easily plug in a different algorithm such as the one available in Spark)?
@candalfigomoro The NN search is not currently pluggable at the top API level, but it is largely factored out in the code. For now that means you could likely monkey-patch things and define a new umap.umap_.nearest_neighbors function that does what you want. Down the road I think it is most likely that it will adapt to make use of the KNeighborsTransformer API/approach in scikit-learn. At that point you would need to wrap whatever ANN solution you have in a suitable transformer. All of that is likely to happen in v0.5 when PyNNDescent becomes a required dependency and will be the default ANN search.
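A minimal monkey-patching sketch of that idea, under the assumption that fit() resolves nearest_neighbors from the umap.umap_ module namespace at call time; a real replacement would have to return its results in the same format as the original function.

import umap
import umap.umap_ as umap_module

_original_nearest_neighbors = umap_module.nearest_neighbors

def custom_nearest_neighbors(*args, **kwargs):
    # swap in your own ANN search here (e.g. results precomputed in Spark),
    # as long as the return format matches the original function
    return _original_nearest_neighbors(*args, **kwargs)

umap_module.nearest_neighbors = custom_nearest_neighbors

reducer = umap.UMAP(n_neighbors=15)  # fits that take the approximate-NN path should now use the patched function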
@lmcinnes Thank you for your reply. Another issue that I see is that, even if we implement a distributed NN search, UMAP currently expects a numpy array as input, so "fitting" out-of-memory datasets (e.g. HDFS distributed datasets with billions of rows) could be a problem. Maybe a workaround (although suboptimal) could be to "fit" UMAP on a random subsample of data and then apply a transform() on the remaining data in a massively parallel way. What do you think about this?
If you're able to load your data as Dask DataFrames / Arrays, then the "distributed transform" part is pretty well handled by https://ml.dask.org/modules/generated/dask_ml.wrappers.ParallelPostFit.html#dask_ml.wrappers.ParallelPostFit. You'd wrap your UMAP estimator in ParallelPostFit before calling fit. It doesn't affect training at all, just predict, transform, score, etc.
That said, distributed training would still be cool, just a bunch more work :)
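A minimal sketch of that wrapping, where X_train is assumed to be an in-memory sample and the Dask array is a stand-in for the full dataset:

import umap
import dask.array as da
from dask_ml.wrappers import ParallelPostFit

wrapped = ParallelPostFit(estimator=umap.UMAP(n_neighbors=15, n_components=2))
wrapped.fit(X_train)  # training happens locally, exactly as without the wrapper

X_big = da.random.random((1_000_000, 50), chunks=(100_000, 50))  # stand-in for the full dataset
embedding = wrapped.transform(X_big)  # lazy; evaluated block by block across the workers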
@TomAugspurger "Unfortunately", I'm in a Spark (with pyspark) cluster/environment... no Dask for me. My inputs are Spark DataFrames (or RDDs). I can use UDFs to apply transforms, but I don't think there's a straightforward way to support the "fit()" part on out-of-memory datasets.
@candalfigomoro I am currently using a Spark cluster (pyspark) too. As you do, I fit UMAP locally on the master node. For the transform I use rdd.map() and pass a partial function (functools) that already holds the trained UMAP model, so the latter gets shipped to the various nodes in the cluster and the transform is parallelized. Out of curiosity, how did you approach this? Thanks
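A minimal sketch of that pattern; local_sample and feature_rdd are assumed names, and mapPartitions is used here instead of map so that transform() is called once per partition rather than once per row.

from functools import partial
import numpy as np
import umap

reducer = umap.UMAP(n_components=2).fit(local_sample)  # fit on the driver / master node

def embed_partition(rows, model):
    rows = list(rows)
    if not rows:
        return iter([])
    return iter(model.transform(np.asarray(rows)).tolist())  # embed the whole partition at once

# the fitted model is captured in the partial and shipped to the executors
embedded_rdd = feature_rdd.mapPartitions(partial(embed_partition, model=reducer))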
@candalfigomoro @mik1904 we just recently added a very similar capability to the GPU-accelerated UMAP in RAPIDS cuML, which uses Dask.
https://github.com/rapidsai/cuml/blob/branch-0.14/python/cuml/dask/manifold/umap.py
So far, initial evaluations are showing some very positive results, especially as the inference dataset grows very large.
We are continuing to work on distributing the training as well; however, the embarrassingly parallel inference is proving to be very useful thus far.
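For anyone who wants to try the single-GPU version first, a minimal sketch might look like the following; the CSV filename is a placeholder, and the assumption is that cuML's UMAP mirrors the scikit-learn-style API of this package (the Dask module linked above adds the distributed transform on top of this).

import cudf
from cuml.manifold import UMAP as cuUMAP

gdf = cudf.read_csv("features.csv")               # placeholder input, loaded onto the GPU
reducer = cuUMAP(n_neighbors=15, n_components=2)  # GPU-accelerated UMAP
embedding = reducer.fit_transform(gdf)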
I am having trouble getting UMAP to use all available CPUs on a Google VM; at various stages I am only able to use 2-4 of the available 32 cores.
Is there anything one can do as of v0.4.x to increase CPU usage?
Installing the latest version of pynndescent should help some, but even then it will depend, to a degree, on the dataset itself.
@candalfigomoro I am currently using a Spark cluster (pyspark) too. As you do, I fit UMAP locally on the master node. For the transform I use rdd.map() and pass a partial function (functools) that already holds the trained UMAP model, so the latter gets shipped to the various nodes in the cluster and the transform is parallelized. Out of curiosity, how did you approach this? Thanks
I didn't do it, but I think I'd do something similar to that (maybe I'd use a Pandas UDF instead of rdd.map()).
For training on larger-than-memory datasets, maybe we could try something like this: 1) Use pyspark's LSH-based Approximate Nearest Neighbor Search for NN search, then 2) Use ParametricUMAP with mini-batches of data
But it's not trivial. @lmcinnes Do you think it would make sense?
I think that would make sense. I believe the UMAP implementation in CuML now has some level of support for multi-GPU / distributed computation, so that may also be an option worth investigating.
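To make step 1 a bit more concrete, a minimal sketch of Spark's LSH-based ANN search is below; the toy DataFrame, the "features" column and the parameter values are assumptions, and wiring the resulting neighbor lists into (Parametric)UMAP is exactly the non-trivial part discussed above.

from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(i, Vectors.dense([float(i), float(i % 7)])) for i in range(1000)],
    ["id", "features"],
)  # toy stand-in for the real dataset

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(df)

# approximate k nearest neighbours of a single query point (Euclidean distance only)
neighbours = model.approxNearestNeighbors(df, Vectors.dense([10.0, 3.0]), numNearestNeighbors=15)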
Sorry for joining the conversation like this, but has anyone here used any of the discussed implementations for multi-million-sized feature sets yet?
I have used this implementation for a (sparse) dataset of size 100,000,000 x 5,500,000 on a very large memory SMP using 64 cores. Obviously a dense dataset of that size would not be possible due to memory constraints, so you would need a distributed implementation to manage that. In terms of the CuML implementation: it has sparse support coming, so at that point it may scale to very large feature sizes.
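Passing a sparse matrix works the same way as passing a dense array; a minimal sketch, with a random SciPy CSR matrix standing in for a real sparse dataset (the low_memory flag is assumed to be available in recent releases):

import scipy.sparse as sp
import umap

X = sp.random(100_000, 50_000, density=1e-5, format="csr")  # stand-in for a real sparse dataset
embedding = umap.UMAP(n_neighbors=15, metric="cosine", low_memory=True).fit_transform(X)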
@lmcinnes As you may have guessed I have several CPUs and GPUs at hand and I work with high-dimensional data. Now I am benchmarking a 500k x 5k => 500k x 2 reduction vs. PCA (I need a high-level clustering to filter my data and feed it further down the pipeline).
So a couple of questions: