lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.37k stars 799 forks

UMAP: shuffling samples leads to quite different results #268

Open YaqiangCao opened 5 years ago

YaqiangCao commented 5 years ago

Dear authors, thanks for the very useful UMAP and the nice documentation. I have samples from single-cell data and processed them into a binary matrix (M, N), where M is the number of cells and N is the number of features. There are three different kinds of cells in the groups, originally sorted by labels like A_1, A_2, A_3, ..., B_1, B_2, B_3, ..., C_1, C_2, C_3, ... I first fed the matrix to UMAP with parameters umap.UMAP(n_neighbors=30, n_components=2, metric="manhattan", random_state=123, n_epochs=500).fit_transform(mat). UMAP indeed works very well: the cells are separated into the three expected groups. However, when by chance I shuffled the cell order in the (M, N) matrix, the projection result was a total mess with the same parameters. So I wonder if there is a key parameter to control this UMAP behavior. Thanks very much! Yaqiang Cao
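Worth noting: reordering the rows of a matrix only permutes the pairwise distances and cannot change which cells are each cell's nearest neighbors, so the exact neighbor graph is unaffected by shuffling. A small numpy sketch illustrating this on toy data (continuous values are used instead of a binary matrix to avoid distance ties):

```python
import numpy as np

rng = np.random.default_rng(123)
X = rng.random((30, 10))  # toy data standing in for the (M, N) cell matrix

def manhattan_knn(A, k=3):
    """Indices of each row's k nearest neighbors under the Manhattan metric."""
    D = np.abs(A[:, None, :] - A[None, :, :]).sum(axis=-1)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

perm = rng.permutation(len(X))
nn_orig = manhattan_knn(X)
nn_shuffled = manhattan_knn(X[perm])

# Mapping shuffled neighbor indices back to original row ids recovers
# exactly the neighbors computed on the unshuffled matrix.
assert np.array_equal(perm[nn_shuffled], nn_orig[perm])
```

So any large qualitative change after shuffling has to come from the approximate (randomized) stages of UMAP, not from the underlying neighbor structure.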

lmcinnes commented 5 years ago

Shuffling can potentially have some impact, since it will vary how the approximate nearest neighbor search (which has some randomness) and the embedding optimization (which, likewise, has some randomness) perform, but it should not significantly change the overall qualitative results. Is it possible something else has gone astray?

YaqiangCao commented 5 years ago

Thanks for the reply. I just shuffled the samples; all other parameters are exactly the same: umap.UMAP(n_neighbors=30, n_components=2, metric="manhattan", random_state=123, n_epochs=500).fit_transform(mat).


lmcinnes commented 5 years ago

Can you share the data, and ideally the code you used?

YaqiangCao commented 5 years ago

Hi Lmcinnes, Can I email you the code and data ? My email is caoyaqiang0410@gmail.com. The data is a little big. Best, Yaqiang


YaqiangCao commented 5 years ago


Dear author, I have prepared the data and a Jupyter notebook on Dropbox. Please drop me an email at caoyaqiang0410@gmail.com and I will share the data and code with you through the links. Thank you for the help. Best, Yaqiang

dewshr commented 4 years ago

@lmcinnes, I am also having a similar problem: subsetting the dataset gives a different result.

subset_data = pd.read_csv('subset_data.csv', index_col='name')
whole_data = pd.read_csv('whole_data.csv', index_col='name')

reducer = umap.UMAP(n_components=5, random_state=0, transform_seed=0)
reducer.fit(subset_data)

subset_data_transform = reducer.transform(subset_data)
whole_data_transform = reducer.transform(whole_data)

head(subset_data_transform)
name,0,1,2,3,4
var_1,-1.0818049,2.1702583,-0.7495309,-1.8568178,-1.0938863
var_2,-0.90896183,4.6964855,-0.5700368,-0.27183083,-0.59852684
var_3,-0.83147764,4.339603,0.06427337,-0.43971997,-1.0240343
var_4,-1.7820284,3.5268242,0.25737917,-1.3918203,-0.60318255
var_5,-0.0651891,2.9304326,-0.3610509,-1.4453144,-0.20545875
var_6,-1.1370527,3.421422,0.23815191,-0.5848139,-1.305079
var_7,-0.9714,2.1284223,-0.4607454,-1.8695027,-0.54702574
var_8,-1.5108252,3.919666,-0.3120325,-0.86461705,0.2995668
var_9,-1.2268223,4.120813,-1.1871228,-0.58820873,-0.15397732

head(whole_data_transform)
var_1,-1.3955795,1.7844456,-0.3540998,-1.5740595,-0.42212838
var_2,-0.6275107,5.201565,-0.7708858,-0.2549725,-0.020859836
var_3,-1.362998,4.3281274,0.8444104,-0.46012902,-0.3926726
var_4,-1.8201163,3.9629812,0.62708724,-1.1925513,0.035009027
var_5,-0.20786422,2.0100298,0.17233662,-0.9912149,-0.612357
var_6,-1.160065,3.8599873,0.5915414,0.08912127,-0.864586
var_7,-1.6482177,3.897427,-0.35188386,-1.8948933,0.69750917
var_8,-1.1465372,4.164144,-0.6294845,-1.0734795,1.0632212
var_9,-1.1563764,4.2450128,-0.48862788,0.4416442,-0.10102977

test_data.zip
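One way to make "different result" concrete is to measure the per-row displacement between the two transforms for the rows they share. A small numpy sketch using the nine rows listed in the two heads above:

```python
import numpy as np

# Coordinates transcribed from the two heads above (rows var_1..var_9).
subset = np.array([
    [-1.0818049, 2.1702583, -0.7495309, -1.8568178, -1.0938863],
    [-0.90896183, 4.6964855, -0.5700368, -0.27183083, -0.59852684],
    [-0.83147764, 4.339603, 0.06427337, -0.43971997, -1.0240343],
    [-1.7820284, 3.5268242, 0.25737917, -1.3918203, -0.60318255],
    [-0.0651891, 2.9304326, -0.3610509, -1.4453144, -0.20545875],
    [-1.1370527, 3.421422, 0.23815191, -0.5848139, -1.305079],
    [-0.9714, 2.1284223, -0.4607454, -1.8695027, -0.54702574],
    [-1.5108252, 3.919666, -0.3120325, -0.86461705, 0.2995668],
    [-1.2268223, 4.120813, -1.1871228, -0.58820873, -0.15397732],
])
whole = np.array([
    [-1.3955795, 1.7844456, -0.3540998, -1.5740595, -0.42212838],
    [-0.6275107, 5.201565, -0.7708858, -0.2549725, -0.020859836],
    [-1.362998, 4.3281274, 0.8444104, -0.46012902, -0.3926726],
    [-1.8201163, 3.9629812, 0.62708724, -1.1925513, 0.035009027],
    [-0.20786422, 2.0100298, 0.17233662, -0.9912149, -0.612357],
    [-1.160065, 3.8599873, 0.5915414, 0.08912127, -0.864586],
    [-1.6482177, 3.897427, -0.35188386, -1.8948933, 0.69750917],
    [-1.1465372, 4.164144, -0.6294845, -1.0734795, 1.0632212],
    [-1.1563764, 4.2450128, -0.48862788, 0.4416442, -0.10102977],
])

drift = np.linalg.norm(subset - whole, axis=1)  # Euclidean shift per row
print(np.round(drift, 3))
```

The drifts are all nonzero but modest relative to the overall spread of the embedding, which is consistent with stochastic jitter rather than a wholesale rearrangement.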

lmcinnes commented 4 years ago

There is going to be some stochasticity; it may be that the dataset is simply small enough that the stochasticity is quite large.

nico-ebi commented 3 years ago

It is often the case in biological datasets (and I assume in other fields as well) that points are ordered by category. I would strongly recommend shuffling the point order by default (which should have no impact on the performance of the algorithm), as I have noticed a significant bias in some cases, as the OP pointed out, and it can be strongly misleading (e.g. when projecting points using a subset of dimensions that may not carry any particular signal).
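Such a shuffle is easy to do by hand before fitting; the only bookkeeping needed is inverting the permutation afterwards so the embedding rows line up with the original labels again. A minimal numpy sketch (the UMAP call is commented out and a placeholder stands in for the embedding so the bookkeeping itself is runnable):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # stand-in for the data matrix

perm = rng.permutation(X.shape[0])
X_shuffled = X[perm]  # break up any category ordering before fitting

# embedding_shuffled = umap.UMAP(random_state=0).fit_transform(X_shuffled)
embedding_shuffled = X_shuffled  # placeholder for the real embedding

inverse = np.argsort(perm)               # inverse permutation
embedding = embedding_shuffled[inverse]  # rows back in the original order
assert np.array_equal(embedding, X)
```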

EricKenjiLee commented 3 years ago

I've done a bit of experimenting and was able to isolate a lot of the stochasticity to the graph construction steps (setting a random seed makes the graph layout procedure fully deterministic, as noted in the docs). Trying different parameters, I increased negative_sample_rate to about 15 from the default of 5, and that seemed to cut down on the variability quite a bit. I'm applying graph clustering (Louvain) to the network, and for my dataset the default negative sampling value results in different clusters merging under different permutations of the data. With the larger value (along with more epochs [5000 from 500] and a lower learning rate [0.25 from 1], as suggested in #158), I've cut the variability in clustering down to levels acceptable for my purposes. In addition, it seems to make the algorithm more stable even when different random subsets of my full dataset are passed to UMAP.

Very important: runtimes become much, much longer with my parameters, and you also have to set random states everywhere (I have them set for Python's random, NumPy's random, Python's os, for UMAP itself, and for my graph clustering). I have yet to try unsetting the random seeds for SGD to get some performance back; since I cluster on the high-dimensional graph before the layout step, this doesn't affect my results, but it definitely matters if you're looking at the projected space.
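For reference, that kind of "seed everything" setup might look like the sketch below. The UMAP parameter values are the ones mentioned in this comment; whether they suit a different dataset is an open question, and the UMAP call itself is left commented since the seeding lines are the point:

```python
import os
import random

import numpy as np

SEED = 123
# Note: PYTHONHASHSEED only fully takes effect if set before the
# interpreter starts; setting it here documents the intent.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)

# reducer = umap.UMAP(
#     n_neighbors=30,
#     negative_sample_rate=15,  # up from the default of 5
#     n_epochs=5000,            # up from 500
#     learning_rate=0.25,       # down from 1.0
#     random_state=SEED,
#     transform_seed=SEED,
# )
# embedding = reducer.fit_transform(X)
```

A seed would also need to be passed to whatever Louvain implementation is used for the clustering step.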

As Leland has said, there's a tradeoff between performance and determinism; for my purposes the slowness was acceptable, but for most others it probably isn't. Of course, an increased number of samples might help UMAP find structure, making the clustering less variable. I haven't had time to really explore this, but it might be something to try that I haven't seen anyone mention before.