raphael-group / paste2

Probabilistic Alignment of Spatial Transcriptomics Experiments v.2

ValueError on large datasets #2

ker2xu closed this issue 8 months ago

ker2xu commented 9 months ago

The error is raised on line 159 of model_selection.py when evaluating m=0.99:

```python
pi, log = partial_pairwise_align_given_cost_matrix(sliceA, sliceB, s=m, M=M, alpha=alpha, armijo=False, norm=True, return_obj=True, verbose=False)
```

```
ValueError: Error in the EMD resolution: try to increase the number of dummy points
```

How can I increase the number? What are dummy points, and which variable denotes them?

x-h-liu commented 9 months ago

If you rerun the program, does the error persist? It might just be a numerical issue due to randomness.

If the error persists, there is a variable nb_dummies on line 112 of PASTE2.py. Try increasing it from 1 and see what happens. I remember seeing this error before, but restarting the program solved the problem for me.
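
For context, this error message comes from the POT library's partial optimal transport solver, which is what the partial alignment builds on; the dummy points are extra rows/columns that absorb the untransported mass when only a fraction m of the mass is moved. A minimal toy sketch of the nb_dummies knob (made-up data, not PASTE2's actual call):

```python
import numpy as np
import ot  # POT: pip install pot
from ot.partial import partial_wasserstein

rng = np.random.default_rng(0)
xs = rng.normal(size=(50, 2))   # toy "slice A" coordinates
xt = rng.normal(size=(60, 2))   # toy "slice B" coordinates

a = np.ones(50) / 50            # uniform weights on slice A spots
b = np.ones(60) / 60            # uniform weights on slice B spots
M = ot.dist(xs, xt)             # pairwise cost matrix

# m is the fraction of mass to transport (analogous to PASTE2's
# overlap parameter s). nb_dummies sets how many dummy points absorb
# the remaining 1 - m mass; raising it can help when the EMD solver
# fails with "Error in the EMD resolution".
pi = partial_wasserstein(a, b, M, m=0.9, nb_dummies=5)
print(pi.shape, pi.sum())       # (50, 60), total transported mass ~0.9
```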

ker2xu commented 9 months ago

I tried increasing nb_dummies from 1 to 2, 4, 8, 16, and 32. None of them worked, and with 16 or 32 the Jupyter notebook kernel would even crash. I was trying to combine "E16.5_E2S1.MOSTA" and "E16.5_E2S2.MOSTA" from https://db.cngb.org/stomics/mosta/. What should be done to fix this?

```python
PASTE2.partial_pairwise_align(sliceA, sliceB, s=s)
```

```
PASTE2 starts...
Starting GLM-PCA...
Iteration: 0 | deviance=1.6926E+8
Iteration: 1 | deviance=1.6926E+8
Iteration: 2 | deviance=1.6079E+8
Iteration: 3 | deviance=1.5757E+8
Iteration: 4 | deviance=1.5632E+8
Iteration: 5 | deviance=1.5563E+8
Iteration: 6 | deviance=1.5519E+8
Iteration: 7 | deviance=1.5488E+8
Iteration: 8 | deviance=1.5464E+8
Iteration: 9 | deviance=1.5445E+8
Iteration: 10 | deviance=1.5430E+8
Iteration: 11 | deviance=1.5417E+8
Iteration: 12 | deviance=1.5407E+8
Iteration: 13 | deviance=1.5397E+8
Iteration: 14 | deviance=1.5389E+8
Iteration: 15 | deviance=1.5382E+8
Iteration: 16 | deviance=1.5375E+8
Iteration: 17 | deviance=1.5369E+8
Iteration: 18 | deviance=1.5364E+8
Iteration: 19 | deviance=1.5359E+8
Iteration: 20 | deviance=1.5354E+8
Iteration: 21 | deviance=1.5350E+8
Iteration: 22 | deviance=1.5346E+8
Iteration: 23 | deviance=1.5342E+8
Iteration: 24 | deviance=1.5339E+8
Iteration: 25 | deviance=1.5335E+8
Iteration: 26 | deviance=1.5332E+8
Iteration: 27 | deviance=1.5329E+8
Iteration: 28 | deviance=1.5327E+8
Iteration: 29 | deviance=1.5324E+8
Iteration: 30 | deviance=1.5322E+8
Iteration: 31 | deviance=1.5319E+8
Iteration: 32 | deviance=1.5317E+8
Iteration: 33 | deviance=1.5315E+8
Iteration: 34 | deviance=1.5313E+8
Iteration: 35 | deviance=1.5311E+8
Iteration: 36 | deviance=1.5309E+8
Iteration: 37 | deviance=1.5308E+8
Iteration: 38 | deviance=1.5306E+8
Iteration: 39 | deviance=1.5304E+8
GLM-PCA finished.
It.  |Loss        |Relative loss|Absolute loss
    0|5.239569e+03|0.000000e+00|0.000000e+00

RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 1000000. Sometimes iterations go on in
cycle even though the solution has been reached, to check if it's the case here
have a look at the minimal reduced cost. If it is very close to machine precision,
you might actually have the correct solution, if not try setting the maximum
number of iterations a bit higher
```

x-h-liu commented 8 months ago

When you increase nb_dummies, do you still see "ValueError: Error in the EMD resolution: try to increase the number of dummy points", or just "RESULT MIGHT BE INACURATE"? The latter is a warning rather than an error, so you can either ignore it and let the program continue to run, or you can increase numItermax on line 140 of PASTE2.py.
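
For illustration, numItermax is the iteration cap of POT's network-simplex EMD solver, which prints that warning when the cap is hit. A toy sketch of raising it (values illustrative, not PASTE2's actual call site):

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
a = np.ones(100) / 100
b = np.ones(100) / 100
M = ot.dist(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))

# The default cap is 100000 simplex iterations; raising it lets the
# solver keep going instead of stopping with "RESULT MIGHT BE INACURATE".
pi = ot.emd(a, b, M, numItermax=10_000_000)
```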

ker2xu commented 8 months ago

However, the Jupyter kernel crashes after the output above is displayed, so I cannot continue running the program.

x-h-liu commented 8 months ago

This is because the data matrices are too large. E2S1 has ~60,000 cells and E2S2 has ~70,000 cells, so the alignment matrix has more than 60,000 * 70,000 entries. Stored densely, that is far more than typical workstation memory, which is why the kernel crashes. For now, the only optimal transport method that scales to data of this size is moscot. It offers a low-rank representation of the alignment matrix, which reduces memory. If you want to align slices this large, you can check out their method.
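
A back-of-the-envelope estimate (approximate spot counts, dense float64 storage assumed):

```python
# Dense alignment matrix between the two MOSTA slices:
n_a, n_b = 60_000, 70_000      # approximate cell counts for E2S1 and E2S2
bytes_per_entry = 8            # float64

gigabytes = n_a * n_b * bytes_per_entry / 1e9
print(f"{gigabytes:.1f} GB")   # ~33.6 GB for the plan alone; a cost
                               # matrix of the same shape doubles that
```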

ker2xu commented 8 months ago

Thanks! It is a pity that paste2 cannot be applied to large datasets.