sahilm89 / lhsmdu

This is an implementation of Deutsch and Deutsch, "Latin hypercube sampling with multidimensional uniformity", Journal of Statistical Planning and Inference 142 (2012), 763–772.
MIT License

lhsmdu crashes for large samples (500, 1000) and/or many variables (10, 50, 100) #5

Closed MatthiVH closed 4 years ago

MatthiVH commented 4 years ago

Hi there,

I was wondering what causes the lhsmdu function to crash sometimes for larger samples and/or multiple variables. Other times it works but takes a long time (e.g., 2 variables, 500 samples takes about 20 minutes). See: https://stackoverflow.com/questions/26137195/latin-hypercube-sampling-with-python/59049597#59049597
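For reference, the kind of call I'm timing looks roughly like this (a minimal sketch, assuming the `lhsmdu.sample(numDimensions, numSamples)` interface from the README; the sizes are the ones reported above):

```python
# Time a moderately sized LHS-MDU draw: 2 variables, 500 samples each.
import time
import lhsmdu

start = time.time()
samples = lhsmdu.sample(2, 500)  # assumed signature: sample(numDimensions, numSamples)
print(f"shape={samples.shape}, took {time.time() - start:.1f} s")
```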

Kind regards

sahilm89 commented 4 years ago

Hi @MatthiVH I've been meaning to profile the code and refactor if necessary. I can work on it over this weekend and see if some memory can be saved or timing improved.

sahilm89 commented 4 years ago

Hi @MatthiVH

Okay, I finally found some time and refactored the code. It runs much faster now: I could get 500 samples in about 1 minute and 1000 samples in around 9 minutes. As you can see in the plots below, the time taken to sample still grows steeply with the number of samples (roughly quadratically, by the nature of the algorithm itself), but it doesn't change much with the number of dimensions sampled from. Please check out the most recent update and let me know if this is useful!

[Plots: runtime vs. number of samples for dims = 2, 4, 8, 16, 32]

Best, S

MatthiVH commented 4 years ago

Hi S.,

This is a very big improvement and much more useful for real cases now, this is great! I'm only wondering why the time goes up with the number of samples but not with the number of dimensions. Sampling a Latin hypercube in 2 dimensions is just a square, so that's fairly easy I guess, but with more variables there are more dimensions, and then it becomes quite complex and time-consuming, I would think? Your results look cool, though; I'm just puzzled that the time doesn't increase with the number of dimensions.

Kind regards, Matthias

sahilm89 commented 4 years ago

Hi Matthias,

The reason is that the algorithm involves an elimination process: you generate a surplus of random points and drop the ones that lie close together to form the strata. This requires an exhaustive pairwise distance comparison (n choose 2, which is about n^2), where n is an integer multiple of the number of samples required. Thus the algorithm is roughly O(n^2) in the number of samples. Larger dimensions do lead to larger sampling matrices, but firstly, the complexity increases only linearly with the number of dimensions (for the same number of samples), and secondly, the matrix operations are much faster than explicit loops. So even for a moderate number of samples and a realistic number of dimensions, the runtime is completely dominated by the number of samples.
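For intuition, here is a rough sketch of that elimination idea (illustrative only: the function and parameter names are made up for this example, the package's actual implementation differs, and the final re-stratification step of the algorithm is omitted):

```python
# Sketch of the "drop nearby candidates" step that dominates the runtime.
import numpy as np

def eliminate_nearby(candidates, n_keep):
    """Repeatedly drop the candidate that sits closest to its neighbours
    until only n_keep well-spread points remain.

    candidates : array of shape (n_points, n_dims)
    """
    pts = list(range(candidates.shape[0]))
    # Pairwise distances: the ~O(n^2) part that dominates the runtime.
    diff = candidates[:, None, :] - candidates[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    while len(pts) > n_keep:
        sub = dist[np.ix_(pts, pts)]
        # Average distance to the two nearest neighbours, per point.
        nearest = np.sort(sub, axis=1)[:, :2].mean(axis=1)
        # Drop the most crowded point.
        pts.pop(int(np.argmin(nearest)))
    return candidates[pts]

rng = np.random.default_rng(0)
n_samples, n_dims, scale = 20, 2, 5            # scale * n_samples candidate points
candidates = rng.uniform(size=(scale * n_samples, n_dims))
kept = eliminate_nearby(candidates, n_samples)
print(kept.shape)                              # (20, 2)
```

Note that the dimension only enters through the distance computation, which is vectorized, while the elimination loop scales with the number of candidate points.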

The cool thing, though, is that even when the number of samples is low, LHS produces random numbers that are much better estimates of the true distribution. I'm adding some examples comparing NumPy random uniform sampling with LHS for a range of sample sizes.
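A small comparison in that spirit (a sketch, assuming `lhsmdu.sample(numDimensions, numSamples)` as in the README; the error metric here is just the deviation of the sample mean from the true mean of U(0, 1)):

```python
# Compare plain Monte Carlo draws with LHS-MDU draws for a few sample sizes.
import numpy as np
import lhsmdu

rng = np.random.default_rng(42)
for n in (10, 50, 100):
    mc = rng.uniform(size=n)                    # plain uniform draws
    lhs = np.asarray(lhsmdu.sample(1, n))[0]    # 1-D LHS-MDU sample
    print(f"n={n:4d}  MC mean error={abs(mc.mean() - 0.5):.4f}  "
          f"LHS mean error={abs(lhs.mean() - 0.5):.4f}")
```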

I hope this clarifies the doubt! Cheers, Sahil

sahilm89 commented 4 years ago

Hi M,

Let me know if you have other questions. Otherwise, shall I close this issue?

Cheers, S

MatthiVH commented 4 years ago

Hi S.,

No, this issue is clarified. If I have other questions while working with the package, I'll just open a new issue. Thanks!

Cheers, M

sahilm89 commented 4 years ago

Cool, closing issue. Also, in case you use the repo for your work, I'd be very glad if you cite it :) Cheers, S

MatthiVH commented 4 years ago

No problem, I will do that. :)