sahilm89 / lhsmdu

This is an implementation of Deutsch and Deutsch, "Latin hypercube sampling with multidimensional uniformity", Journal of Statistical Planning and Inference 142 (2012), 763–772
MIT License

Implementation has exponential runtime (unusable for real-world cases) #7

Closed · matthiaskoenig closed this 4 years ago

matthiaskoenig commented 4 years ago

The LHS implementation in this package has exponential runtime (depending on the number of samples). This makes the package unusable if one needs more than ~20 samples. I am posting this here as a warning for anyone who wants to use it with larger sample sizes, which will not work.

From my understanding of the LHS algorithm, this should scale as O(n) in the number of samples; a sketch of the standard linear-time construction follows.
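
For reference, classic LHS stratifies each dimension independently, which costs O(n·d) for n samples in d dimensions. A minimal sketch of that construction (my own illustration, not this package's LHSMDU algorithm):

import numpy as np

def classic_lhs(n_dims, n_samples, rng=None):
    """Classic LHS: one point per stratum of width 1/n in every dimension,
    with an independent random permutation of the strata per dimension."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.arange(n_samples) / n_samples                  # left edges of the strata
    samples = np.empty((n_dims, n_samples))
    for d in range(n_dims):
        jitter = rng.uniform(0, 1.0 / n_samples, n_samples)   # position within each stratum
        samples[d] = rng.permutation(edges + jitter)          # shuffle strata across samples
    return samples

The timing script I used to measure lhsmdu.sample: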

import time

import lhsmdu
import matplotlib.pyplot as plt
import pandas as pd


def lhsmdu_runtime():
    """Time lhsmdu.sample for sample sizes 1, 2, 4, ..., 256 and plot the results."""
    runtimes = []
    for k in range(0, 9):
        ts = time.time()
        samples = 2 ** k
        lhsmdu.sample(2, samples)  # LHS of two variables, `samples` samples each
        te = time.time()
        res = {"samples": samples, "time": te - ts}
        print(res)
        runtimes.append(res)

    df = pd.DataFrame(runtimes)
    fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(5, 10))
    for ax in axes:
        ax.plot(df.samples, df.time, "-o", markersize=10)
        ax.set_xlabel("sample size")
        ax.set_ylabel("runtime [s]")
        ax.grid(True)

    axes[1].set_xscale("log")
    axes[1].set_yscale("log")
    fig.savefig("lhsmdu_runtime.png")
    plt.show()


if __name__ == "__main__":
    # example1()
    lhsmdu_runtime()

[figure: lhsmdu_runtime.png — runtime vs. sample size, linear axes (top) and log-log axes (bottom)]

So it takes about 5 minutes to create 256 samples and around 1 hour for 500 samples. The current implementation is not usable for more than ~20 samples.

{'samples': 1, 'time': 0.0008461475372314453}
{'samples': 2, 'time': 0.0018656253814697266}
{'samples': 4, 'time': 0.007106781005859375}
{'samples': 8, 'time': 0.030531644821166992}
{'samples': 16, 'time': 0.1478424072265625}
{'samples': 32, 'time': 0.8095762729644775}
{'samples': 64, 'time': 5.116601467132568}
{'samples': 128, 'time': 36.78028750419617}
{'samples': 256, 'time': 283.7775731086731}

Process finished with exit code 0
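
Fitting these timings to a power law t ≈ c·n^p on log-log axes gives an empirical exponent somewhere between 2 and 3, i.e. polynomial rather than linear scaling (a quick sketch; the numbers are copied from the log above):

import numpy as np

# Measured runtimes from the log above (skipping the n=1 point).
n = np.array([2, 4, 8, 16, 32, 64, 128, 256])
t = np.array([0.00187, 0.00711, 0.0305, 0.148, 0.810, 5.12, 36.8, 284.0])

# Linear fit in log-log space: log t = p * log n + log c.
p, log_c = np.polyfit(np.log(n), np.log(t), 1)
print(f"t ~ {np.exp(log_c):.2e} * n**{p:.2f}")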
sahilm89 commented 4 years ago

Dear Matthias,

Thank you for your analysis. Indeed, the code was taking too long to complete for large samples. I had been trying to find some time to profile and refactor the code. This is now done. Attached are new plots. I hope the new times make this library useful for you.

As you will notice, the timings have improved quite dramatically: 500 samples now run in about 1 minute. However, the runtime remains super-linear in the number of samples, and I'm afraid that is inherent to this algorithm. LHSMDU relies on finding pairwise distances between Monte Carlo sampled "realizations" in order to form the "strata". This step is ~O(n^2), so runtime grows as a power law in the number of samples rather than linearly. On the other hand, the algorithm scales well with an increasing number of dimensions; I am posting a bit more detail about this in the other issues. Having said that, I can now extract 500 samples from lhsmdu in about a minute and 1000 samples in around 9 minutes, which is hopefully respectable for many real-world cases. Please let me know your feedback. I will upload the refactored code tonight.
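
A minimal sketch of why that step dominates, using scipy.spatial.distance.pdist (this only illustrates the cost, it is not the package's actual code; the 5x oversampling factor is an assumption for illustration):

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for n in (64, 128, 256):
    # LHSMDU first draws a pool of Monte Carlo realizations (some multiple
    # of the requested sample count) and then prunes it using
    # nearest-neighbour distances within the pool.
    pool = rng.uniform(size=(5 * n, 2))  # 5x oversampling: assumed, for illustration
    d = pdist(pool)                      # m*(m-1)/2 distances for m points: O(n^2)
    print(n, d.size)

Doubling the requested sample count roughly quadruples the number of distances, which matches the super-linear growth in the plot.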

Cheers, S

[figure: lhsmdu_runtime — runtime vs. sample size after the refactor]

matthiaskoenig commented 4 years ago

@sahilm89 Thanks so much. This solves my issue.

I also tried https://pythonhosted.org/pyDOE/randomized.html. The pyDOE implementation is much faster, but I have the feeling that the different dimensions are not optimized jointly: the samples are Latin hypercube stratified within each dimension, but uniformity across multiple dimensions at the same time is not enforced.
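
One rough way to quantify this feeling is to compare the minimum pairwise distance of the generated points, a crude proxy for multidimensional space-filling (larger is better). A sketch, assuming lhsmdu.sample returns a dimensions x samples matrix and pyDOE's lhs returns samples x dimensions:

import numpy as np
from scipy.spatial.distance import pdist
import lhsmdu
from pyDOE import lhs

n = 20
pts_mdu = np.asarray(lhsmdu.sample(2, n)).T  # transpose to samples x dimensions
pts_doe = lhs(2, samples=n)
print("lhsmdu min pairwise distance:", pdist(pts_mdu).min())
print("pyDOE  min pairwise distance:", pdist(pts_doe).min())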

sahilm89 commented 4 years ago

I'm glad this helped you. In case you end up using the package, I'll be really glad if you could cite it! :)

Cheers, S