Optimize string repetition and string copying

advikkabra commented 6 months ago

Fixes #2616

The current string repetition algorithm was inefficient, where memset could be used for single char strings, and binary exponentiation could be used for multi char strings. This is faster than copying one byte at a time.
_lfortran_strcpy was calling strlen in every loop iteration, which was slowing down the program significantly.

After this change, a 1e6 repetition as shown in the issue is 0.04s, which is comparable to CPython.

kmr-srbh commented 6 months ago

Advik, I recognize and appreciate the effort you have put in to find and resolve this issue. I fear, we have increased the time instead.

N = 10 ⁶

string1: str = "a"
string2: str = string1 * 10**6

(base) saurabh-kumar@Awadh:~/Projects/System/lpython$ time ./src/bin/lpython ./examples/example.py
^C
real    1m43.633s
user    1m43.501s
sys     0m0.048s

The operation took around 30s before. I request you to look into it.

kmr-srbh commented 6 months ago

I think you might have mistakenly tested the above using CPython.

advikkabra commented 6 months ago

It works fine on my machine, and after profiling more than 99% of the time used was in the strlen calls in the strcpy function.

advik@advik-Inspiron-7560:~/dev/oss/lpython$ time src/bin/lpython ../test.py
1000000

real    0m0.082s
user    0m0.058s
sys     0m0.020s
advik@advik-Inspiron-7560:~/dev/oss/lpython$ cat ../test.py
string1: str = "a";
string2: str = string1 * 10**6
print(len(string2))

kmr-srbh commented 6 months ago

Great work Advik! It works properly on my machine now too! That's an outright 300x improvement! :clap:

(lp) saurabh-kumar@Awadh:~/Projects/System/lpython$ time ./src/bin/lpython ./examples/example.py

real    0m0.110s
user    0m0.071s
sys     0m0.039s

The reason it did not work the previous time was because I was on an older branch. The new changes from LFortran have done an amazing job with increasing the speed.

Shaikh-Ubaid commented 6 months ago

It does not fail when s_len == 1, but memset is very optimized for this usecase.

Did you trying checking this if the else works for s_len == 1? It is fine to have a separate case for s_len == 1, but I just want to be sure that the else works perfectly (for example it works for odd numbers like 1).

Shaikh-Ubaid commented 6 months ago

Let's merge this after we are sure that the else case works for odd numbers like 1.

advikkabra commented 6 months ago

This test is already there in test_str_01.py in integration_tests.

def test_str_repeat():
    a: str
    a = "Xyz"
    assert a*3 == "XyzXyzXyz"
    assert a*2*3 == "XyzXyzXyzXyzXyzXyz"
    assert 3*a*3 == "XyzXyzXyzXyzXyzXyzXyzXyzXyz"
    assert a*-1 == ""

Shaikh-Ubaid commented 6 months ago

This test is already there in test_str_01.py in integration_tests.

Do we have a test with a large n like (10^6 or so)? If so, then it is good.

          I think we should add a test for this with a large `n` (so that we are sure that the time taken does not break in future). We can do it in a separate PR.

Originally posted by @Shaikh-Ubaid in https://github.com/lcompilers/lpython/pull/2634#pullrequestreview-1975689401

advikkabra commented 6 months ago

How would you suggest I add this as a test? Is there a way to check if something is taking too long?

Shaikh-Ubaid commented 6 months ago

How would you suggest I add this as a test? Is there a way to check if something is taking too long?

I think not yet in LPython (for LFortran we have cpu_time). For now, by adding a test we are just trying to ensure that the test is not taking too long.

Previously (as of the main branch) the time complexity of _lfortran_strcpy() was O(n^2). That means, for n = 10^6 the time taken could be in orders of n^12, which would never practically complete at the CI.

If we add a test for n = 10^6 and the test completes in a finite time (I am assuming it will take < 1s. If it take more than 1s then it is not much helpful and we should reduce n), then it verifies/ensures that the time complexity of _lfortran_strcpy() has improved.

Shaikh-Ubaid commented 6 months ago

For a robust test, we should support something like time.time(). But it should not be part of this PR and can be done later.

Once we have support for time.time(), we can later come back to the above n = 10^6 test case and update it to add a check for the time taken.

advikkabra commented 6 months ago

I have added a test for a 10**6 repetition. This would take >1 minute in the previous code, so it would be easy to catch if it goes wrong. Right now it completes it in 20ms on my machine, so it is comfortable with the new implementation. I think the PR is ready now.

lcompilers / lpython