Closed codito closed 6 years ago
Are we sure that this is really the root cause? I think it may have to do with some dependencies. A while back, when pytest shipped a new version with a lot of new dependencies, our tests started running slower, leading to timeouts or InvocationErrors in some cases. I fixed it by rolling pytest back to 3.2.5, which was the last known-good version. When we added tests for the rest_api, we had to add flask_restplus and flask_cors to the dependencies, and it looks like the slowdown and the InvocationErrors are back. Your recent test builds seem to confirm that both go away if we remove those dependencies. I'm not sure how to debug the problem; does this approach seem worthwhile?
The slowness in CI (~24 mins to ~48 mins) goes away once we remove the flask-cors and flask-restplus dependencies. The root cause is not known yet. Is it okay to remove the dependencies until we investigate this further? Debug builds may help us; they also publish a docker image we can try out.
The profile numbers shared above show up without these dependencies as well (tried this on a fast desktop: Linux/AMD/x64/12 cores/16 GB RAM). Can you please check if it reproduces for you?
@codito How did you get the test to finish in ~195 seconds? (Did you reduce the number of tests?) On my laptop, the tests take about 20 mins to run, and that looks like the best case on Travis as well.
I did a Ctrl+C midway; the profiler is wise enough to emit the data it has collected so far.
That might be why transliteration rises to the top. All the transliteration is done at the beginning, while creating the reference data. If you had let it run to completion, it would have been 35s out of 20 min, which is about 2%. I don't know if we should focus on that right now; I think the bulk of the time is being spent elsewhere.
When debugging the failures after switching to pytest 3.3.x, I had noticed that the 20 slowest tests reported by `pytest --durations=20` were all just basic tests, but their setup and call times had increased significantly from pytest 3.2 to 3.3. I will try to collect the data again, but the only change in that instance was the pytest version (and its associated dependencies).
Is it possible to create a separate test env in tox.ini that will run only test_rest_api, with the flask_cors & flask_restplus dependencies specified directly in the tox config? This could be a stop-gap measure while we investigate the problem.
A separate env for the rest api tests is a great idea. Will try it out.
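A minimal sketch of what that split could look like. All env names, pins, and paths here are illustrative assumptions, not what the repo actually uses:

```ini
# Hypothetical tox.ini fragment: a dedicated env so only the REST API
# tests pull in flask-restplus and flask-cors.
[testenv]
deps =
    pytest==3.2.5
commands =
    pytest tests --ignore=tests/test_rest_api.py

[testenv:rest_api]
deps =
    {[testenv]deps}
    flask-restplus
    flask-cors
commands =
    pytest tests/test_rest_api.py
```

This keeps the main env's dependency set unchanged, so any slowdown introduced by the two flask packages is confined to the `rest_api` env.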
Re slow performance, transliteration seems to be happening during test execution. Tried a few experiments (on py3.6).

`pytest tests/test_SanskritLexicalAnalyzer.py --collect-only` doesn't show traces of transliteration calls. Duration is ~5.52s.

Command: `pytest tests/test_SanskritLexicalAnalyzer.py --test-count x`
Tests count (x) | Total execution time | Transliteration | Graph traversal
---|---|---|---
5003 | 501.93s | 173.96s [34.65%] | 80.44s [16.02%] |
1003 | 79.54s | 25.44s [31.98%] | 12.381s [15.56%] |
103 | 22.54s | 5.69s [25.24%] | 4.021s [17.84%] |
13 | 18.35s | 4.08s [22.23%] | 3.739s [20.38%] |
Transliteration stack frames:

```
indic_transliteration/sanscript.py:140(__init__)
indic_transliteration/sanscript.py:150(<genexpr>)
```

Graph traversal stack frames:

```
networkx/algorithms/simple_paths.py:228(shortest_simple_paths)
```
This is quite surprising. I need to drill into the code more, but transcoding shouldn't take up this much time. I suspect it's all the transcoding done in maheshvara_sutras and sandhi. We could avoid it by using the canonical method instead, and perhaps modify that to check whether the object's encoding is SLP1 and just return the string.
While making the user-friendliness changes, I did debate transliterating the object into SLP1 immediately upon creation and storing it. Then canonical could just return the string itself when denormalization is not needed; that way, we wouldn't have to transliterate each time. I will also take a look at the Sandhi class to see if it is doing any unnecessary transliteration.
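A minimal sketch of that idea, with illustrative names rather than the actual sanskrit_parser API: transcode to SLP1 once at construction, so `canonical()` becomes a plain attribute read instead of a transliterate call.

```python
def _to_slp1(text, encoding):
    # Placeholder for the real sanscript-based transcoder.
    return "<slp1 from %s>" % encoding

class SanskritObject:
    def __init__(self, thing, encoding="SLP1"):
        self.encoding = encoding
        # Eager, one-time transcode; SLP1 input is stored untouched.
        self._canonical = thing if encoding == "SLP1" else _to_slp1(thing, encoding)

    def canonical(self):
        return self._canonical  # no per-call transliteration

obj = SanskritObject("aSvaH")
print(obj.canonical())  # aSvaH
```

The trade-off is paying the transcoding cost up front for every object, even ones whose canonical form is never requested; for objects that are transcoded repeatedly (as the profile suggests), that should still be a net win.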
The `__init__` is bothering me. Isn't sanscript imported only once? Why should it be a significant portion of the time, even as we increase the number of tests?
`SchemeMap.__init__` was getting called for each invocation of `sanscript.transliterate`. The `len` call is expensive, iterating through the entire map for the input encoding. The call stack is below.
We've fixed it by keeping an LRU cache for scheme maps: PR #74.
```
/home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/SanskritLexicalAnalyzer.py(125)<module>()
-> class SanskritLexicalAnalyzer(object):
/home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/SanskritLexicalAnalyzer.py(130)SanskritLexicalAnalyzer()
-> sandhi = Sandhi() # Singleton!
/home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/sandhi.py(137)__init__()
-> self.add_rules_from_dir(os.path.join(base_dir,"sandhi_rules"))
/home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/sandhi.py(353)add_rules_from_dir()
-> self.add_rules_from_file(os.path.join(directory, filename))
/home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/sandhi.py(336)add_rules_from_file()
-> rule = SanskritObject(line).transcoded(SLP1)
/home/arun/src/gh/sanskrit_parser/sanskrit_parser/base/sanskrit_base.py(95)transcoded()
-> schema = _get_schema_map(self.encoding, encoding)
/home/arun/src/gh/sanskrit_parser/sanskrit_parser/base/sanskrit_base.py(54)_get_schema_map()
-> return sanscript.SchemeMap(sanscript.SCHEMES[input_encoding], sanscript.SCHEMES[output_encoding])
> /home/arun/src/gh/sanskrit_parser/.venv/lib/python3.6/site-packages/indic_transliteration/sanscript.py(151)__init__()
-> self.longest = max(len(x) for g in from_scheme
```
Tests now seem to run in half the time after the LRU cache and the removal of unnecessary transliterate calls (probably more due to the former than the latter).
Closing this issue since we have taken care of the slowness and rest api tests.
Test runs in CI are taking ~48mins on latest master. They are timing out for py27.
On running the profiler locally, I observed that a good amount of time is spent in transliteration. Do we need to transliterate from Devanagari each time, or can we optimize and save the test data as SLP1?