kmadathil / sanskrit_parser

Parsers for Sanskrit / संस्कृतम्
MIT License
69 stars 21 forks source link

Slow tests in CI #70

Closed codito closed 6 years ago

codito commented 6 years ago

Test runs in CI are taking ~48mins on latest master. They are timing out for py27.

On running profiler locally, observed a good amount of time is spent in transliteration. Do we need to transliterate from devanagari each time, can we optimize and save the test data as SLP1?

# generate profile data
> python -m cProfile -o profile $(which pytest) tests/test_SanskritLexicalAnalyer.py

# analyze the profile
> python -m pstats profile
Welcome to the profile statistics browser.                                                                                                                     
profile% sort time                                                                                                                                             
profile% stats 10                                                                                                                                              
Thu Dec 14 21:47:07 2017    profile                                                                                                                            

         471264733 function calls (470780348 primitive calls) in **194.862** seconds                                                                               

   Ordered by: internal time                                                                                                                                   
   List reduced from 4211 to 10 due to restriction <10>                                                                                                        

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)                                                                                        
  1832538   **34.757**    0.000  100.051    0.000 /home/arun/src/gh/sanskrit_parser/.venv/lib/python3.6/site-packages/indic_transliteration/sanscript.py:141(__init__)                                                                                                                                                             
    82613   34.130    0.000   79.954    0.001 /home/arun/src/gh/sanskrit_parser/.venv/lib/python3.6/site-packages/networkx/algorithms/simple_paths.py:228(shortest_simple_paths)                                                                                                                                               
150268116   **27.587**    0.000   38.581    0.000 /home/arun/src/gh/sanskrit_parser/.venv/lib/python3.6/site-packages/indic_transliteration/sanscript.py:151(<genexpr>)                                                                                                                                                            
  2879063   16.884    0.000   55.465    0.000 {built-in method builtins.max}                                                                                   
  1832310   **14.374**    0.000   18.223    0.000 /home/arun/src/gh/sanskrit_parser/.venv/lib/python3.6/site-packages/indic_transliteration/sanscript.py:175(_roman)
165318290/165317498   12.610    0.000   12.610    0.000 {built-in method builtins.len}                                                                         
 15996841    7.163    0.000    7.163    0.000 {method 'update' of 'dict' objects}                                                                              
  1832538    5.481    0.000  124.076    0.000 /home/arun/src/gh/sanskrit_parser/.venv/lib/python3.6/site-packages/indic_transliteration/sanscript.py:314(transliterate)                                                                                                                                                        
 46088335    5.309    0.000    5.309    0.000 {method 'add' of 'set' objects}                                                                                  
   646410    4.290    0.000   39.454    0.000 /home/arun/src/gh/sanskrit_parser/.venv/lib/python3.6/site-packages/networkx/algorithms/simple_paths.py:443(_bidirectional_pred_succ)                      
avinashvarna commented 6 years ago

Are we sure that this is really the root cause? I think it may have to do with some dependencies. A while back, when pytest had a new version with a lot of new dependencies, our tests started running slower leading to timeouts, or InvocationError in some cases. I fixed it by rolling back pytest to 3.2.5 which was the known good version. When we added tests for the rest_api, we had to add flask_restplus and flask_cors to the dependencies, and it looks like the slowdown and the InvocationErrors are back again. Looks like your recent test builds confirm that the slowdown and InvocationErrors go away if we remove the dependencies. Not sure how to debug the problem, Does this approach seem worthwhile?

codito commented 6 years ago

The slowness in CI (~24mins to ~48mins) goes away once we remove the flask-cors and flask-restplus dependencies. The root cause for that is not known yet. Is it okay to remove dependencies until we investigate this further? Debug builds may help us, they also publish a docker image we can try out.

Profile numbers shared above do show up without these dependencies as well (tried this on a fast desktop Linux/AMD/x64/12 Core/16G ram). Can you please check if it reproduces for you?

avinashvarna commented 6 years ago

@codito How did you get the test to finish in ~195 seconds? (Did you reduce the number of tests?) On my laptop, the tests take about 20 mins to run and looks like that's the best case on Travis as well.

codito commented 6 years ago

I did a ctrl+c in midway, profiler is wiser to emit data appropriately.

avinashvarna commented 6 years ago

That might be the reason why the transliteration rises to the top. All the transliteration is done at the beginning while creating the reference data. If you had let it run, it would have been 35s out of 20min, which is about 2%. I don't know if we should focus on that right now. I think the bulk of the time is being spent elsewhere.

When debugging the failures after switching to pytest 3.3.x, I had noticed that the 20 slowest tests output by pytest --durations=20, were all just basic tests, but their setup and call time had increased significantly from pytest 3.2 to 3.3. I will try to collect the data again, but the only change in that instance was the pytest version (and associated dependencies).

avinashvarna commented 6 years ago

Is it possible to create a separate test case in tox.ini that will only run test_rest_api with the flask_cors & flask_restplus dependencies specified directly in the tox config? This could be a stop-gap measure while we investigate the problem.

codito commented 6 years ago

Separate env for rest api tests is a great idea. Will try it out.

Re slow performance, transliteration seems to be happening during test execution. Tried few experiments

Command: pytest tests/test_SanskritLexicalAnalyzer.py --test-count x (on py3.6)

Tests count (x) Total execution time Transliteration Graph traversal
5003 501.93s 173.96s [34.65%] 80.44s [16.02%]
1003 79.54s 25.44s [31.98%] 12.381s [15.56%]
103 22.54s 5.69s [25.24%] 4.021s [17.84%]
13 18.35s 4.08s [22.23%] 3.739s [20.38%]

Transliteration stack frames:

Graph traversal stack frames: networkx/algorithms/simple_paths.py:228(shortest_simple_paths)

kmadathil commented 6 years ago

This is quite surprising. I need to drill into the code more, but transcoding shouldn't take up this much time. I suspect it's all the transcoding done in maheshvara_sutras and sandhi that might be doing this. We could avoid this by using the canonical method instead, and modify that to check if the object's encoding is SLP1 and just return the string instead, perhaps.

avinashvarna commented 6 years ago

While making the user friendliness changes, I did debate transliterating the object immediately upon creation into SLP1 and storing it. Then canonical can just return the string itself if denormalization is not needed. This way, we don't have to transliterate each time. I will also take a look at the sandhi class to see if it is doing any unnecessary transliteration.

The __init__ is bothering me. Isn't sanscript imported only once? Why should it be a significant portion of the time, even if we increase the number of tests?

codito commented 6 years ago

The SchemeMap.__init__ was getting called for each invocation of sanscript.transliterate. The len call is expensive iterating through entire map for the input encoding. Call stack is below.

We've fixed it by keeping lru cache for schememaps. PR #74.

 /home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/SanskritLexicalAnalyzer.py(125)<module>()                                                 
-> class SanskritLexicalAnalyzer(object):                                                                                                                      
  /home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/SanskritLexicalAnalyzer.py(130)SanskritLexicalAnalyzer()                                  
-> sandhi = Sandhi() # Singleton!                                                                                                                              
  /home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/sandhi.py(137)__init__()                                                                  
-> self.add_rules_from_dir(os.path.join(base_dir,"sandhi_rules"))                                                                                              
  /home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/sandhi.py(353)add_rules_from_dir()                                                        
-> self.add_rules_from_file(os.path.join(directory, filename))                                                                                                 
  /home/arun/src/gh/sanskrit_parser/sanskrit_parser/lexical_analyzer/sandhi.py(336)add_rules_from_file()                                                       
-> rule = SanskritObject(line).transcoded(SLP1)                                                                                                                
  /home/arun/src/gh/sanskrit_parser/sanskrit_parser/base/sanskrit_base.py(95)transcoded()                                                                      
-> schema = _get_schema_map(self.encoding, encoding)                                                                                                           
  /home/arun/src/gh/sanskrit_parser/sanskrit_parser/base/sanskrit_base.py(54)_get_schema_map()                                                                 
-> return sanscript.SchemeMap(sanscript.SCHEMES[input_encoding], sanscript.SCHEMES[output_encoding])                                                           
> /home/arun/src/gh/sanskrit_parser/.venv/lib/python3.6/site-packages/indic_transliteration/sanscript.py(151)__init__()                                        
-> self.longest = max(len(x) for g in from_scheme  
kmadathil commented 6 years ago

Tests seem to run in half the time now after the LRU cache and removal of unnecessary transliterate calls (probably more of the former than the latter).

codito commented 6 years ago

Closing this issue since we have taken care of the slowness and rest api tests.