I think we need to profile this to see what is happening. The program might be bottlenecked by Python's GIL. A quick check you could do is to look at CPU usage as the number of workers increases: if CPU usage with 1 worker is similar to CPU usage with more workers, it could be the GIL.
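A minimal sketch of that check, assuming psutil is available; run_loader here is a stand-in for whatever the repro script actually calls:

import psutil

proc = psutil.Process()
proc.cpu_percent(None)        # first call only establishes the baseline
run_loader(num_workers=10)    # stand-in for the actual loading call
# With N busy worker threads this can approach N * 100%; if it stays
# near 100% regardless of worker count, the workers are GIL-bound.
print(proc.cpu_percent(None))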
It seems that the CPU usage does increase with more workers, so it might not be the GIL.
I did some basic profiling and here is what I got:
Mon Oct 11 09:53:26 2021 script.prof
1778481 function calls (1737841 primitive calls) in 129.245 seconds
Ordered by: cumulative time
List reduced from 9010 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
3674/1 0.032 0.000 129.252 129.252 {built-in method builtins.exec}
1 0.000 0.000 129.252 129.252 test_load_reverb.py:1(<module>)
1 0.000 0.000 127.922 127.922 /home/yicheng/virtualenvs/orax/lib/python3.8/site-packages/absl/app.py:277(run)
1 3.485 3.485 127.888 127.888 /home/yicheng/virtualenvs/orax/lib/python3.8/site-packages/absl/app.py:238(_run_main)
1 0.002 0.002 124.403 124.403 test_load_reverb.py:67(main)
1 0.001 0.001 124.366 124.366 test_load_reverb.py:22(load_dataset_into_reverb)
101 124.358 1.231 124.358 1.231 {method 'acquire' of '_thread.lock' objects}
23 0.000 0.000 124.358 5.407 /usr/lib/python3.8/threading.py:270(wait)
11 0.000 0.000 123.257 11.205 /usr/lib/python3.8/concurrent/futures/_base.py:612(result_iterator)
10 0.000 0.000 123.257 12.326 /usr/lib/python3.8/concurrent/futures/_base.py:416(result)
2933/10 0.007 0.000 1.330 0.133 <frozen importlib._bootstrap>:986(_find_and_load)
2932/10 0.006 0.000 1.330 0.133 <frozen importlib._bootstrap>:956(_find_and_load_unlocked)
2854/10 0.006 0.000 1.329 0.133 <frozen importlib._bootstrap>:650(_load_unlocked)
4878/10 0.001 0.000 1.329 0.133 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
2626/10 0.004 0.000 1.329 0.133 <frozen importlib._bootstrap_external>:842(exec_module)
10246/21 0.006 0.000 1.327 0.063 <frozen importlib._bootstrap>:1017(_handle_fromlist)
2120/9 0.003 0.000 1.327 0.147 {built-in method builtins.__import__}
1 0.000 0.000 1.213 1.213 /home/yicheng/virtualenvs/orax/lib/python3.8/site-packages/acme/adders/reverb/__init__.py:16(<module>)
1 0.000 0.000 1.103 1.103 /usr/lib/python3.8/concurrent/futures/_base.py:583(map)
1 0.000 0.000 1.103 1.103 /usr/lib/python3.8/concurrent/futures/_base.py:608(<listcomp>)
It seems that a lot of time is spent acquiring a thread lock somewhere, but I am not sure where that is happening. I hope this is useful for identifying the issue.
It is hard to tell without a stack trace showing where the lock is acquired. Can you get it from the profile?
I am not that familiar with cProfile, so I am not entirely sure what's happening. I am attaching the output generated from running python3 -m cProfile -o ... in case that's useful:
script.prof.zip
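For what it's worth, the caller view in pstats can show where that lock is acquired from a saved profile. A sketch, assuming the dump is named script.prof:

import pstats

stats = pstats.Stats('script.prof')
stats.strip_dirs().sort_stats('cumulative')
# Show which functions call into _thread.lock.acquire and how much
# cumulative time flows through each call site.
stats.print_callers('acquire')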
The main source of slowness is the map between append and create_item: tree.map_structure(operator.itemgetter(-1), writer.history)
Adding more threads doesn't help because the computations run sequentially due to the GIL. In fact, more threads make it run slower due to context switches. The cProfile output you provided shows the main thread waiting on the workers to complete, so it doesn't tell us much.
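For context, the per-step write pattern in question looks roughly like this; the table name, priority, and step data below are placeholders, not from the actual script:

import operator
import reverb
import tree  # dm-tree

client = reverb.Client('localhost:8000')
with client.trajectory_writer(num_keep_alive_refs=1) as writer:
    for step in episode:  # `episode` stands in for the real data source
        writer.append(step)  # e.g. {'observation': ..., 'action': ...}
        # This map over the whole history structure is the hot spot:
        # it is pure Python work that runs once per step under the GIL.
        trajectory = tree.map_structure(operator.itemgetter(-1), writer.history)
        writer.create_item(table='my_table', priority=1.0, trajectory=trajectory)
    writer.flush()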
I see. Is there a good workaround to speed up the map_structure there without sacrificing generality (i.e. still allowing insertion of arbitrarily stacked items)?
As for the GIL, I have no idea how to solve that. Maybe it's possible to write some Cython code that releases the GIL when running the for loop. In that case, is the client thread-safe?
The easiest way is to use multiple processes to write data to Reverb.
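A minimal sketch of that approach, assuming the server is already running on localhost:8000 and the data can be split into independent shards; split_dataset and the table name are placeholders:

import multiprocessing as mp
import operator
import reverb
import tree

def write_shard(shard):
    # Each process gets its own client and writer, so the per-step
    # Python work runs under a separate interpreter and its own GIL.
    client = reverb.Client('localhost:8000')
    with client.trajectory_writer(num_keep_alive_refs=1) as writer:
        for step in shard:
            writer.append(step)
            writer.create_item(
                table='my_table',
                priority=1.0,
                trajectory=tree.map_structure(operator.itemgetter(-1), writer.history))
        writer.flush()

if __name__ == '__main__':
    shards = split_dataset(num_shards=10)  # hypothetical sharding helper
    procs = [mp.Process(target=write_shard, args=(s,)) for s in shards]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

In practice it is usually better to pass shard indices or file paths to each worker rather than the data itself, so that each process loads its own slice instead of pickling large arrays across the process boundary.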
You are probably right. I will go with that then. Thanks!
Hi, this is a follow-up to #72.
I did some profiling using the solution suggested in #72; however, I found that it gives no speed-up.
Here is a complete repro script; I have modified a few things from the answer in #72 (including some fixes to the solution, as well as making the script self-contained).
The total time it takes for the loading to finish is the same whether I use 10 workers or 1 worker. I am not sure what's happening; could this be an issue with Python's GIL?
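The attached script is not reproduced here, but judging from the profile above (the concurrent.futures map/result calls), the loader has roughly this shape; the signature and write_shard are assumptions, not the actual code:

from concurrent import futures

def load_dataset_into_reverb(shards, num_workers=10):
    # Every worker spends its time in Python-level append/create_item
    # calls, so a thread pool serializes on the GIL and 10 workers take
    # about as long as 1.
    with futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        list(executor.map(write_shard, shards))  # write_shard as in the sketch above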