kengz / SLM-Lab

Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".
https://slm-lab.gitbook.io/slm-lab/
MIT License
1.23k stars 263 forks

ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command #465

Closed Nick-Kou closed 2 years ago

Nick-Kou commented 3 years ago

Describe the bug

After successfully installing SLM-Lab, I proceeded to the "Quick Start" section, which involves running DQN on the CartPole environment. Everything works well at first (i.e., final_return_ma increases).

Command entered: python run_lab.py slm_lab/spec/demo.json dqn_cartpole dev

After several log summary and metrics entries, an OpenGL error occurs:

[101017:1015/191313.594764:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command

and then the process seems to end without showing any graphs.

To Reproduce

  1. OS and environment: Ubuntu 20.04 LTS

  2. SLM Lab git SHA (run git rev-parse HEAD to get it): dda02d00031553aeda4c49c5baa7d0706c53996b

  3. spec file used: slm_lab/spec/demo.json

Error logs

[2020-10-15 19:13:09,800 PID:100781 INFO __init__.py log_summary] Trial 0 session 0 dqn_cartpole_t0_s0 [train_df] epi: 123  t: 120  wall_t: 153  opt_step: 398720  frame: 10000  fps: 65.3595  total_reward: 200  total_reward_ma: 142.7  loss: 5.46846  lr: 0.00774841  explore_var: 0.1  entropy_coef: nan  entropy: nan  grad_norm: 0.230459
[2020-10-15 19:13:09,821 PID:100781 INFO __init__.py log_metrics] Trial 0 session 0 dqn_cartpole_t0_s0 [train_df metrics] final_return_ma: 142.7  strength: 120.84  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 0.00019783  training_efficiency: 5.02079e-06  stability: 0.926742
[100946:1015/191310.923076:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
[2020-10-15 19:13:12,794 PID:100781 INFO __init__.py log_metrics] Trial 0 session 0 dqn_cartpole_t0_s0 [eval_df metrics] final_return_ma: 142.7  strength: 120.84  max_strength: 178.14  final_strength: 178.14  sample_efficiency: 0.00019783  training_efficiency: 5.02079e-06  stability: 0.926742
[2020-10-15 19:13:12,798 PID:100781 INFO logger.py info] Session 0 done
[101017:1015/191313.594764:ERROR:buffer_manager.cc(488)] [.DisplayCompositor]GL ERROR :GL_INVALID_OPERATION : glBufferData: <- error from previous GL command
[2020-10-15 19:13:15,443 PID:100781 INFO logger.py info] Trial 0 done

kengz commented 3 years ago

Hi @Nick-Kou, it seems this is caused by OpenGL rendering on the GPU. I looked up a few instances of this problem, and this answer offers a potential solution: https://stackoverflow.com/a/58157803

Nick-Kou commented 3 years ago

After trying that potential solution, the same OpenGL errors still occurred. I then tried this solution: https://www.reddit.com/r/Crostini/comments/f0g9d3/disable_gpu_acceleration_per_app/. After entering the command export LIBGL_ALWAYS_SOFTWARE=1, everything started in that terminal runs without hardware acceleration. Unfortunately, still no luck: the same errors occurred.
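
For reference, the same variable can also be scoped to a single run instead of the whole terminal session. The following is a minimal Python sketch, not part of SLM-Lab: the helper names `software_gl_env` and `run_with_software_gl` are made up for illustration, and it assumes the run is launched as a child process.

```python
import os
import subprocess


def software_gl_env(base_env=None):
    """Return a copy of the environment that requests software rendering."""
    env = dict(os.environ if base_env is None else base_env)
    # LIBGL_ALWAYS_SOFTWARE=1 is the Mesa setting tried above; it asks for
    # CPU rendering instead of GPU rendering.
    env["LIBGL_ALWAYS_SOFTWARE"] = "1"
    return env


def run_with_software_gl(cmd):
    """Run `cmd` (a list of arguments) with the variable set; return its exit code."""
    return subprocess.run(cmd, env=software_gl_env(), check=False).returncode
```

With this, the command from the report could be launched as run_with_software_gl(["python", "run_lab.py", "slm_lab/spec/demo.json", "dqn_cartpole", "dev"]). Only the child process and anything it spawns would see the variable.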

kengz commented 3 years ago

Sorry it didn't work. The error is quite unspecific, but judging from where it occurs (between session end and trial end), not much is happening at that point, and one step that is possibly OpenGL-related is saving graphs. So I wonder if it's related to Plotly's backend. Could you check by short-circuiting this method (in your local copy): https://github.com/kengz/SLM-Lab/blob/dda02d00031553aeda4c49c5baa7d0706c53996b/slm_lab/lib/viz.py#L117-L124

and replace the body with just

def save_image(figure, filepath): 
     return
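
A less drastic variant of this short-circuit, sketched here under assumptions, keeps the export but treats a failure as non-fatal. The name `save_image_safely` and the `write_fn` parameter are hypothetical and not part of SLM-Lab; `write_fn` stands for whatever callable performs the actual export.

```python
import logging

logger = logging.getLogger(__name__)


def save_image_safely(figure, filepath, write_fn):
    """Try to export `figure` to `filepath`; return True on success, False on failure."""
    try:
        write_fn(figure, filepath)
        return True
    except Exception as exc:  # broad on purpose: any export failure is non-fatal here
        logger.warning("Skipping image %s: %s", filepath, exc)
        return False
```

For example, plotly.io.write_image could be passed as write_fn. Note that this only catches Python exceptions. Messages printed by a backend process, such as the GL error above, would still appear.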

Nick-Kou commented 3 years ago

Thank you! This has solved the problem. I am just starting to use and learn about SLM Lab and am still quite unfamiliar with it. I was wondering what exactly this function does and, if it is important, how I could go about implementing an equivalent that does not produce errors?

kengz commented 3 years ago

The method saves generated Plotly plots to image files. It seems the issue is due to Plotly's backend for writing images, known as plotly-orca. The same issue shows up here:

Let's try something simple. Can you update the plotly-orca package:

conda activate lab
conda update plotly-orca

Nick-Kou commented 3 years ago

Thank you once again. However, after updating plotly-orca, the same errors occur.

kengz commented 3 years ago

Alright, the next thing is to test plotly-orca directly. Could you run through some of the Quick Start examples from their repo (https://github.com/plotly/orca#quick-start)? In particular, these:

  1. Directly in the terminal. Note you need to activate Conda so you'd have the orca command.

    conda activate lab
    orca graph '{ "data": [{"y": [1,2,1]}] }' -o fig.png

  2. From Python, again run with Conda activated. Make and call the following Python script.

    from subprocess import call
    import json
    import plotly

    fig = {"data": [{"y": [1,2,1]}]}
    call(['orca', 'graph', json.dumps(fig, cls=plotly.utils.PlotlyJSONEncoder)])

If the issue is confirmed as caused by orca, we can open an issue on there.
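
The two checks above could also be combined into one script that records the exit code and looks for the GL error in the output. This is only a sketch: `has_gl_error` and `check_orca` are made-up helper names, and it assumes the orca command is on the PATH (i.e., the Conda environment is active).

```python
import json
import shutil
import subprocess


def has_gl_error(text):
    """Return True if the text contains an OpenGL error line like the one reported."""
    return "GL ERROR" in text


def check_orca(output_path="fig.png"):
    """Render the Quick Start figure with orca; return (exit code, GL error seen)."""
    if shutil.which("orca") is None:
        raise FileNotFoundError("orca not found on PATH; activate the Conda env first")
    fig = {"data": [{"y": [1, 2, 1]}]}
    result = subprocess.run(
        ["orca", "graph", json.dumps(fig), "-o", output_path],
        capture_output=True,
        text=True,
    )
    # orca may print the error on either stream, so both are scanned.
    return result.returncode, has_gl_error(result.stdout + result.stderr)
```

Calling check_orca() and including the returned pair in a bug report would show whether orca fails outright or only logs the error.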

Nick-Kou commented 3 years ago

Results from first test: Test 1

Results from second test: Test 2

I think this issue is confirmed as caused by orca.

kengz commented 3 years ago

Thanks for confirming, @Nick-Kou. Could you open an issue on their repo and link to this one, with these errors and how you produced them, and maybe your OS and orca versions as well? Let's see how they can fix the issue.

Nick-Kou commented 3 years ago

Sounds good. An issue was opened. Thanks once again.

plotly/orca#352

kengz commented 2 years ago

#501 replaces orca with kaleido for Plotly; please use the latest release, v4.2.4. Closing this issue as it's old; feel free to reopen if the same issue comes up again with kaleido.
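
To check which export backend a given environment would end up with, a small helper along these lines may be useful. It is an illustrative sketch, not code from SLM-Lab, and `pick_image_engine` is a hypothetical name.

```python
import importlib.util


def pick_image_engine():
    """Prefer kaleido when it is importable; otherwise fall back to orca."""
    if importlib.util.find_spec("kaleido") is not None:
        return "kaleido"
    return "orca"
```

With a Plotly version that accepts the engine argument, the result could be passed as figure.write_image("fig.png", engine=pick_image_engine()).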