matthieu637 / ddrl

Deep Developmental Reinforcement Learning
MIT License

perf on FetchPush-v1 #8

Closed: huangjiancong1 closed this issue 5 years ago

huangjiancong1 commented 5 years ago

It seems that the "last perf" value is hard to increase when using FetchPush-v1; see the bottom of the output below:

main algo : PeNFAC(lambda)-V
episode 0 total steps 0 last perf 0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 0     50     -50.000000 -39.4993932862 -50.0000000000 0.00000  0.20000 0 0.000 49.710 82.538
episode 100 total steps 5000 last perf 0.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 100   50     -50.000 -39.4993932862 -50.0000000000 305.25273 0.20000 250 0.416 742.022 1764.851
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 100   50     -50.000 -39.4993932862 -50.0000000000 305.25273 0.20000 250 0.416 742.022 1764.851
episode 200 total steps 10000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 200   50     -50.000 -39.4993932862 -50.0000000000 439.32512 0.20000 250 0.300 874.353 3509.462
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 200   50     -50.000 -39.4993932862 -50.0000000000 439.32512 0.20000 250 0.300 874.353 3509.462
episode 300 total steps 15000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 300   50     -50.000 -39.4993932862 -50.0000000000 405.47166 0.20000 250 0.296 991.178 3248.775
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 300   50     -50.000 -39.4993932862 -50.0000000000 405.47166 0.20000 250 0.296 991.178 3248.775
episode 400 total steps 20000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 400   50     -50.000 -39.4993932862 -50.0000000000 288.83902 0.20000 250 0.408 1073.500 2395.008
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 400   50     -50.000 -39.4993932862 -50.0000000000 288.83902 0.20000 250 0.408 1073.500 2395.008
episode 500 total steps 25000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 500   50     -50.000 -39.4993932862 -50.0000000000 434.30841 0.20000 250 0.300 1221.075 2090.089
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 500   50     -50.000 -39.4993932862 -50.0000000000 434.30841 0.20000 250 0.300 1221.075 2090.089
episode 600 total steps 30000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 600   50     -50.000 -39.4993932862 -50.0000000000 425.04641 0.20000 250 0.300 1245.507 3301.051
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 600   50     -50.000 -39.4993932862 -50.0000000000 425.04641 0.20000 250 0.300 1245.507 3301.051
episode 700 total steps 35000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 700   50     -50.000 -39.4993932862 -50.0000000000 395.15348 0.20000 250 0.288 1453.767 2843.399
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 700   50     -50.000 -39.4993932862 -50.0000000000 395.15348 0.20000 250 0.288 1453.767 2843.399
episode 800 total steps 40000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 800   50     -7.000  -6.7934652093 -7.0000000000 343.70716 0.20000 250 0.340 1426.707 2757.104
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 800   50     -50.000 -39.4993932862 -50.0000000000 343.70716 0.20000 250 0.340 1426.707 2757.104
episode 900 total steps 45000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 900   50     -50.000 -39.4993932862 -50.0000000000 313.93685 0.20000 250 0.344 1444.552 3218.356
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 900   50     -50.000 -39.4993932862 -50.0000000000 313.93685 0.20000 250 0.344 1444.552 3218.356
episode 1000 total steps 50000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1000  50     -50.000 -39.4993932862 -50.0000000000 300.85292 0.20000 250 0.424 1414.471 4332.106
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1000  50     -50.000 -39.4993932862 -50.0000000000 300.85292 0.20000 250 0.424 1414.471 4332.106
episode 1100 total steps 55000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1100  50     -50.000 -39.4993932862 -50.0000000000 301.20900 0.20000 250 0.368 1413.057 3611.521
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1100  50     -50.000 -39.4993932862 -50.0000000000 301.20900 0.20000 250 0.368 1413.057 3611.521
episode 1200 total steps 60000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1200  50     -50.000 -39.4993932862 -50.0000000000 302.35951 0.20000 250 0.380 1421.981 3232.280
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1200  50     -50.000 -39.4993932862 -50.0000000000 302.35951 0.20000 250 0.380 1421.981 3232.280
episode 1300 total steps 65000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1300  50     -45.000 -34.5983982762 -45.0000000000 313.15894 0.20000 250 0.264 1431.092 3603.456
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1300  50     -50.000 -39.4993932862 -50.0000000000 313.15894 0.20000 250 0.264 1431.092 3603.456
episode 1400 total steps 70000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1400  50     -50.000 -39.4993932862 -50.0000000000 293.27581 0.20000 250 0.396 1432.452 2308.182
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1400  50     -50.000 -39.4993932862 -50.0000000000 293.27581 0.20000 250 0.396 1432.452 2308.182
episode 1500 total steps 75000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1500  50     -50.000 -39.4993932862 -50.0000000000 299.37081 0.20000 250 0.420 1440.287 2513.630
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1500  50     -50.000 -39.4993932862 -50.0000000000 299.37081 0.20000 250 0.420 1440.287 2513.630
episode 1600 total steps 80000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1600  50     0.000   0.0000000000 0.0000000000 188.98480 0.20000 250 0.364 1408.132 2232.789
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1600  50     -50.000 -39.4993932862 -50.0000000000 188.98480 0.20000 250 0.364 1408.132 2232.789
episode 1700 total steps 85000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1700  50     -50.000 -39.4993932862 -50.0000000000 288.04573 0.20000 250 0.436 1444.980 2849.277
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1700  50     -50.000 -39.4993932862 -50.0000000000 288.04573 0.20000 250 0.436 1444.980 2849.277
episode 1800 total steps 90000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1800  50     -50.000 -39.4993932862 -50.0000000000 308.32325 0.20000 250 0.392 1501.293 3345.464
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1800  50     -50.000 -39.4993932862 -50.0000000000 308.32325 0.20000 250 0.392 1501.293 3345.464
episode 1900 total steps 95000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 1900  50     -50.000 -39.4993932862 -50.0000000000 331.21574 0.20000 250 0.312 1524.992 2625.236
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 1900  50     -50.000 -39.4993932862 -50.0000000000 331.21574 0.20000 250 0.312 1524.992 2625.236
episode 2000 total steps 100000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2000  50     -50.000 -39.4993932862 -50.0000000000 342.53142 0.20000 250 0.304 1562.716 3283.008
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2000  50     -50.000 -39.4993932862 -50.0000000000 342.53142 0.20000 250 0.304 1562.716 3283.008
episode 2100 total steps 105000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2100  50     -50.000 -39.4993932862 -50.0000000000 314.89381 0.20000 250 0.348 1662.225 2576.668
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2100  50     -50.000 -39.4993932862 -50.0000000000 314.89381 0.20000 250 0.348 1662.225 2576.668
episode 2200 total steps 110000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2200  50     -50.000 -39.4993932862 -50.0000000000 325.43476 0.20000 250 0.368 1689.377 2315.436
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2200  50     -50.000 -39.4993932862 -50.0000000000 325.43476 0.20000 250 0.368 1689.377 2315.436
episode 2300 total steps 115000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2300  50     -50.000 -39.4993932862 -50.0000000000 323.92725 0.20000 250 0.336 1712.382 2859.400
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2300  50     -50.000 -39.4993932862 -50.0000000000 323.92725 0.20000 250 0.336 1712.382 2859.400
episode 2400 total steps 120000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2400  50     -50.000 -39.4993932862 -50.0000000000 302.64502 0.20000 250 0.380 1797.492 2897.479
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2400  50     -50.000 -39.4993932862 -50.0000000000 302.64502 0.20000 250 0.380 1797.492 2897.479
episode 2500 total steps 125000 last perf 0.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2500  50     -50.000 -39.4993932862 -50.0000000000 74.62224 0.20000 250 0.404 1732.337 3855.214
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2500  50     -50.000 -39.4993932862 -50.0000000000 74.62224 0.20000 250 0.404 1732.337 3855.214
episode 2600 total steps 130000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2600  50     -50.000 -39.4993932862 -50.0000000000 302.42988 0.20000 250 0.384 1821.087 3403.937
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2600  50     -50.000 -39.4993932862 -50.0000000000 302.42988 0.20000 250 0.384 1821.087 3403.937
episode 2700 total steps 135000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2700  50     -50.000 -39.4993932862 -50.0000000000 309.27921 0.20000 250 0.352 1825.681 2691.617
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2700  50     -50.000 -39.4993932862 -50.0000000000 309.27921 0.20000 250 0.352 1825.681 2691.617
episode 2800 total steps 140000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2800  50     -50.000 -39.4993932862 -50.0000000000 303.45071 0.20000 250 0.452 1823.744 2267.838
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2800  50     -50.000 -39.4993932862 -50.0000000000 303.45071 0.20000 250 0.452 1823.744 2267.838
episode 2900 total steps 145000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 2900  50     -46.000 -35.5589942862 -46.0000000000 300.21120 0.20000 250 0.380 1832.871 2108.694
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 2900  50     -50.000 -39.4993932862 -50.0000000000 300.21120 0.20000 250 0.380 1832.871 2108.694
episode 3000 total steps 150000 last perf 0.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3000  50     -50.000 -39.4993932862 -50.0000000000 296.41106 0.20000 250 0.396 1840.616 2203.382
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3000  50     -50.000 -39.4993932862 -50.0000000000 296.41106 0.20000 250 0.396 1840.616 2203.382
episode 3100 total steps 155000 last perf -47.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3100  50     -50.000 -39.4993932862 -50.0000000000 304.23030 0.20000 250 0.392 1815.067 2564.230
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3100  50     -50.000 -39.4993932862 -50.0000000000 304.23030 0.20000 250 0.392 1815.067 2564.230
episode 3200 total steps 160000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3200  50     -50.000 -39.4993932862 -50.0000000000 314.90150 0.20000 250 0.352 1807.986 2344.424
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3200  50     -50.000 -39.4993932862 -50.0000000000 314.90150 0.20000 250 0.352 1807.986 2344.424
episode 3300 total steps 165000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3300  50     -50.000 -39.4993932862 -50.0000000000 294.56957 0.20000 250 0.408 1827.943 2092.198
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3300  50     -50.000 -39.4993932862 -50.0000000000 294.56957 0.20000 250 0.408 1827.943 2092.198
episode 3400 total steps 170000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3400  50     -50.000 -39.4993932862 -50.0000000000 319.27477 0.20000 250 0.312 1847.207 2624.171
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3400  50     -50.000 -39.4993932862 -50.0000000000 319.27477 0.20000 250 0.312 1847.207 2624.171
episode 3500 total steps 175000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3500  50     -50.000 -39.4993932862 -50.0000000000 302.88077 0.20000 250 0.396 1832.862 2764.258
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3500  50     -50.000 -39.4993932862 -50.0000000000 302.88077 0.20000 250 0.396 1832.862 2764.258
episode 3600 total steps 180000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3600  50     -50.000 -39.4993932862 -50.0000000000 189.69495 0.20000 250 0.328 1798.681 2894.372
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3600  50     -50.000 -39.4993932862 -50.0000000000 189.69495 0.20000 250 0.328 1798.681 2894.372
episode 3700 total steps 185000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3700  50     -50.000 -39.4993932862 -50.0000000000 174.25073 0.20000 250 0.388 1799.426 3356.493
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3700  50     -50.000 -39.4993932862 -50.0000000000 174.25073 0.20000 250 0.388 1799.426 3356.493
episode 3800 total steps 190000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3800  50     -50.000 -39.4993932862 -50.0000000000 318.32951 0.20000 250 0.356 1815.443 3215.285
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3800  50     -50.000 -39.4993932862 -50.0000000000 318.32951 0.20000 250 0.356 1815.443 3215.285
episode 3900 total steps 195000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 3900  50     -50.000 -39.4993932862 -50.0000000000 190.00381 0.20000 250 0.372 1768.111 3086.686
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 3900  50     -50.000 -39.4993932862 -50.0000000000 190.00381 0.20000 250 0.372 1768.111 3086.686
episode 4000 total steps 200000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4000  50     -50.000 -39.4993932862 -50.0000000000 305.49717 0.20000 250 0.320 1800.324 2677.632
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4000  50     -50.000 -39.4993932862 -50.0000000000 305.49717 0.20000 250 0.320 1800.324 2677.632
episode 4100 total steps 205000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4100  50     -50.000 -39.4993932862 -50.0000000000 186.22632 0.20000 250 0.296 1756.212 2525.554
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4100  50     -50.000 -39.4993932862 -50.0000000000 186.22632 0.20000 250 0.296 1756.212 2525.554
episode 4200 total steps 210000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4200  50     -50.000 -39.4993932862 -50.0000000000 186.62446 0.20000 250 0.312 1745.142 2559.601
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4200  50     -50.000 -39.4993932862 -50.0000000000 186.62446 0.20000 250 0.312 1745.142 2559.601
episode 4300 total steps 215000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4300  50     -50.000 -39.4993932862 -50.0000000000 92.11743 0.20000 250 0.424 1706.861 4155.797
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4300  50     -50.000 -39.4993932862 -50.0000000000 92.11743 0.20000 250 0.424 1706.861 4155.797
episode 4400 total steps 220000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4400  50     -50.000 -39.4993932862 -50.0000000000 305.29147 0.20000 250 0.372 1807.624 3354.478
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4400  50     -50.000 -39.4993932862 -50.0000000000 305.29147 0.20000 250 0.372 1807.624 3354.478
episode 4500 total steps 225000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4500  50     -50.000 -39.4993932862 -50.0000000000 309.62576 0.20000 250 0.400 1839.398 2642.012
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4500  50     -50.000 -39.4993932862 -50.0000000000 309.62576 0.20000 250 0.400 1839.398 2642.012
episode 4600 total steps 230000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4600  50     -50.000 -39.4993932862 -50.0000000000 297.69136 0.20000 250 0.460 1855.533 3569.598
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4600  50     -50.000 -39.4993932862 -50.0000000000 297.69136 0.20000 250 0.460 1855.533 3569.598
episode 4700 total steps 235000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4700  50     0.000   0.0000000000 0.0000000000 303.72088 0.20000 250 0.392 1881.793 2197.497
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4700  50     -50.000 -39.4993932862 -50.0000000000 303.72088 0.20000 250 0.392 1881.793 2197.497
episode 4800 total steps 240000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4800  50     -50.000 -39.4993932862 -50.0000000000 182.33329 0.20000 250 0.344 1847.079 3410.346
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4800  50     -50.000 -39.4993932862 -50.0000000000 182.33329 0.20000 250 0.344 1847.079 3410.346
episode 4900 total steps 245000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 4900  50     -50.000 -39.4993932862 -50.0000000000 297.86280 0.20000 250 0.456 1911.227 3078.142
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 4900  50     -50.000 -39.4993932862 -50.0000000000 297.86280 0.20000 250 0.456 1911.227 3078.142
episode 5000 total steps 250000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5000  50     -50.000 -39.4993932862 -50.0000000000 306.03380 0.20000 250 0.420 1913.015 2469.496
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5000  50     -50.000 -39.4993932862 -50.0000000000 306.03380 0.20000 250 0.420 1913.015 2469.496
episode 5100 total steps 255000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5100  50     -50.000 -39.4993932862 -50.0000000000 306.94059 0.20000 250 0.468 1928.741 2002.506
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5100  50     -50.000 -39.4993932862 -50.0000000000 306.94059 0.20000 250 0.468 1928.741 2002.506
episode 5200 total steps 260000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5200  50     -50.000 -39.4993932862 -50.0000000000 186.94835 0.20000 250 0.324 1915.164 2217.157
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5200  50     -50.000 -39.4993932862 -50.0000000000 186.94835 0.20000 250 0.324 1915.164 2217.157
episode 5300 total steps 265000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5300  50     -50.000 -39.4993932862 -50.0000000000 308.41859 0.20000 250 0.356 1961.006 2286.489
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5300  50     -50.000 -39.4993932862 -50.0000000000 308.41859 0.20000 250 0.356 1961.006 2286.489
episode 5400 total steps 270000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5400  50     -50.000 -39.4993932862 -50.0000000000 321.18466 0.20000 250 0.356 1953.632 2244.455
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5400  50     -50.000 -39.4993932862 -50.0000000000 321.18466 0.20000 250 0.356 1953.632 2244.455
episode 5500 total steps 275000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5500  50     -50.000 -39.4993932862 -50.0000000000 321.13202 0.20000 250 0.344 1972.188 2446.748
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5500  50     -50.000 -39.4993932862 -50.0000000000 321.13202 0.20000 250 0.344 1972.188 2446.748
episode 5600 total steps 280000 last perf -48.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5600  50     -50.000 -39.4993932862 -50.0000000000 195.19788 0.20000 250 0.324 1914.795 2102.640
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5600  50     -50.000 -39.4993932862 -50.0000000000 195.19788 0.20000 250 0.324 1914.795 2102.640
episode 5700 total steps 285000 last perf -47
episode 5800 total steps 290000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5800  50     -50.000 -39.4993932862 -50.0000000000 306.55587 0.20000 250 0.496 1943.434 2069.213
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5800  50     -50.000 -39.4993932862 -50.0000000000 306.55587 0.20000 250 0.496 1943.434 2069.213
episode 5900 total steps 295000 last perf 0.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 5900  50     -50.000 -39.4993932862 -50.0000000000 202.24115 0.20000 250 0.324 1889.927 2214.111
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 5900  50     -50.000 -39.4993932862 -50.0000000000 202.24115 0.20000 250 0.324 1889.927 2214.111
episode 6000 total steps 300000 last perf -50.0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 6000  50     -50.000 -39.4993932862 -50.0000000000 304.77453 0.20000 250 0.412 1937.542 1790.405
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 6000  50     -50.000 -39.4993932862 -50.0000000000 304.77453 0.20000 250 0.412 1937.542 1790.405
huangjiancong1 commented 5 years ago

Perf in OpenAI-Gym/HalfCheetah-v2:

/home/jim/anaconda2/envs/clustering/lib/python3.5/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.24.2) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
{'config': 'config.ini', 'view': False, 'save_best': False, 'load': None, 'render': False, 'capture': False, 'test_only': False}
ENV:  <TimeLimit<HalfCheetahEnv<HalfCheetah-v2>>>
State space: Box(17,)
- low: [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
 -inf -inf -inf]
- high: [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]
Action space: Box(6,)
- low: [-1. -1. -1. -1. -1. -1.]
- high: [1. 1. 1. 1. 1. 1.]
Create agent with (nb_motors, nb_sensors) :  6 17
main algo : PeNFAC(lambda)-V
episode 0 total steps 0 last perf 0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 0     1000   -47.115292 -1.8866550501 -47.1152922702 0.00000  0.20000 0 0.000 43.286 63.462
episode 100 total steps 100000 last perf 0.22751676727166603
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 100   1000   -35.538 -4.6774409211 -35.5381946659 4.48089  0.20000 5000 0.491 405.149 83.786
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 100   1000   -0.049  0.0439331957 -0.0487643295 4.48089  0.20000 5000 0.491 405.149 83.786
episode 200 total steps 200000 last perf 140.46418944639632
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 200   1000   204.978 56.9361475313 204.9779973113 74.57837 0.20000 5000 0.488 654.829 149.948
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 200   1000   48.583  3.4378766450 48.5834364685 74.57837 0.20000 5000 0.488 654.829 149.948
episode 300 total steps 300000 last perf 941.4459374037215
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 300   1000   319.603 86.4225904655 319.6026779233 307.82436 0.20000 5000 0.491 835.650 181.359
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 300   1000   649.431 123.0760117041 649.4310041797 307.82436 0.20000 5000 0.491 835.650 181.359
episode 400 total steps 400000 last perf 1998.9159226163288
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 400   1000   1541.706 137.0518755685 1541.7061888397 2129.72309 0.20000 5000 0.537 1031.988 213.218
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 400   1000   2010.571 136.1758744628 2010.5709348364 2129.72309 0.20000 5000 0.537 1031.988 213.218
episode 500 total steps 500000 last perf 678.2993231530708
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 500   1000   191.508 99.8058855712 191.5079652816 5862.61120 0.20000 5000 0.568 1227.944 245.120
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 500   1000   2799.371 232.6936888905 2799.3706928199 5862.61120 0.20000 5000 0.568 1227.944 245.120
episode 600 total steps 600000 last perf 3279.0396426752723
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 600   1000   2524.436 202.3936700988 2524.4356832294 9601.44991 0.20000 5000 0.552 1287.929 266.153
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 600   1000   3428.879 278.4329341160 3428.8785063861 9601.44991 0.20000 5000 0.552 1287.929 266.153
huangjiancong1 commented 5 years ago

With plain DDPG on FetchReach, the reward is also always -50, and in the MuJoCo rendering the gripper does not reach the desired_goal.

command: python -m baselines.run --alg=ddpg --env=FetchReach-v1 --num_timesteps=5000 --play

(clustering) jim@jim-Inspiron-7577:~/baselines $ python -m baselines.run --alg=ddpg --env=FetchReach-v1 --num_timesteps=5000 --play
/home/jim/anaconda2/envs/clustering/lib/python3.5/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.24.2) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Logging to /tmp/openai-2019-06-24-09-15-06-825990
env_type: robotics
2019-06-24 09:15:14.135417: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-06-24 09:15:14.388060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-24 09:15:14.388312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.44GiB
2019-06-24 09:15:14.388329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Training ddpg on robotics:FetchReach-v1 with arguments 
{'network': 'mlp'}
scaling actions by [1. 1. 1. 1.] before executing in env
setting up param noise
  param_noise_actor/mlp_fc0/w:0 <- actor/mlp_fc0/w:0 + noise
  param_noise_actor/mlp_fc0/b:0 <- actor/mlp_fc0/b:0 + noise
  param_noise_actor/mlp_fc1/w:0 <- actor/mlp_fc1/w:0 + noise
  param_noise_actor/mlp_fc1/b:0 <- actor/mlp_fc1/b:0 + noise
  param_noise_actor/dense/kernel:0 <- actor/dense/kernel:0 + noise
  param_noise_actor/dense/bias:0 <- actor/dense/bias:0 + noise
  adaptive_param_noise_actor/mlp_fc0/w:0 <- actor/mlp_fc0/w:0 + noise
  adaptive_param_noise_actor/mlp_fc0/b:0 <- actor/mlp_fc0/b:0 + noise
  adaptive_param_noise_actor/mlp_fc1/w:0 <- actor/mlp_fc1/w:0 + noise
  adaptive_param_noise_actor/mlp_fc1/b:0 <- actor/mlp_fc1/b:0 + noise
  adaptive_param_noise_actor/dense/kernel:0 <- actor/dense/kernel:0 + noise
  adaptive_param_noise_actor/dense/bias:0 <- actor/dense/bias:0 + noise
setting up actor optimizer
  actor shapes: [[16, 64], [64], [64, 64], [64], [64, 4], [4]]
  actor params: 5508
setting up critic optimizer
  regularizing: critic/mlp_fc0/w:0
  regularizing: critic/mlp_fc1/w:0
  applying l2 regularization with 0.01
  critic shapes: [[20, 64], [64], [64, 64], [64], [64, 1], [1]]
  critic params: 5569
setting up target updates ...
  target_actor/mlp_fc0/w:0 <- actor/mlp_fc0/w:0
  target_actor/mlp_fc0/b:0 <- actor/mlp_fc0/b:0
  target_actor/mlp_fc1/w:0 <- actor/mlp_fc1/w:0
  target_actor/mlp_fc1/b:0 <- actor/mlp_fc1/b:0
  target_actor/dense/kernel:0 <- actor/dense/kernel:0
  target_actor/dense/bias:0 <- actor/dense/bias:0
setting up target updates ...
  target_critic/mlp_fc0/w:0 <- critic/mlp_fc0/w:0
  target_critic/mlp_fc0/b:0 <- critic/mlp_fc0/b:0
  target_critic/mlp_fc1/w:0 <- critic/mlp_fc1/w:0
  target_critic/mlp_fc1/b:0 <- critic/mlp_fc1/b:0
  target_critic/output/kernel:0 <- critic/output/kernel:0
  target_critic/output/bias:0 <- critic/output/bias:0
Using agent with the following configuration:
dict_items([('clip_norm', None), ('target_init_updates', [<tf.Operation 'group_deps_4' type=NoOp>, <tf.Operation 'group_deps_6' type=NoOp>]), ('critic_with_actor_tf', <tf.Tensor 'clip_by_value_3:0' shape=(?, 1) dtype=float32>), ('perturb_adaptive_policy_ops', <tf.Operation 'group_deps_1' type=NoOp>), ('return_range', (-inf, inf)), ('obs1', <tf.Tensor 'obs1:0' shape=(?, 16) dtype=float32>), ('perturbed_actor_tf', <tf.Tensor 'param_noise_actor/Tanh_2:0' shape=(?, 4) dtype=float32>), ('actor_tf', <tf.Tensor 'actor/Tanh_2:0' shape=(?, 4) dtype=float32>), ('memory', <baselines.ddpg.memory.Memory object at 0x7f33d5a49b00>), ('actor_optimizer', <baselines.common.mpi_adam.MpiAdam object at 0x7f33c0ad6e80>), ('normalize_observations', True), ('critic_optimizer', <baselines.common.mpi_adam.MpiAdam object at 0x7f341f6cdb70>), ('terminals1', <tf.Tensor 'terminals1:0' shape=(?, 1) dtype=float32>), ('batch_size', 64), ('actor_grads', <tf.Tensor 'concat:0' shape=(5508,) dtype=float32>), ('actor_loss', <tf.Tensor 'Neg:0' shape=() dtype=float32>), ('initial_state', None), ('stats_ops', [<tf.Tensor 'Mean_3:0' shape=() dtype=float32>, <tf.Tensor 'Mean_4:0' shape=() dtype=float32>, <tf.Tensor 'Mean_5:0' shape=() dtype=float32>, <tf.Tensor 'Sqrt_1:0' shape=() dtype=float32>, <tf.Tensor 'Mean_8:0' shape=() dtype=float32>, <tf.Tensor 'Sqrt_2:0' shape=() dtype=float32>, <tf.Tensor 'Mean_11:0' shape=() dtype=float32>, <tf.Tensor 'Sqrt_3:0' shape=() dtype=float32>, <tf.Tensor 'Mean_14:0' shape=() dtype=float32>, <tf.Tensor 'Sqrt_4:0' shape=() dtype=float32>]), ('actor', <baselines.ddpg.models.Actor object at 0x7f33c2709358>), ('stats_sample', None), ('target_Q', <tf.Tensor 'add_2:0' shape=(?, 1) dtype=float32>), ('critic', <baselines.ddpg.models.Critic object at 0x7f33c2709320>), ('param_noise_stddev', <tf.Tensor 'param_noise_stddev:0' shape=() dtype=float32>), ('action_noise', None), ('observation_range', (-5.0, 5.0)), ('target_soft_updates', [<tf.Operation 'group_deps_5' type=NoOp>, <tf.Operation 'group_deps_7' type=NoOp>]), ('critic_loss', <tf.Tensor 'add_15:0' shape=() dtype=float32>), ('target_critic', <baselines.ddpg.models.Critic object at 0x7f33c2709470>), ('stats_names', ['obs_rms_mean', 'obs_rms_std', 'reference_Q_mean', 'reference_Q_std', 'reference_actor_Q_mean', 'reference_actor_Q_std', 'reference_action_mean', 'reference_action_std', 'reference_perturbed_action_mean', 'reference_perturbed_action_std']), ('ret_rms', None), ('critic_tf', <tf.Tensor 'clip_by_value_2:0' shape=(?, 1) dtype=float32>), ('normalized_critic_with_actor_tf', <tf.Tensor 'critic_1/output/BiasAdd:0' shape=(?, 1) dtype=float32>), ('gamma', 0.99), ('action_range', (-1.0, 1.0)), ('adaptive_policy_distance', <tf.Tensor 'Sqrt:0' shape=() dtype=float32>), ('normalize_returns', False), ('reward_scale', 1.0), ('critic_target', <tf.Tensor 'critic_target:0' shape=(?, 1) dtype=float32>), ('param_noise', AdaptiveParamNoiseSpec(initial_stddev=0.2, desired_action_stddev=0.2, adoption_coefficient=1.01)), ('enable_popart', False), ('actions', <tf.Tensor 'actions:0' shape=(?, 4) dtype=float32>), ('critic_grads', <tf.Tensor 'concat_2:0' shape=(5569,) dtype=float32>), ('perturb_policy_ops', <tf.Operation 'group_deps' type=NoOp>), ('normalized_critic_tf', <tf.Tensor 'critic/output/BiasAdd:0' shape=(?, 1) dtype=float32>), ('obs_rms', <baselines.common.mpi_running_mean_std.RunningMeanStd object at 0x7f33c2709eb8>), ('actor_lr', 0.0001), ('critic_lr', 0.001), ('obs0', <tf.Tensor 'obs0:0' shape=(?, 16) dtype=float32>), ('critic_l2_reg', 0.01), ('rewards', 
<tf.Tensor 'rewards:0' shape=(?, 1) dtype=float32>), ('target_actor', <baselines.ddpg.models.Actor object at 0x7f33c246e940>), ('tau', 0.01)])
---------------------------------------------
| obs_rms_mean                   | 0.49     |
| obs_rms_std                    | 0.156    |
| param_noise_stddev             | 0.164    |
| reference_action_mean          | 0.029    |
| reference_action_std           | 0.773    |
| reference_actor_Q_mean         | -7.02    |
| reference_actor_Q_std          | 0.745    |
| reference_perturbed_action_... | 0.033    |
| reference_perturbed_action_std | 0.781    |
| reference_Q_mean               | -7.11    |
| reference_Q_std                | 0.674    |
| rollout/actions_mean           | 0.0536   |
| rollout/actions_std            | 0.659    |
| rollout/episode_steps          | 50       |
| rollout/episodes               | 40       |
| rollout/Q_mean                 | -2.97    |
| rollout/return                 | -49.8    |
| rollout/return_history         | -49.8    |
| rollout/return_history_std     | 1.25     |
| rollout/return_std             | 1.25     |
| total/duration                 | 12.2     |
| total/episodes                 | 40       |
| total/epochs                   | 1        |
| total/steps                    | 2e+03    |
| total/steps_per_second         | 164      |
| train/loss_actor               | 6.83     |
| train/loss_critic              | 0.808    |
| train/param_noise_distance     | 0.596    |
---------------------------------------------

---------------------------------------------
| obs_rms_mean                   | 0.502    |
| obs_rms_std                    | 0.146    |
| param_noise_stddev             | 0.134    |
| reference_action_mean          | 0.107    |
| reference_action_std           | 0.784    |
| reference_actor_Q_mean         | -11.5    |
| reference_actor_Q_std          | 3.23     |
| reference_perturbed_action_... | 0.319    |
| reference_perturbed_action_std | 0.651    |
| reference_Q_mean               | -11.8    |
| reference_Q_std                | 2.89     |
| rollout/actions_mean           | 0.0836   |
| rollout/actions_std            | 0.686    |
| rollout/episode_steps          | 50       |
| rollout/episodes               | 80       |
| rollout/Q_mean                 | -6.57    |
| rollout/return                 | -49.8    |
| rollout/return_history         | -49.8    |
| rollout/return_history_std     | 0.972    |
| rollout/return_std             | 0.972    |
| total/duration                 | 22.9     |
| total/episodes                 | 80       |
| total/epochs                   | 2        |
| total/steps                    | 4e+03    |
| total/steps_per_second         | 175      |
| train/loss_actor               | 12       |
| train/loss_critic              | 1.98     |
| train/param_noise_distance     | 0.326    |
---------------------------------------------

Running trained model
Creating window glfw
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
matthieu637 commented 5 years ago

It's expected that vanilla PeNFAC can't easily solve this task because of the sparse rewards (the same goes for vanilla DDPG, vanilla PPO, etc.).

I developed a "data augmentation" module for PeNFAC similar to HER (even though we cannot really speak of off-policy replay here). See 426b203d8537de9c254d86c1939d8dc112ca6c10.
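
For readers unfamiliar with the idea, here is a minimal, generic sketch of HER-style hindsight relabeling in Python. It only illustrates the principle and is not ddrl's actual implementation: the transition layout, the compute_reward signature (the Gym robotics convention) and the nb_destination name (mirroring the hindsight_nb_destination hyperparameter below) are assumptions.

import numpy as np

def hindsight_relabel(episode, compute_reward, nb_destination=5, rng=np.random):
    # episode: list of dicts with keys 'obs', 'action', 'goal',
    #          'achieved_goal', 'next_achieved_goal'
    # compute_reward: env.compute_reward(achieved_goal, desired_goal, info)
    augmented = []
    T = len(episode)
    for t, tr in enumerate(episode):
        for _ in range(nb_destination):
            # replace the original goal with a goal actually achieved
            # later in the same trajectory ("future" strategy)
            k = rng.randint(t, T)
            new_goal = episode[k]['achieved_goal']
            new_tr = dict(tr)
            new_tr['goal'] = new_goal
            new_tr['reward'] = compute_reward(tr['next_achieved_goal'], new_goal, None)
            augmented.append(new_tr)
    return augmented

The relabeled transitions receive a reward of 0 whenever the substituted goal was reached, so the agent sees far more successful experience than with the original sparse reward alone.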

If you want to use it you need to:

1) change your config.ini to use "libddrl-hpenfac.so" instead of "libddrl-penfac.so";
2) add the command-line argument "--goal-based" when you call python run.py;
3) add the hyperparameter "hindsight_nb_destination=5" to the [agent] section of config.ini.

Here are preliminary results on the environment you tried (see attached screenshot, Screenshot_20190626_150630).

config.ini:

...
[agent]
gamma=0.98
decision_each=1

#policy
noise=0.2
gaussian_policy=1

hidden_unit_v=64:64
hidden_unit_a=64:64
momentum=0
actor_output_layer_type=2
hidden_layer_type=1

#learning
alpha_a=0.0001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0

update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3

stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true

hindsight_nb_destination=5
huangjiancong1 commented 5 years ago

OK. What do you suggest we do next? Should we use success_rate or the last reward for the comparison?

huangjiancong1 commented 5 years ago

@matthieu637 I modified the gym/run.py file to output the success_rate like this:

    while sample_steps_counter < total_max_steps + testing_each * max_steps:
        if episode % display_log_each == 0:
            success_rate = (results[-1]+max_steps)/max_steps if len(results) > 0 else 0
            n_epoch = episode // display_log_each
            print('n_epoch', n_epoch, 'success rate', success_rate)
            writer.add_scalar(env_name+'success_rate_hpenfac', success_rate, n_epoch+1)
            print('episode', episode, 'total steps', sample_steps_counter, 'last perf', results[-1] if len(results) > 0 else 0)
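
For reference, here is what this expression measures, assuming the standard Fetch sparse reward (-1 on every step where the goal is not satisfied, 0 otherwise) and max_steps = 50; the numeric value below is made up:

max_steps = 50
episode_return = -38.0   # hypothetical value of results[-1]
# with the -1/0 sparse reward, (return + max_steps) / max_steps is the
# fraction of the 50 steps on which the goal condition held, not a
# binary "goal reached at the end of the episode" flag
success_rate = (episode_return + max_steps) / max_steps
print(success_rate)      # 0.24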

And the comparison of success_rate between DDPG+HER and hpenfac with the original hyperparameters (plots attached as original_ddpgher and original_hpenfac):

original_ddpgher: ~$ python -m baselines.run --alg=her --env=FetchPush-v1 --num_timesteps=2.5e6
original_hpenfac: ~$ python run.py --goal-based

Here I used 2.5e6 total_max_steps and config.ini like this:

[simulation]
total_max_steps=2500000
testing_each=10
#number of trajectories for testing
testing_trials=10

dump_log_each=50
display_log_each=100
save_agent_each=100000

library=/home/jim/ddrl/agent/cacla/lib/libddrl-hpenfac.so
; env_name=RoboschoolHalfCheetah-v1
; env_name=HalfCheetah-v2
env_name=FetchPush-v1

[agent]
gamma=0.98
decision_each=1

#policy
noise=0.2
gaussian_policy=1

hidden_unit_v=64:64
hidden_unit_a=64:64
momentum=0
actor_output_layer_type=2
hidden_layer_type=1

#learning
alpha_a=0.0001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
reward_scale=1.0

vnn_from_scratch=false
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3

stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true

hindsight_nb_destination=5

I plan to use the hyperparameters from DDPG+HER as described here in Sec. 2.2. Can you share some tips on how to modify your code to use the same hyperparameters?

huangjiancong1 commented 5 years ago

On FetchReach:

[simulation]
total_max_steps=2500000
testing_each=10
#number of trajectories for testing
testing_trials=10

dump_log_each=50
display_log_each=100
save_agent_each=100000

library=/home/jim/ddrl/agent/cacla/lib/libddrl-hpenfac.so
env_name=FetchReach-v1

[agent]
gamma=0.98
decision_each=1

#policy
noise=0.2
gaussian_policy=1

hidden_unit_v=64:64
hidden_unit_a=64:64
momentum=0
actor_output_layer_type=2
hidden_layer_type=1

#learning
alpha_a=0.0001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
reward_scale=1.0

vnn_from_scratch=false
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3

stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true

hindsight_nb_destination=5

reach_original: ~$ python -m baselines.run --alg=her --env=FetchReach-v1 --num_timesteps=2.5e6

matthieu637 commented 5 years ago

For PeNFAC, you're computing a success rate equivalent to "how many times I reached the goal within one episode", whereas in HER it only checks "if the goal was reached at the end of the episode". The two curves are not comparable: PeNFAC is penalized since the intermediate steps count as failures.
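
In code, the difference between the two metrics amounts to something like the sketch below (assuming per-step success flags are available, e.g. from info['is_success']; this is an illustration, not either actual implementation):

def within_episode_rate(step_successes):
    # PeNFAC-style metric above: fraction of steps on which the goal held
    return sum(step_successes) / len(step_successes)

def final_step_success(step_successes):
    # HER-style metric: did the goal hold on the last step of the episode?
    return float(step_successes[-1])

steps = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]   # toy 10-step episode
print(within_episode_rate(steps))        # 0.6
print(final_step_success(steps))         # 1.0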

huangjiancong1 commented 5 years ago

@matthieu637 Did you mean I should use the last perf from every test episode to calculate the success_rate?

huangjiancong1 commented 5 years ago

@matthieu637 I used this Python script to plot the success_rate from 0.1.monitor.csv:

import numpy as np
from numpy import genfromtxt
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams.update({'font.size': 12})
plt.rcParams["font.family"] = "Times New Roman"

episodes = 800
epochs = 200
env = 'FetchPush-v1'

# column 3 is the per-episode is_success flag; drop the header row,
# which genfromtxt reads as NaN
data = genfromtxt('0.1.monitor.csv', delimiter=',')
is_success = data[:, 3][1:len(data)]
to_epoch = is_success.reshape(epochs, episodes)

# average the success flags over the episodes of each epoch
x, y = [], []
for epoch, last_perfs in enumerate(to_epoch):
    success_rate = np.sum(last_perfs) / episodes
    x = x + [epoch]
    y = y + [success_rate]

plt.figure(figsize=(15, 10))
plt.plot(x, y, marker='o', linestyle='-', markersize=2, linewidth=1, label='hpenfac')

plt.xlabel('n_epoch')
plt.ylabel('success rate')
plt.title(env)
plt.legend(loc=2)
plt.savefig(env + '.png')
plt.show()
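
If the column layout of the monitor file is in doubt, it can be checked first; the snippet below assumes the standard baselines Monitor format (one '#'-prefixed JSON metadata line, then a CSV whose header is r, l, t plus any extra info keywords such as is_success):

with open('0.1.monitor.csv') as f:
    f.readline()                  # '#'-prefixed JSON metadata line
    header = f.readline().strip()
print(header)                     # expected something like: r,l,t,is_success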

The performance with the 64:64 networks is as follows (see attached plot, FetchPush-v1):

hyperparameters:

[simulation]
total_max_steps=8000000
testing_each=10
#number of trajectories for testing
testing_trials=10

dump_log_each=50
display_log_each=100
save_agent_each=100000

library=/home/jim/ddrl/agent/cacla/lib/libddrl-hpenfac.so
env_name=FetchPush-v1

[agent]
gamma=0.98
decision_each=1

#policy
noise=0.2
gaussian_policy=1

hidden_unit_v=64:64
hidden_unit_a=64:64
momentum=0
actor_output_layer_type=2
hidden_layer_type=1

#learning
alpha_a=0.0001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
reward_scale=1.0

vnn_from_scratch=false
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3

stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true

hindsight_nb_destination=5

The performance with the 256:256:256 networks is as follows (see attached plot, FetchPush-v1):

hyperparameters:

[simulation]
total_max_steps=8000000
testing_each=10
#number of trajectories for testing
testing_trials=10

dump_log_each=50
display_log_each=100
save_agent_each=100000

library=../agent/cacla/lib/libddrl-hpenfac.so
env_name=FetchPush-v1

[agent]
gamma=0.98
decision_each=1

#policy
noise=0.2
gaussian_policy=1

hidden_unit_v=256:256:256
hidden_unit_a=256:256:256
momentum=0
actor_output_layer_type=2
hidden_layer_type=3

#learning
alpha_a=0.001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
reward_scale=1.0

vnn_from_scratch=false
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3

stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true

hindsight_nb_destination=5
matthieu637 commented 5 years ago

That's strange; using your Python script, here's what I got (64x64 units, see attached Figure_1):

Have you pulled the latest version and rebuilt the ddrl libraries? The only difference between our config.ini files is that, in my case, I have:

...
testing_each=1
testing_trials=1
...
huangjiancong1 commented 5 years ago

Are you sure it is training on FetchPush and not FetchReach (env_name=FetchPush-v1 in config.ini)? If FetchPush really reaches this performance, it is hard to believe; here is the performance reported in the paper (see attached image).

matthieu637 commented 5 years ago

My bad, I'm talking about FetchReach-v1. For Reach, I guess you have to start optimizing the hyperparameters.

huangjiancong1 commented 5 years ago

Thanks @matthieu637, I can run DDPG+HER later to see its performance on FetchReach. But FetchReach was not compared in the paper (https://arxiv.org/abs/1707.01495); I think it is hard to compare on it because its performance varies a lot and it reaches a 100% success rate very quickly.

I don't understand what you mean in the second sentence. I am manually changing the hyperparameters for FetchPush and studying how to use lhpo, but I worry that I cannot finish the hyperparameter optimization before the deadline.

huangjiancong1 commented 5 years ago

@matthieu637 For the FetchReach-v1 task, DDPG+HER performs extremely well: it only needs 4 epochs x 10 episodes x 50 timesteps = 2000 total timesteps to reach a 100% test success rate.

 jim@jim-Inspiron-7577:~/baselines $ python -m baselines.run --alg=her --env=FetchReach-v1 --num_timesteps=8e5 --n_cycles=10
/home/jim/anaconda2/envs/clustering/lib/python3.5/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.24.2) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Logging to /tmp/openai-2019-06-28-20-56-42-334498
env_type: robotics
2019-06-28 20:56:43.049827: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-06-28 20:56:43.051216: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2019-06-28 20:56:43.051249: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: jim-Inspiron-7577
2019-06-28 20:56:43.051259: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: jim-Inspiron-7577
2019-06-28 20:56:43.051294: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 410.78.0
2019-06-28 20:56:43.051319: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  410.78  Sat Nov 10 22:09:04 CST 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11) 
"""
2019-06-28 20:56:43.051335: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 410.78.0
2019-06-28 20:56:43.051343: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 410.78.0
Training her on robotics:FetchReach-v1 with arguments 
{'network': 'mlp', 'n_cycles': 10}
T: 50
_Q_lr: 0.001
_action_l2: 1.0
_batch_size: 256
_buffer_size: 1000000
_clip_obs: 200.0
_hidden: 256
_layers: 3
_max_u: 1.0
_network_class: baselines.her.actor_critic:ActorCritic
_norm_clip: 5
_norm_eps: 0.01
_pi_lr: 0.001
_polyak: 0.95
_relative_goals: False
_scope: ddpg
aux_loss_weight: 0.0078
bc_loss: 0
ddpg_params: {'batch_size': 256, 'max_u': 1.0, 'action_l2': 1.0, 'network_class': 'baselines.her.actor_critic:ActorCritic', 'norm_clip': 5, 'polyak': 0.95, 'buffer_size': 1000000, 'layers': 3, 'clip_obs': 200.0, 'scope': 'ddpg', 'norm_eps': 0.01, 'hidden': 256, 'relative_goals': False, 'pi_lr': 0.001, 'Q_lr': 0.001}
demo_batch_size: 128
env_name: FetchReach-v1
gamma: 0.98
make_env: <function prepare_params.<locals>.make_env at 0x7f12f5fac510>
n_batches: 40
n_cycles: 10
n_test_rollouts: 10
noise_eps: 0.2
num_demo: 100
prm_loss_weight: 0.001
q_filter: 0
random_eps: 0.3
replay_k: 4
replay_strategy: future
rollout_batch_size: 1
test_with_polyak: False

*** Warning ***
You are running HER with just a single MPI worker. This will work, but the experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) were obtained with --num_cpu 19. This makes a significant difference and if you are looking to reproduce those results, be aware of this. Please also refer to https://github.com/openai/baselines/issues/314 for further details.
****************

Creating a DDPG agent with action space 4 x 1.0...
Training...
---------------------------------
| epoch              | 0        |
| stats_g/mean       | 0.914    |
| stats_g/std        | 0.107    |
| stats_o/mean       | 0.271    |
| stats_o/std        | 0.0339   |
| test/episode       | 10       |
| test/mean_Q        | -0.356   |
| test/success_rate  | 0.6      |
| train/episode      | 10       |
| train/success_rate | 0        |
---------------------------------
---------------------------------
| epoch              | 1        |
| stats_g/mean       | 0.885    |
| stats_g/std        | 0.112    |
| stats_o/mean       | 0.264    |
| stats_o/std        | 0.0351   |
| test/episode       | 20       |
| test/mean_Q        | -1.02    |
| test/success_rate  | 0.7      |
| train/episode      | 20       |
| train/success_rate | 0.7      |
---------------------------------
---------------------------------
| epoch              | 2        |
| stats_g/mean       | 0.881    |
| stats_g/std        | 0.11     |
| stats_o/mean       | 0.263    |
| stats_o/std        | 0.035    |
| test/episode       | 30       |
| test/mean_Q        | -0.579   |
| test/success_rate  | 1        |
| train/episode      | 30       |
| train/success_rate | 0.8      |
---------------------------------
---------------------------------
| epoch              | 3        |
| stats_g/mean       | 0.874    |
| stats_g/std        | 0.107    |
| stats_o/mean       | 0.261    |
| stats_o/std        | 0.0343   |
| test/episode       | 40       |
| test/mean_Q        | -0.553   |
| test/success_rate  | 1        |
| train/episode      | 40       |
| train/success_rate | 0.8      |
---------------------------------
---------------------------------
| epoch              | 4        |
| stats_g/mean       | 0.874    |
| stats_g/std        | 0.103    |
| stats_o/mean       | 0.261    |
| stats_o/std        | 0.0335   |
| test/episode       | 50       |
| test/mean_Q        | -0.526   |
| test/success_rate  | 1        |
| train/episode      | 50       |
| train/success_rate | 1        |
---------------------------------
matthieu637 commented 5 years ago

Fixed in the last commit.

(See attached plot: Screenshot_20190918_103642.)

Produced without hyperparameter optimization, using:

[simulation]
total_max_steps = 2000000
testing_each = 100
testing_trials = 40
dump_log_each = 1
display_log_each = 200
save_agent_each = 10000000
library = ..../ddrl/agent/cacla/lib/libddrl-hpenfac.so
env_name=FetchPush-v1

[agent]
gamma = 0.98
noise = 0.35
gaussian_policy = 1
hidden_unit_v = 256:256:256
hidden_unit_a = 256:256:256
actor_output_layer_type = 2
hidden_layer_type = 3
alpha_a = 0.0005
alpha_v = 0.001

number_fitted_iteration = 10
stoch_iter_critic = 1
lambda = 0.6
gae = true
update_each_episode = 40
stoch_iter_actor = 10
beta_target = 0.03
ignore_poss_ac = false
conserve_beta = true
disable_cac = false
disable_trust_region = true
hindsight_nb_destination = 3
huangjiancong1 commented 4 years ago

@matthieu637 Hi Mat, do you remember which trick you used to make the hindsight augmentation work with PeNFAC?