Open snicolet opened 5 years ago
The assignment bk[:] = v.squeeze()
is not dimension-consistent, so the try/except
block falls into debug mode.
See https://github.com/pytorch/ELF/blob/master/src_py/elf/utils_elf.py#L211
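The failure can be reproduced outside ELF. A minimal sketch, using NumPy arrays as a stand-in for the torch tensors in utils_elf.py (the sizes 2 and 128 are the ones observed in this thread):

```python
import numpy as np

# bk was allocated for a batch of 2, but v carries 128 values.
bk = np.ones(2, dtype=np.int64)     # stand-in for the ELF-side buffer
v = np.arange(128, dtype=np.int64)  # stand-in for the incoming batch

try:
    bk[:] = v.squeeze()  # shape (2,) cannot hold shape (128,)
except ValueError as e:
    # torch raises a RuntimeError in the same situation;
    # either way, utils_elf.py catches it and drops into debug mode.
    print("copy failed:", e)
```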
Could you print out the size of bk and the size of v here?
Hi Yuandong,
bk has size 2 and is equal to: tensor([1, 1]). v.squeeze() has size 128 and is equal to: tensor([172, 75, 90, 177, 6, 147, 189, 71, 181, 165, 85, 69, 141, 27, 59, 25, 87, 104, 153, 161, 108, 129, 136, 174, 173, 54, 85, 177, 82, 138, 170, 3, 91, 187, 68, 30, 166, 15, 45, 47, 41, 48, 160, 89, 122, 106, 178, 190, 63, 103, 29, 174, 164, 48, 39, 12, 168, 35, 44, 115, 64, 12, 108, 138, 13, 98, 173, 6, 188, 57, 98, 180, 94, 163, 25, 49, 2, 135, 73, 88, 143, 111, 61, 172, 42, 164, 160, 138, 91, 0, 127, 94, 78, 64, 179, 2, 86, 92, 137, 47, 170, 161, 82, 188, 44, 56, 6, 16, 113, 185, 82, 51, 57, 189, 41, 40, 126, 10, 30, 175, 42, 15, 9, 173, 149, 147, 110, 180], device='cuda:0'). In Breakthrough, action values are between 0 and 192.
Thanks.
Different runs give different results, but the size of v.squeeze() is always 64 times the size of bk, and bk is always filled with ones.
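That constant 64x ratio is itself a clue that the buffer extents were declared with too small a batch size. A quick sanity check (a hypothetical helper for this thread, not part of ELF) that compares a declared buffer size against a received tensor size:

```python
def diagnose_size_mismatch(declared: int, received: int) -> str:
    """Report how a received tensor size relates to the declared buffer size."""
    if declared == received:
        return "sizes match"
    if received % declared == 0:
        factor = received // declared
        return (f"received size is {factor}x the declared size; "
                f"the extents were likely declared with batchsize = {declared} "
                f"instead of {received}")
    return "sizes differ by a non-integer factor"

print(diagnose_size_mismatch(2, 128))
```

With the sizes observed here (2 and 128), this reports a 64x factor, consistent with a misconfigured batch size.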
When you call e.addField<int64_t>("a") somewhere in the code, make sure .addExtents has the correct size. E.g., in your case it should be e.addField<int64_t>("a").addExtents(batchsize, {batchsize}) where batchsize = 128. If you called it with batchsize = 2 but sent a vector of dim=128, you will see this error.
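Conceptually, the extents declared on the C++ side fix the shape of the Python-side buffer, and the copy in utils_elf.py only succeeds when the two agree. A minimal Python illustration of that contract (hypothetical names, using NumPy in place of torch; this is not the ELF API):

```python
import numpy as np

def make_field_buffer(batchsize: int) -> np.ndarray:
    # Mirrors addExtents(batchsize, {batchsize}): one int64 slot per batch entry.
    return np.zeros(batchsize, dtype=np.int64)

def fill_field(buf: np.ndarray, values: np.ndarray) -> None:
    # The same dimension check that bk[:] = v.squeeze() performs implicitly.
    if buf.shape != values.shape:
        raise ValueError(f"declared {buf.shape}, received {values.shape}")
    buf[:] = values

buf = make_field_buffer(128)  # batchsize = 128, matching the producer
fill_field(buf, np.arange(128, dtype=np.int64))  # succeeds: shapes agree
```

Declaring the buffer with batchsize = 2 while the producer sends 128 values would raise at the shape check, which is the analogue of the error seen in this issue.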
Hi,
While working on a sub-branch of Olivier Teytaud's branch called "newtasks" (which uses the ELF framework for arbitrary abstract games), we stumbled on a possible GPU configuration error at run time, after a successful compile.
Steps to reproduce:
Note that we forced the number of GPUs to one by changing line 53 of
src_py/rlpytorch/model_loader.py
to "1" rather than the default "-1": this was necessary to avoid a GPU run-time error in df_model3.py. But now we get the following error about copying two tensors of different sizes at line 191 of utils_elf.py:
Would you have any idea what our error may be? Thanks in advance!