torch / cutorch

A CUDA backend for Torch7

th -e "require 'cutorch'" ...s/anthonyyuan/torch/install/share/lua/5.1/trepl/init.lua:389: attempt to index a string value #660

Open anthonyyuan opened 7 years ago

Cadene commented 7 years ago

I get the same error after a clean torch install.

Cadene commented 7 years ago

I looked at the latest pull requests and found this one: https://github.com/torch/cutorch/pull/634

$ cd ~/downloads
$ git clone https://github.com/elikosan/cutorch.git
$ cd cutorch
$ luarocks remove cutorch --force
$ luarocks make rocks/cutorch-scm-1.rockspec
$ th
> require 'cutorch'

it works for me

soumith commented 7 years ago

i'm checking with a new install

soumith commented 7 years ago

i'm not able to reproduce it with a fresh torch install. do i have to install it on a specific OS or version?

Cadene commented 7 years ago

$ uname -a
Linux pas 4.7.0-1-amd64 #1 SMP Debian 4.7.8-1 (2016-10-19) x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux testing (stretch)
Release:    testing
Codename:   stretch

I just did this to reproduce the issue.

$ cd ~/Downloads
$ git clone https://github.com/torch/distro.git torch --recursive
$ cd torch
$ bash install-deps;
Only Jessie Debian 8 is supported for now, aborting.
$ ./install.sh
$ . /home/cadene/Downloads/torch/install/bin/torch-activate
$ th
> require 'cutorch'
...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: attempt to index a string value
stack traceback:
    [C]: in function 'error'
    ...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
    [string "_RESULT={require 'cutorch'}"]:1: in main chunk
    [C]: in function 'xpcall'
    ...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:661: in function 'repl'
    ...oads/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:204: in main chunk
    [C]: at 0x00405b60

soumith commented 7 years ago

@Cadene can you help me debug this one. Can you run:

luajit
> require 'cutorch'

Also, if that fails,

luajit -llibcutorch

Cadene commented 7 years ago

$ luajit
LuaJIT 2.1.0-beta1 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/

 _____              _     
|_   _|            | |    
  | | ___  _ __ ___| |__  
  | |/ _ \| '__/ __| '_ \ 
  | | (_) | | | (__| | | |
  \_/\___/|_|  \___|_| |_|

JIT: ON SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
th> require 'cutorch'
attempt to index a string value
stack traceback:
    [C]: at 0x7f079da81d00
    [C]: in function 'require'
    ...e/Downloads/torch/install/share/lua/5.1/cutorch/init.lua:2: in main chunk
    [C]: in function 'require'
    stdin:1: in main chunk
    [C]: at 0x00405b60
th> ^C
$ luajit -llibcutorch
luajit: Torch internal problem: cannot find metatable for type <torch.Allocator>
stack traceback:
    [C]: at 0x7f6017543d00
    [C]: at 0x00463180
    [C]: at 0x00405b60

soumith commented 7 years ago

oh. for some reason, there seems to be a global variable called "require" (i.e. _G.require) that is a string. This is very strange.

soumith commented 7 years ago

does this happen when loading any other package? like:

require 'nn'

I will try to reproduce this somewhere.

Cadene commented 7 years ago

same with cunn

...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: ...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: attempt to index a string value
stack traceback:
    [C]: in function 'error'
    ...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
    [string "_RESULT={require 'cunn'}"]:1: in main chunk
    [C]: in function 'xpcall'
    ...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:661: in function 'repl'
    ...oads/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:204: in main chunk
    [C]: at 0x00405b60

Cadene commented 7 years ago

nn, image, rnn, tds, and torchnet all work. What else could I try?

soumith commented 7 years ago

hmmm. i think any trigger to paths.require is failing. Can you try:

paths.require('nn')

soumith commented 7 years ago

and if that fails too, any chance you can give me ssh access to the machine? it will take me much longer to set up a Debian machine.

All you will have to do is run a command on your machine to ssh into my server, so that i can get a reverse tunnel. Let's talk details on torch slack

Cadene commented 7 years ago

th> paths.require('nn')
module 'nn' not found
    no file '/home/cadene/.luarocks/lib/lua/5.1/nn.so'
    no file '/home/cadene/Downloads/torch/install/lib/lua/5.1/nn.so'
    no file '/home/cadene/Downloads/torch/install/lib/nn.so'
    no file '/home/cadene/torch-pascal/install/lib/nn.so'
    no file '/home/cadene/torch-pascal/install/lib/lua/5.1/nn.so'
    no file './nn.so'
    no file '/usr/local/lib/lua/5.1/nn.so'
    no file '/usr/local/lib/lua/5.1/loadall.so'
stack traceback:
    [C]: in function 'require'
    [string "_RESULT={paths.require('nn')}"]:1: in main chunk
    [C]: in function 'xpcall'
    ...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:661: in function 'repl'
    ...oads/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:204: in main chunk
    [C]: at 0x00405b60  
                                                                      [0.0062s] 
th> require 'nn'
{
  VolumetricMaxUnpooling : {...}
[...]
  SpatialFractionalMaxPooling : {...}
}
                                                                      [0.1741s] 
th> paths.require('nn')
{
  VolumetricMaxUnpooling : {...}
[...]
  SpatialFractionalMaxPooling : {...}
}

ruotianluo commented 7 years ago

What is the current status of this issue?

I have a similar issue, but with a different error message.

$ luajit
LuaJIT 2.1.0-beta1 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/

 _____              _
|_   _|            | |
  | | ___  _ __ ___| |__
  | |/ _ \| '__/ __| '_ \
  | | (_) | | | (__| | | |
  \_/\___/|_|  \___|_| |_|

JIT: ON SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
th> require 'cutorch'
...s/rluo/rluo/torch/install/share/lua/5.1/torch/Tensor.lua:104: bad argument #1 to 'rawset' (table expected, got nil)
stack traceback:
    [C]: in function 'rawset'
    ...s/rluo/rluo/torch/install/share/lua/5.1/torch/Tensor.lua:104: in main chunk
    [C]: in function 'require'
    ...nfs/rluo/rluo/torch/install/share/lua/5.1/torch/init.lua:155: in main chunk
    [C]: in function 'require'
    ...s/rluo/rluo/torch/install/share/lua/5.1/cutorch/init.lua:1: in main chunk
    [C]: in function 'require'
    stdin:1: in main chunk
    [C]: at 0x004064f0

soumith commented 7 years ago

@ruotianluo what OS? Ubuntu? Debian?

ruotianluo commented 7 years ago

@soumith CentOS Linux release 7.2.1511 (Core)

ruotianluo commented 7 years ago

@soumith So what is actually causing this problem? (BTW, I ran into it after trying to update to the latest torch, cutorch and cunn; I also tried a fresh install.)

drimpossible commented 7 years ago

I got to this thread while searching for a solution to this very issue. I am getting the same error after updating my torch, nn, cunn, cudnn and cutorch libs.



  ______             __   |  Torch7 
 /_  __/__  ________/ /   |  Scientific computing for Lua. 
  / / / _ \/ __/ __/ _ \  |  Type ? for help 
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch 
                          |  http://torch.ch 

th> require 'cunn'
.../ameya.prabhu/torch/install/share/lua/5.1/trepl/init.lua:389: .../ameya.prabhu/torch/install/share/lua/5.1/trepl/init.lua:389: attempt to index a string value
stack traceback:
    [C]: in function 'error'
    .../ameya.prabhu/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
    [string "_RESULT={require 'cunn'}"]:1: in main chunk
    [C]: in function 'xpcall'
    .../ameya.prabhu/torch/install/share/lua/5.1/trepl/init.lua:661: in function 'repl'
    ...abhu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:204: in main chunk
    [C]: at 0x00406670  
                                                                      [0.1723s] 
th> exit
Do you really want to exit ([y]/n)? y
ameya.prabhu@magnetar:~/MulLowBiVQA$ uname -a
Linux magnetar 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
ameya.prabhu@magnetar:~/MulLowBiVQA$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:   trusty
ameya.prabhu@magnetar:~/MulLowBiVQA$

soumith commented 7 years ago

this is so frustrating, i am not able to reproduce this issue anywhere. If anyone gave me access to their machine via ssh where this reproduces, i can take a look

drimpossible commented 7 years ago

I can give you ssh access to my server. What's strange is that those commands run just fine on my personal desktop. The only major difference I know of is the CUDA version: I have 8 on my personal desktop and 7.5 on the server. Does it occur on servers with CUDA 7.5? I'm afraid I don't know the details here, but the errors seem to occur only when I try to load a CUDA-based library.

soumith commented 7 years ago

okay, can you email me at [redacted] so we can figure out ssh access details? No, it is not CUDA 7.5, i've already tested this.

ruotianluo commented 7 years ago

@DrImpossible I'm in almost the same situation, but my desktop also has CUDA 7.5.

ruotianluo commented 7 years ago

@soumith Any progress?

soumith commented 7 years ago

until i get a reproduction, i don't know how to fix it. any publicly reachable ssh access (so that i can log in) to a machine that has this problem will be helpful.

ruotianluo commented 7 years ago

@soumith Using binary search, I found that the error doesn't appear if I roll all the repositories back to before 12.28, and that it does appear if I roll back to around 12.30.

Then I tried to find exactly which commit in which package causes the error. It turns out that if I check out cutorch at commit https://github.com/torch/cutorch/commit/1ac06689dba1a4a672ed1fb3c3117000a46d7af5, I get the error. (I haven't checked other packages.)
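
A bisection like the one described above can also be automated. This is only a minimal sketch, assuming the search is confined to a single git checkout (for example cutorch) and that a known-good commit is available. bisect_first_bad is a hypothetical helper name; git bisect start, git bisect run and git bisect reset are standard git subcommands.

```shell
# bisect_first_bad: hypothetical helper, run it inside a git checkout.
# Usage: bisect_first_bad GOOD BAD CMD [ARGS...]
# CMD must exit 0 on a working commit and 1..124 on a broken one
# (the convention used by "git bisect run"; 125 means "skip this commit").
bisect_first_bad() {
    _bfb_good=$1
    _bfb_bad=$2
    shift 2
    git bisect start "$_bfb_bad" "$_bfb_good" >/dev/null || return 1
    _bfb_status=0
    git bisect run "$@" >/dev/null || _bfb_status=$?
    # After a successful run, refs/bisect/bad names the first bad commit.
    if [ "$_bfb_status" -eq 0 ]; then
        git rev-parse refs/bisect/bad
    fi
    git bisect reset >/dev/null 2>&1
    return "$_bfb_status"
}

# Example for this thread. GOOD_COMMIT is a placeholder, and the check assumes
# that luajit exits non-zero when the require fails:
#   bisect_first_bad GOOD_COMMIT HEAD sh -c \
#       'luarocks make rocks/cutorch-scm-1.rockspec && luajit -e "require(\"cutorch\")"'
```

Note that a bisection of this kind can only point at the commit where the symptom first shows up in that checkout; it says nothing about what else is installed on the machine.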

soumith commented 7 years ago

thanks for bisecting it. cc: @gchanan something broke on your commit.

gchanan commented 7 years ago

Great! Since I can't reproduce the issue, @ruotianluo can you revert the changes to init.lua and Tensor.lua from that commit separately and tell me if either (or both) fixes the issue?

ruotianluo commented 7 years ago

@gchanan Reverting either one, or both, doesn't fix the issue.

gchanan commented 7 years ago

@ruotianluo okay, let me prepare a few other commits for you to try out. Thanks for helping track this down!

gchanan commented 7 years ago

@ruotianluo can you run "nvcc --version" -- what version does it say you are running?

gchanan commented 7 years ago

@ruotianluo can you try the following branches and tell me if any of them work? (They are all single commits off the commit you identified.)
https://github.com/gchanan/cutorch/tree/torchgenericstorage
https://github.com/gchanan/cutorch/tree/genericstorage
https://github.com/gchanan/cutorch/tree/genericstoragetensor

gchanan commented 7 years ago

I should point out that these branches are just for testing "require 'cutorch'" -- functionality beyond that is expected to be broken.

ruotianluo commented 7 years ago

@gchanan
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

torchgenericstorage and genericstorage don't work (the same error). genericstoragetensor gets the following error:
/torch/install/share/lua/5.1/cutorch/init.lua:19: attempt to index field 'HalfStorage' (a nil value)

gchanan commented 7 years ago

can you try genericstoragetensor with init.lua and Tensor.lua rolled back as before?

ruotianluo commented 7 years ago

It works.

gchanan commented 7 years ago

hmm, I'm still not sure what's going on here -- thanks for your continuing help.

Can you try https://github.com/gchanan/cutorch/tree/thchalfh ? (it shouldn't matter what you do with init.lua and Tensor.lua)

ruotianluo commented 7 years ago

This doesn't work.

gchanan commented 7 years ago

How about the following?

https://github.com/gchanan/cutorch/tree/thchalfhreadwrite
https://github.com/gchanan/cutorch/tree/thchalfhreadwriteinit
https://github.com/gchanan/cutorch/tree/genericstoragetensorrestore

ruotianluo commented 7 years ago

None of these works.

gchanan commented 7 years ago

Something very strange is going on...like the symbol generation is getting mixed up between torch and cutorch.

Can you try https://github.com/gchanan/cutorch/tree/generateStorageTH?

ruotianluo commented 7 years ago

Doesn't work either.

gchanan commented 7 years ago

@ruotianluo I sent you an e-mail, it would probably be more productive if we were able to find a time that works for both of us to sit in the torch gitter and debug in real time.

In any case, can you do the following? First, confirm that this works: https://github.com/gchanan/cutorch/tree/genericstoragetensor (this is the same as genericstoragetensor with the Lua changes rolled back)

Then try:
https://github.com/gchanan/cutorch/tree/genericstoragetensor_gen
https://github.com/gchanan/cutorch/tree/genericstoragetensor_genseparate
https://github.com/gchanan/cutorch/tree/genericstoragetensor_genseparateHalf

ruotianluo commented 7 years ago

Only genericstoragetensor_genseparate works.

ruotianluo commented 7 years ago

Here is my confession about the cause of the problem 😭.

It turns out there was another, older torch installation on my system. In my case, at some point I had installed torch with luarocks install torch --local. Since LUA_PATH lists the local folder first, th loads the libraries from the local folder.

So check whether you have an old torch installed somewhere on your LUA_PATH, @Cadene @DrImpossible; it could be the same cause.
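
One way to check for this kind of shadowing is to list every file that the search path could resolve for a module, in search order. This is a minimal sketch, assuming a POSIX shell and the usual ';'-separated templates with '?' placeholders; find_lua_module is a hypothetical helper name, not part of Torch or LuaRocks.

```shell
# find_lua_module: hypothetical helper.
# Usage: find_lua_module MODULE [SEARCH_PATH]   (SEARCH_PATH defaults to $LUA_PATH)
# Prints every existing file the templates resolve to, first match first.
# More than one line of output means an earlier copy shadows a later one.
find_lua_module() (
    module=$1
    search_path=${2:-$LUA_PATH}
    # Lua maps dots in a module name to directory separators.
    rel=$(printf '%s' "$module" | tr '.' '/')
    IFS=';'
    set -f    # keep the shell from glob-expanding the '?' in each template
    for template in $search_path; do
        # An empty entry (";;") stands for Lua's built-in default path; skip it.
        [ -n "$template" ] || continue
        candidate=$(printf '%s' "$template" | sed "s|?|$rel|g")
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
)

# Examples:
#   find_lua_module torch
#   find_lua_module libcutorch "$LUA_CPATH"   # native libraries use LUA_CPATH
```

If the first line printed lives under ~/.luarocks while the install you expect is further down, the local copy is the one being loaded.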

And thanks to gchanan for his help.

drimpossible commented 7 years ago

I tried cleaning up the things mentioned above and ran into a lot more, so I can't pinpoint the problem precisely, but more or less it was an old torch installation. Path problems compounded the issue too. It works fine now. The above comment really helped. Thanks @ruotianluo

philgyford commented 6 years ago

@ruotianluo I think I have a similar problem - at some point I installed a version of torch that didn't work. I've tried again and got this far but am getting these errors. However, I'm not sure what my LUA_PATH should be, or where it's set! Any pointers? Currently I get:

-bash: /Users/phil/.luarocks/share/lua/5.1/?.lua;/Users/phil/.luarocks/share/lua/5.1/?/init.lua;/Users/phil/torch/install/share/lua/5.1/?.lua;/Users/phil/torch/install/share/lua/5.1/?/init.lua;./?.lua;/Users/phil/torch/install/share/luajit-2.1.0-beta1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/?/init.lua: No such file or directory

I can't work out what shouldn't be there... the couple of bits I've tried deleting just result in the same or different errors...

ruotianluo commented 6 years ago

@philgyford don't change your LUA_PATH, just delete your other versions.

philgyford commented 6 years ago

@ruotianluo Thanks, but it was a while ago and I don't know exactly what was installed where...

ruotianluo commented 6 years ago

@philgyford then I guess you need to search through the directories on your LUA_PATH to see which one it's in. Just to make sure: you did at least reinstall the latest torch somewhere, right?
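
The "-bash: ...: No such file or directory" message quoted earlier looks like what bash prints when the value of the variable is run as a command instead of being printed. To see the directories in search order, something along these lines may help. It is a minimal sketch, assuming ';'-separated templates of the form DIR/?...; lua_path_dirs is a hypothetical helper name.

```shell
# lua_path_dirs: hypothetical helper.
# Usage: lua_path_dirs [SEARCH_PATH]   (SEARCH_PATH defaults to $LUA_PATH)
# Prints each distinct directory from the search path once, in search order.
lua_path_dirs() {
    printf '%s\n' "${1:-$LUA_PATH}" |
        tr ';' '\n' |
        sed -e '/^$/d' -e 's|/[?].*$||' |
        awk '!seen[$0]++'
}

# Example: show which of those directories contain a torch package.
#   lua_path_dirs | while IFS= read -r dir; do
#       if [ -e "$dir/torch" ]; then echo "$dir"; fi
#   done
```

Any directory reported before the intended install is a candidate for the old copy.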