Open anthonyyuan opened 7 years ago
I looked through the most recent pull requests and found this one: https://github.com/torch/cutorch/pull/634
$ cd ~/downloads
$ git clone https://github.com/elikosan/cutorch.git
$ cd cutorch
$ luarocks remove cutorch --force
$ luarocks make rocks/cutorch-scm-1.rockspec
$ th
> require 'cutorch'
It works for me.
I'm checking with a new install.
I'm not able to reproduce it with a fresh torch install. Does it only happen on a specific OS or version?
$ uname -a
Linux pas 4.7.0-1-amd64 #1 SMP Debian 4.7.8-1 (2016-10-19) x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux testing (stretch)
Release: testing
Codename: stretch
I just did this to reproduce the issue.
$ cd ~/Downloads
$ git clone https://github.com/torch/distro.git torch --recursive
$ cd torch
$ bash install-deps;
Only Jessie Debian 8 is supported for now, aborting.
$ ./install.sh
$ . /home/cadene/Downloads/torch/install/bin/torch-activate
$ th
> require 'cutorch'
...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: attempt to index a string value
stack traceback:
[C]: in function 'error'
...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
[string "_RESULT={require 'cutorch'}"]:1: in main chunk
[C]: in function 'xpcall'
...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:661: in function 'repl'
...oads/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:204: in main chunk
[C]: at 0x00405b60
@Cadene can you help me debug this one? Can you run:
luajit
> require 'cutorch'
Also, if that fails,
luajit -llibcutorch
$ luajit
LuaJIT 2.1.0-beta1 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/
_____ _
|_ _| | |
| | ___ _ __ ___| |__
| |/ _ \| '__/ __| '_ \
| | (_) | | | (__| | | |
\_/\___/|_| \___|_| |_|
JIT: ON SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
th> require 'cutorch'
attempt to index a string value
stack traceback:
[C]: at 0x7f079da81d00
[C]: in function 'require'
...e/Downloads/torch/install/share/lua/5.1/cutorch/init.lua:2: in main chunk
[C]: in function 'require'
stdin:1: in main chunk
[C]: at 0x00405b60
th> ^C
$ luajit -llibcutorch
luajit: Torch internal problem: cannot find metatable for type <torch.Allocator>
stack traceback:
[C]: at 0x7f6017543d00
[C]: at 0x00463180
[C]: at 0x00405b60
Oh. For some reason, there seems to be a global variable called "require" (i.e. _G.require) that is a string. This is very strange.
Does this happen when loading any other package? For example:
require 'nn'
I will try to reproduce this somewhere.
same with cunn
...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: ...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: attempt to index a string value
stack traceback:
[C]: in function 'error'
...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
[string "_RESULT={require 'cunn'}"]:1: in main chunk
[C]: in function 'xpcall'
...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:661: in function 'repl'
...oads/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:204: in main chunk
[C]: at 0x00405b60
nn, image, rnn, tds and torchnet all work. What else could I try?
Hmm. I think any call that goes through paths.require is failing. Can you try:
paths.require('nn')
And if that fails too, is there any chance you can give me ssh access to the machine? It would take me much longer to set up a Debian box.
All you will have to do is run a command on your machine to ssh into my server, so that I can get a reverse tunnel. Let's talk details on the torch Slack.
th> paths.require('nn')
module 'nn' not found
no file '/home/cadene/.luarocks/lib/lua/5.1/nn.so'
no file '/home/cadene/Downloads/torch/install/lib/lua/5.1/nn.so'
no file '/home/cadene/Downloads/torch/install/lib/nn.so'
no file '/home/cadene/torch-pascal/install/lib/nn.so'
no file '/home/cadene/torch-pascal/install/lib/lua/5.1/nn.so'
no file './nn.so'
no file '/usr/local/lib/lua/5.1/nn.so'
no file '/usr/local/lib/lua/5.1/loadall.so'
stack traceback:
[C]: in function 'require'
[string "_RESULT={paths.require('nn')}"]:1: in main chunk
[C]: in function 'xpcall'
...ene/Downloads/torch/install/share/lua/5.1/trepl/init.lua:661: in function 'repl'
...oads/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:204: in main chunk
[C]: at 0x00405b60
[0.0062s]
th> require 'nn'
{
VolumetricMaxUnpooling : {...}
[...]
SpatialFractionalMaxPooling : {...}
}
[0.1741s]
th> paths.require('nn')
{
VolumetricMaxUnpooling : {...}
[...]
SpatialFractionalMaxPooling : {...}
}
How far along is this issue now?
I have a similar issue, but with a different error message.
$ luajit
LuaJIT 2.1.0-beta1 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/
_____ _
|_ _| | |
| | ___ _ __ ___| |__
| |/ _ \| '__/ __| '_ \
| | (_) | | | (__| | | |
\_/\___/|_| \___|_| |_|
JIT: ON SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
th> require 'cutorch'
...s/rluo/rluo/torch/install/share/lua/5.1/torch/Tensor.lua:104: bad argument #1 to 'rawset' (table expected, got nil)
stack traceback:
[C]: in function 'rawset'
...s/rluo/rluo/torch/install/share/lua/5.1/torch/Tensor.lua:104: in main chunk
[C]: in function 'require'
...nfs/rluo/rluo/torch/install/share/lua/5.1/torch/init.lua:155: in main chunk
[C]: in function 'require'
...s/rluo/rluo/torch/install/share/lua/5.1/cutorch/init.lua:1: in main chunk
[C]: in function 'require'
stdin:1: in main chunk
[C]: at 0x004064f0
@ruotianluo what OS? Ubuntu? Debian?
@soumith CentOS Linux release 7.2.1511 (Core)
@soumith So what is actually causing this problem? (BTW, I ran into it after trying to update to the latest torch, cutorch and cunn; I also tried a new install.)
I got to this thread while searching for a solution to this very issue. I am getting the same error after updating my torch, nn, cunn, cudnn and cutorch libs.
______ __ | Torch7
/_ __/__ ________/ / | Scientific computing for Lua.
/ / / _ \/ __/ __/ _ \ | Type ? for help
/_/ \___/_/ \__/_//_/ | https://github.com/torch
| http://torch.ch
th> require 'cunn'
.../ameya.prabhu/torch/install/share/lua/5.1/trepl/init.lua:389: .../ameya.prabhu/torch/install/share/lua/5.1/trepl/init.lua:389: attempt to index a string value
stack traceback:
[C]: in function 'error'
.../ameya.prabhu/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
[string "_RESULT={require 'cunn'}"]:1: in main chunk
[C]: in function 'xpcall'
.../ameya.prabhu/torch/install/share/lua/5.1/trepl/init.lua:661: in function 'repl'
...abhu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:204: in main chunk
[C]: at 0x00406670
[0.1723s]
th> exit
Do you really want to exit ([y]/n)? y
ameya.prabhu@magnetar:~/MulLowBiVQA$ uname -a
Linux magnetar 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
ameya.prabhu@magnetar:~/MulLowBiVQA$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
ameya.prabhu@magnetar:~/MulLowBiVQA$
This is so frustrating: I am not able to reproduce this issue anywhere. If anyone can give me ssh access to a machine where this reproduces, I can take a look.
I can give you ssh access to my server. What's strange is that those commands run just fine on my personal desktop. The only major difference I know of is the CUDA version: I have 8 on my personal desktop and 7.5 on the server. Is it occurring on servers with CUDA 7.5? I don't know the details here, I'm afraid, but the errors seem to occur only when I try to load a CUDA-based library.
Okay, can you email me at [redacted] so we can figure out the ssh access details? No, it is not CUDA 7.5; I've already tested this.
@DrImpossible I'm in almost the same situation, but my desktop also has CUDA 7.5.
@soumith Any progress?
Until I get a reproduction, I don't know how to fix it. Any public ssh access (so that I can log in) to a machine that has this problem would be helpful.
@soumith Using binary search over the commit history, I found that the error doesn't appear if I roll all the repositories back to before 12.28, and that it does appear if I roll them back to around 12.30.
Then I tried to find exactly which commit in which package causes the error. It turns out that if I check out cutorch at commit https://github.com/torch/cutorch/commit/1ac06689dba1a4a672ed1fb3c3117000a46d7af5, I get the error. (I haven't checked the other packages.)
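The manual binary search described above can be automated with git bisect. This is only a minimal sketch, assuming you are inside a cutorch checkout and already know one good and one bad commit. The helper name bisect_with_check and the check command in the usage comment are hypothetical and were not used in this thread.

```shell
# Hypothetical helper (not from this thread): bisect the current git
# repository between a known-good and a known-bad ref.
# check_cmd must exit 0 on a good commit and non-zero (1-127, except 125)
# on a bad one.
bisect_with_check() {
  good_ref=$1
  bad_ref=$2
  check_cmd=$3
  # `git bisect start <bad> <good>` marks both endpoints in one step.
  git bisect start "$bad_ref" "$good_ref" >/dev/null || return 1
  # `git bisect run` checks out each candidate commit and runs the command;
  # when it finishes it reports "<sha> is the first bad commit".
  git bisect run sh -c "$check_cmd"
  status=$?
  # Put the working tree back where it was before bisecting.
  git bisect reset >/dev/null 2>&1
  return "$status"
}

# Example (assumed check: rebuild cutorch, then try to load it):
# bisect_with_check <good-commit> <bad-commit> \
#   'luarocks make rocks/cutorch-scm-1.rockspec && luajit -e "require(\"cutorch\")"'
```

Rebuilding at every step is slow, but it keeps each test honest, since a stale build from a previous step would give a misleading result.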
thanks for bisecting it. cc: @gchanan something broke on your commit.
Great! Since I can't reproduce the issue, @ruotianluo can you revert the changes to init.lua and Tensor.lua from that commit separately and tell me if either (or both) fixes the issue?
@gchanan Reverting either one, or both, doesn't fix the issue.
@ruotianluo okay, let me prepare a few other commits for you to try out. Thanks for helping track this down!
@ruotianluo can you run "nvcc --version" -- what version does it say you are running?
@ruotianluo can you try the following branches and tell me if any of them work? (They are all single commits off the commit you identified.)
https://github.com/gchanan/cutorch/tree/torchgenericstorage
https://github.com/gchanan/cutorch/tree/genericstorage
https://github.com/gchanan/cutorch/tree/genericstoragetensor
I should point out that these branches are just for testing "require 'cutorch'" -- functionality beyond that is expected to be broken.
@gchanan
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
torchgenericstorage and genericstorage don't work (same error as before). genericstoragetensor fails with a different error:
/torch/install/share/lua/5.1/cutorch/init.lua:19: attempt to index field 'HalfStorage' (a nil value)
Can you try genericstoragetensor with init.lua and Tensor.lua rolled back as before?
It works.
hmm, I'm still not sure what's going on here -- thanks for your continuing help.
Can you try https://github.com/gchanan/cutorch/tree/thchalfh ? (It shouldn't matter what you do with init.lua and Tensor.lua.)
This doesn't work.
None of these works.
Something very strange is going on...like the symbol generation is getting mixed up between torch and cutorch.
Can you try https://github.com/gchanan/cutorch/tree/generateStorageTH?
Doesn't work either.
@ruotianluo I sent you an e-mail, it would probably be more productive if we were able to find a time that works for both of us to sit in the torch gitter and debug in real time.
In any case, can you do the following? Confirm this works: https://github.com/gchanan/cutorch/tree/genericstoragetensor (this is the same as the genericstoragetensor with the lua changes rolled back)
Then try:
https://github.com/gchanan/cutorch/tree/genericstoragetensor_gen
https://github.com/gchanan/cutorch/tree/genericstoragetensor_genseparate
https://github.com/gchanan/cutorch/tree/genericstoragetensor_genseparateHalf
Only genericstoragetensor_genseparate works.
Here is my confession about the cause of the problem.
It turns out there was another, older torch installation on my system.
In my case, at some point I had installed torch using luarocks install torch --local. Since LUA_PATH puts the local folder first, th loads the libraries from the local folder.
So check whether you have an old torch installed somewhere on your LUA_PATH, @Cadene @DrImpossible; it could be the same cause.
And thanks to gchanan for his help.
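To check for the kind of shadowing described above, one option is to expand every LUA_PATH template for a package and see how many of the resulting files exist. This is a rough sketch, not an official torch or luarocks tool. The function name find_lua_package_copies is made up for illustration, and it assumes a plain module name such as torch (no dots) and paths without unusual characters.

```shell
# Hypothetical helper: print, in search order, every existing file that a
# LUA_PATH-style string would resolve for a given package name.
# More than one hit means an earlier copy shadows the later ones.
find_lua_package_copies() {
  pkg=$1
  search_path=${2-${LUA_PATH-}}
  # LUA_PATH entries are ';'-separated templates where '?' stands for the
  # module name, e.g. /home/me/.luarocks/share/lua/5.1/?/init.lua
  printf '%s\n' "$search_path" | tr ';' '\n' | while IFS= read -r template; do
    # Skip empty entries (';;' means "default path" to Lua).
    if [ -z "$template" ]; then
      continue
    fi
    candidate=$(printf '%s' "$template" | sed "s|?|$pkg|g")
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
    fi
  done
}

# Example: find_lua_package_copies torch
# Native libraries are searched via LUA_CPATH instead, so it can be worth
# running: find_lua_package_copies libcutorch "$LUA_CPATH"
```

If the first line printed is under ~/.luarocks rather than the torch install directory, that local copy is the one th will load.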
I tried cleaning up the things mentioned above and ran into a lot more, so I can't pinpoint the problem precisely, but it was more or less an old torch installation. Path problems compounded the issue too. It works fine now. The above comment really helped. Thanks @ruotianluo
@ruotianluo I think I have a similar problem: at some point I installed a version of torch that didn't work. I've tried again and got this far, but am getting these errors. However, I'm not sure what my LUA_PATH should be, or where it's set! Any pointers? Currently I get:
-bash: /Users/phil/.luarocks/share/lua/5.1/?.lua;/Users/phil/.luarocks/share/lua/5.1/?/init.lua;/Users/phil/torch/install/share/lua/5.1/?.lua;/Users/phil/torch/install/share/lua/5.1/?/init.lua;./?.lua;/Users/phil/torch/install/share/luajit-2.1.0-beta1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/?/init.lua: No such file or directory
I can't work out what shouldn't be there... the couple of bits I've tried deleting just result in the same or different errors...
@philgyford don't change your LUA_PATH; just delete your other versions.
@ruotianluo Thanks, but it was a while ago and I don't know exactly what was installed where...
@philgyford then I guess you need to search through the directories on your LUA_PATH to see which one it's in. Just to make sure: you did at least reinstall the latest torch somewhere, right?
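For searching through LUA_PATH, it helps to see the entries one per line in the order Lua tries them. Note that typing $LUA_PATH on its own makes bash try to execute the value as a command, which produces a "No such file or directory" error; the value has to be printed instead. A small sketch follows; the function name list_lua_path_entries is hypothetical.

```shell
# Hypothetical helper: print the entries of a LUA_PATH-style string,
# numbered in the order Lua searches them. Defaults to $LUA_PATH.
list_lua_path_entries() {
  search_path=${1-${LUA_PATH-}}
  # Entries are separated by ';'. Empty entries are dropped from the listing.
  printf '%s\n' "$search_path" | tr ';' '\n' |
    awk 'NF { n++; printf "%d %s\n", n, $0 }'
}

# Example:
# list_lua_path_entries                 # uses $LUA_PATH
# list_lua_path_entries "$LUA_CPATH"    # same idea for native .so modules
```

Entries near the top win, so an old torch under an early directory is loaded in preference to a fresh install listed further down.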
I get the same error after a clean torch install.