josemarcosrf opened this issue 8 years ago
Ok. That's a pretty comprehensive bug-report. Thanks! :-)
question: which version of python are you using, ie 2.7, 3.4 or 3.5?
I guess python 2.7, right? Because of this line:
print (cl.v).asNumpyTensor()
(python 3 would be:
print (cl.v.asNumpyTensor())
)
You are right, Python 2.7
So, using python 2, I get the following output:
<TorchClass>.__init initialization
.set()
<TorchClass>.write This is the value:
1
2
3
4
5
6
7
8
9
10
[torch.FloatTensor of size 10]
.set()
<TorchClass>.write This is the value:
2
4
6
8
10
12
14
16
18
20
[torch.FloatTensor of size 10]
.set()
<TorchClass>.write This is the value:
3
6
9
12
15
18
21
24
27
30
[torch.FloatTensor of size 10]
cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Leaving aside the output of cl.v for now, it looks like the outputs from the writes are correct?
Your output from 'write' is right, I get something looking like this:
Ok. Some difference in our systems or installation. Let's compare systems and so on first... mine is:
git log -n 5 --oneline
39bd2b1 update travis distro to 12.1
c8088bf updated readme with link to cifar.pytorch
7f7be66 Merge branch 'traceback'
31b0a0c update readme for stacktrace
65ffef8 improve stack trace a bit
You?
(hmmm.... I'm actually using a different version of torch too... let me double-check on a different torch distro...)
(using mainstream torch distro, same results as my earlier results above)
Mine is quite different.
I'll try on an Ubuntu 14.04 in a while too
ok. biggest difference here is: "Mac OS X". Why? Because on Mac, pytorch uses lua, and on linux it uses luajit. I will test using lua, I bet I can reproduce the problem like that.
Ohh ok, I noticed that.... Thanks for your quick responses and effort!
I'll let you know if I can reproduce the problem in the ubuntu machine.
(the switch is here by the way, just in case you were curious: https://github.com/hughperkins/pytorch/blob/master/src/nnWrapper.jinja2.cpp#L56-L60

```
#ifdef USE_LUAJIT
#define LUALIBNAME "libluajit"
#else
#define LUALIBNAME "libPyTorchLua"
#endif
```
)
Looks like I should rebuild to switch. I'll do that. (though it's tempting to change it to use an environment variable, to switch at runtime, perhaps???)
(on linux, can switch at build time like this:
USE_LUAJIT=OFF ./build.sh
)
Nice, I'll try both builds, one with luajit and the other with lua.
hmmm, on linux, with lua, results still ok.
After building with config set to use lua, I run like this:
python -i main.py
... which shows the correct results
... and then verify in /proc that it's not using luajit:
sudo bash
cd /proc/123
grep luajit maps
# nothing
grep liblua maps
# bunch of stuff
Your numpy is older than mine. Any reason for using an older numpy? I can downgrade my numpy if necessary. I have to go now. Will look again tomorrow...
No reason for that numpy, I'll update mine and repeat the experiments.
I tried on an Ubuntu machine with numpy version 1.10 and it works properly. On Mac OS X with numpy 1.11 it's still going wrong.... :(
Hmmm. That makes debugging challenging :-P. Can you double-check that you are indeed linking with lua, and not with luajit, on Mac? For example, on mac, maybe build like:
USE_LUAJIT=OFF ./build.sh
... and see if this changes anything?
(you can also remove libluajit.[something] from ~/torch/install/lib too, and check stuff still runs. If it doesn't, you're probably linking to libluajit, and that has memory issues on Mac:
If you're building a 64 bit application on OSX which links directly or indirectly against LuaJIT, you need to link your main executable with these flags:
-pagezero_size 10000 -image_base 100000000
Also, it's recommended to rebase all (self-compiled) shared libraries which are loaded at runtime on OSX/x64 (e.g. C extension modules for Lua). See: man rebase
)
Building with
USE_LUAJIT=OFF ./build.sh
And running the example code again gives good results: no problems, and I get the expected output for all the calls to "write".
This means the problem is with luajit and the mem issues on Mac?
Pulling the value back out of the Torch class is still not working as expected though.
print (cl.v).asNumpyTensor()
at the end of the python main.py file of the example code still gives a list of only zeros.
> This means the problem is with luajit and the mem issues on Mac?
Yes. It was never intended that luajit should be used on Mac. The bug is that the following line, in build.sh, is not detecting that you are using Mac:
if [[ $(uname -s) == 'Darwin' ]]; then { USE_LUAJIT=OFF; } fi
Can you type uname -s and report the result please? And then I can update this line.
> print (cl.v).asNumpyTensor() at the end of the python main.py file of the example code still gives a list of only zeros.
Yes, agreed. Will look into this. Can you post the answer to uname -s, to keep this thread in my notifications list, so it doesn't fall into the abyss :-P
Sorry for the late response,
Surprisingly, the output of uname -s is Darwin....
Not sure what could have happened...
Oh, I never fixed the (cl.v).asNumpyTensor() bit. Basically things fall off my notifications if I open them, and then they disappear into the abyss :-P Can you post something once every 24 hours, till I remember please? :-D
For the uname -s bit, can you save the following as /tmp/test.sh, and run it please:
#!/bin/bash
if [[ $(uname -s) == 'Darwin' ]]; then { echo Darwin detected; } else { echo no Darwin found; } fi
Run it like:
chmod +x /tmp/test.sh
/tmp/test.sh
well... for the memory thing, yeah it is a memory issue, but I'm not quite sure what a good solution would be. We can wipe the array to zeros by doing:
from __future__ import print_function
import numpy as np
import PyTorch
import PyTorchHelpers
TorchClass = PyTorchHelpers.load_lua_class('TorchClass.lua', 'TorchClass')
cl = TorchClass()
v = [1,2,3,4,5,6,7,8,9,10]
npv = np.asarray(v, dtype=np.float32)
cl.set(npv)
#cl.write()
npv2 = npv * 2
cl.set(npv2)
#cl.write()
cl.set(npv * 3)
#cl.write()
print('cl.v', cl.v)
print(np.zeros(10, dtype=np.float32))
print('cl.v', cl.v)
output:
<TorchClass>.__init initialization
.set()
<TorchClass>.write This is the value:
1
2
3
4
5
6
7
8
9
10
[torch.FloatTensor of size 10]
.set()
.set()
cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
cl.v 0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10]
(env2) ubuntu@peach:~/Downloads/pytorch-10$ python main.py
<TorchClass>.__init initialization
.set()
.set()
.set()
cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
cl.v 0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10]
so, we basically allocate a new numpy array of zeros, and that wipes the existing array.
probably the torch tensors should add a reference to the passed-in numpy arrays somehow, so they don't get freed/reused.
I guess the np array should be stored in the python torch tensor object somewhere, probably, to hold a reference. However I'm not quite sure how to do this for now...
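In plain Python terms, the pattern I'm after is basically just this (an illustration only, with toy names, not the actual Cython code):

```
import numpy as np

class TensorView:
    """Toy stand-in for a tensor that shares a numpy array's memory."""
    def __init__(self, nparray):
        self.nparray = nparray            # hold a reference, so the buffer can't be freed/reused
        self.data = nparray.reshape(-1)   # a view onto the same memory

a = np.arange(10, dtype=np.float32)
view = TensorView(a * 2)   # the temporary (a * 2) stays alive because `view` references it
del a
print(view.data)           # still valid: [ 0.  2.  4. ... 18.]
```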
What I've tried:
In PyTorch.jinja2.pyx, in _as{{Real}}Tensor(myarray), try to store the passed-in np array:
{% if Real in ['Float', 'Double', 'Byte'] %}
def _as{{Real}}Tensor(myarray):
    cdef {{real}}[:] myarraymv
    cdef Storage._{{Real}}Storage storage
    cdef _{{Real}}Tensor tensor
    if str(type(myarray)) in ["<type 'numpy.ndarray'>", "<class 'numpy.ndarray'>"]:
        dims = len(myarray.shape)
        if dims >= 1:
            totalSize = 1
            size = Storage._LongStorage.newWithSize(dims)
            stride = Storage._LongStorage.newWithSize(dims)
            strideSoFar = 1
            for d in range(dims - 1, -1, -1):
                totalSize *= myarray.shape[d]
                size[d] = myarray.shape[d]
                stride[d] = strideSoFar
                strideSoFar *= size[d]
            myarraymv = myarray.reshape(totalSize)
            storage = Storage._{{Real}}Storage.newWithData(myarraymv)
            Storage.TH{{Real}}Storage_retain(storage.native)  # since newWithData takes ownership
            tensor = _{{Real}}Tensor.newWithStorage(storage, 0, size, stride)
            tensor.nparray = myarray
            return tensor
This complained that nparray is not in tensor, so I modified Tensor.jinja2.pxd, to add this, as an object:
cdef class _{{Real}}Tensor(object):
    cdef TH{{Real}}Tensor *native
    cdef object nparray
This then builds and runs (with a very aggressive clean.sh):
#!/bin/bash
export TORCH_INSTALL=$(dirname $(dirname $(which luajit) 2>/dev/null) 2>/dev/null)
rm -Rf build PyBuild.so dist *.egg-info cbuild src/*.so src/PyTorch.egg-info ${TORCH_INSTALL}/lib/libPyTorch*
pip uninstall -y PyTorch
rm src/Storage.cpp src/Storage.pxd src/Storage.pyx src/PyTorch.pxd src/PyTorch.pyx src/PyTorch.cpp
rm src/nnWrapper.cpp src/nnWrapper.h src/lua.pxd src/lua.pyx
However, it seems not to affect the lifetime: the above test code still prints zeros.
I also tried putting this into the Tensor.jinja2.pyx:
            tensor = _{{Real}}Tensor.newWithStorage(storage, 0, size, stride)
            tensor.nparray = myarray
            __Pyx_INCREF(myarray)
            return tensor
... however this fails at runtime, with __Pyx_INCREF not being known.
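(Side note: if an explicit incref were ever genuinely needed, the public spelling in Cython is a cimport from cpython.ref, rather than the internal __Pyx_INCREF macro, which only exists in the generated C. That said, assigning to a cdef object attribute should already bump the refcount, so the explicit call is probably not the missing piece anyway. A minimal toy sketch, not the actual template code:

```
from cpython.ref cimport Py_INCREF, Py_DECREF  # public refcount API, if ever needed explicitly

cdef class TensorHolder:
    cdef object nparray   # Python-object attribute: Cython increfs/decrefs it for us

    def __init__(self, myarray):
        # this assignment alone keeps myarray (and its buffer) alive
        # for as long as the holder object lives
        self.nparray = myarray
```
)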
Thoughts?
Can I leave you to look at what options we can use for incrementing the reference? I'm trying to get a paper on cltorch published out to arxiv (perhaps 'dumped' onto arxiv is a better term). And there is only one of me :-D
After running the shell script I got Darwin detected. Anyway, it's not a big deal, as building with luajit disabled (USE_LUAJIT=OFF ./build.sh) worked perfectly.
Regarding the memory issue when pulling from torch to python, this is interesting actually: allocating a new numpy array wipes the tensor... Sure, I'll take a look as soon as possible at the above-mentioned snippets you wrote and let you know if I can get somewhere.
Thanks once more for the help and time! Good luck with that paper!
Note to self: so that I have the code if I get a moment to check this issue, it is here: https://gist.github.com/hughperkins/fdf8e27daec983e69767ef6bbaf1db9f
Current output:
<TorchClass> initialization
before allocating a simple array of zeros: cl.v
0 3 6 9 12 15 18 21 24 27
[torch.FloatTensor of size 10]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
after: cl.v
0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10]
Note to self: relevant similar issue, and commit/solution:
Note to self: the draft fix from earlier, that doesn't work... : https://github.com/hughperkins/pytorch/compare/trying-fix-inline-issue?expand=1
Unit test (failing) added in c76846d
Added a bunch of debugging in ea08d23
Output: https://gist.github.com/hughperkins/98243072eafe24b8b9f919a306fe9d07
Seems like the numpy array inside the PyTorchTensor is having its lifetime linked to that of the parent PyTorchTensor correctly, but the PyTorchTensor itself is being destroyed, i.e. it is not being incref'd based on it now being owned by the underlying lua object.
So, it looks like pushObject:

https://github.com/hughperkins/pytorch/blob/01a8759645ae943fad5f63b2ec282fc84377d5d7/src/PyTorchAug.py#L113-L124

    if typestring in ["<class 'numpy.ndarray'>", "<type 'numpy.ndarray'>"]:
        dtypestr = str(something.dtype)
        if dtypestr == 'float32':
            pushSomething(lua, PyTorch._asFloatTensor(something))
            return
        if dtypestr == 'float64':
            pushSomething(lua, PyTorch._asDoubleTensor(something))
            return
        if dtypestr == 'uint8':
            pushSomething(lua, PyTorch._asByteTensor(something))
            return
        raise Exception('pushing numpy array with elements of type ' + dtypestr + ' it not currently implemented')
(via:

    if type(something) in luaClassesReverse:
        pushObject(lua, something)
        return

)
pushObject then simply registers a python-side integer in the lua registry:

https://github.com/hughperkins/pytorch/blob/01a8759645ae943fad5f63b2ec282fc84377d5d7/src/PyTorchAug.py#L113-L124

    def pushObject(lua, myobject):
        lua.pushNumber(myobject.__objectId)
        lua.getRegistry()
Obviously, simply storing an integer in the lua registry won't in itself lock the lifetime of the PyTorchTensor python object to any lua objects...
This is kind of non-trivial to fix (we'd need to start thinking about gc and stuff probably, if we start holding a bunch of python-side references to python objects that have been passed lua-side). So, since most stuff works without it, I think I shall leave it for now.
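(If I do come back to it, the rough shape would probably be a python-side registry that pins objects which have been pushed lua-side, released again from a lua-side __gc hook. A sketch with made-up helper names, not actual code:

```
# hypothetical helpers, mirroring how pushObject uses __objectId
_pushedObjects = {}  # objectId -> python object

def retainForLua(myobject):
    # would be called from pushObject(), so lua "owning" the id also
    # keeps the python object alive
    _pushedObjects[myobject.__objectId] = myobject

def releaseFromLua(objectId):
    # would need to be triggered from a lua-side __gc metamethod, so the
    # python object is only dropped once lua no longer references it
    _pushedObjects.pop(objectId, None)
```
)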
Hi, first of all thanks for pytorch, it's awesome work and really useful!
I just found that when doing inline operations (i.e. on a numpy array) and passing the result to a method on the Torch side to, let's say, store it as a class value, something goes wrong. I guess this has to do with memory references while passing data from one side to the other.
I made some simple example code reproducing the issue. I have a python main.py file that imports, creates and uses a Torch class. The Torch class has an 'init' method, which does nothing, a 'set' method, which sets a tensor of the class to a given numpy array, and a 'write' method that prints the mentioned tensor.
The main.py file just creates the class and sets/prints the torch tensor inside the Torch class: first by creating a numpy array, modifying it and storing the result in a separate variable, and then by modifying the numpy array inline while passing it to torch.
The problem is that, when passing the numpy array modified inline in the call, the values the Torch class stores are not right. The only way to avoid it is to store the modified array in a separate variable before passing it.
Running the example will be way more clarifying than my explanation.
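Roughly, the python side of the example looks like this (a simplified sketch, not the exact attached files; the class loading and method names are the same ones used in the snippets elsewhere in this thread):

```
import numpy as np
import PyTorchHelpers

TorchClass = PyTorchHelpers.load_lua_class('TorchClass.lua', 'TorchClass')
cl = TorchClass()

v = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.float32)

# case 1: store the modified array in a separate variable first -- works fine
v2 = v * 2
cl.set(v2)
cl.write()

# case 2: modify the array inline in the call -- the values stored on the torch side come out wrong
cl.set(v * 3)
cl.write()
```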
Not sure if this is supposed to work this way; I just thought that, in case you were not aware of it, this might help you and others.
I attach the two mentioned pieces of code in a zip file.
code.zip