hughperkins / pytorch

Python wrappers for torch and lua
BSD 2-Clause "Simplified" License

Inline operations while calling a torch class #10

Open josemarcosrf opened 8 years ago

josemarcosrf commented 8 years ago

Hi, First of all thanks for pytorch, it's awesome work and really useful!

I just found that when doing an inline operation (e.g. on a numpy array) and passing the result to a Torch method that, let's say, stores it as a class value, something goes wrong. I guess this has to do with memory references while passing data from one side to the other.

I made a simple code example reproducing the issue. I have a python main.py file that imports, creates and uses a Torch class. The Torch class has an 'init' method, which does nothing; a 'set' method, which sets a tensor of the class to a given numpy array; and a 'write' method that prints the mentioned tensor.

The main.py file only creates the class and sets/prints the torch tensor inside the Torch class: first by creating a numpy array, modifying it, and storing the result in a separate variable; and then by modifying the numpy array inline while passing it to torch.

The problem is that, when passing the numpy array created inline in the call, the values the Torch class stores are not right. The ways to avoid this are:

  1. Avoiding inline operations
  2. Explicitly cloning the tensor inside the Torch class (that's why I think it is a memory-reference issue)

Running the example will be way more clarifying than my explanation.
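
A condensed sketch of what the attached code does (the exact code is in the zip; this just assumes a TorchClass.lua with the set/write methods described above):

import numpy as np
import PyTorchHelpers

TorchClass = PyTorchHelpers.load_lua_class('TorchClass.lua', 'TorchClass')
cl = TorchClass()

npv = np.asarray([1, 2, 3, 4, 5], dtype=np.float32)

npv2 = npv * 2
cl.set(npv2)     # ok: npv2 keeps the array alive on the python side
cl.write()       # prints the expected values

cl.set(npv * 3)  # inline operation: nothing python-side holds the temporary
cl.write()       # prints wrong values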

Not sure if this is supposed to work this way; I just thought that, in case you are not aware, this might help you and others.

I attach the two mentioned pieces of code in a zip file.

code.zip

hughperkins commented 8 years ago

Ok. That's a pretty comprehensive bug-report. Thanks! :-)

question: which version of python are you using, i.e. 2.7, 3.4 or 3.5?

hughperkins commented 8 years ago

I guess python 2.7, right? Because of this line:

print (cl.v).asNumpyTensor()

(python 3 would be:

print (cl.v.asNumpyTensor())

)

josemarcosrf commented 8 years ago

You are right, Python 2.7

hughperkins commented 8 years ago

So, using python 2, I get the following output:

<TorchClass>.__init initialization
.set()
<TorchClass>.write This is the value:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
[torch.FloatTensor of size 10]

.set()
<TorchClass>.write This is the value:
  2
  4
  6
  8
 10
 12
 14
 16
 18
 20
[torch.FloatTensor of size 10]

.set()
<TorchClass>.write This is the value:
  3
  6
  9
 12
 15
 18
 21
 24
 27
 30
[torch.FloatTensor of size 10]

cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Leaving aside the output of cl.v for now, it looks like the outputs from the writes are correct?

josemarcosrf commented 8 years ago

Your output from 'write' is right; I get something looking like this:

initialization
This is the value:
 1 2 3 4 5 6 7 8 9 10
[torch.FloatTensor of size 10]
This is the value:
 2 4 6 8 10 12 14 16 18 20
[torch.FloatTensor of size 10]
This is the value:
 0.0000e+00 -1.5846e+29 4.3514e-27 -2.0005e+00 1.5000e+01 1.8000e+01 2.1000e+01 2.4000e+01 2.7000e+01 3.0000e+01
[torch.FloatTensor of size 10]
[ 0.00000000e+00 -1.58456325e+29 4.35138757e-27 -2.00048709e+00 1.50000038e+01 1.80000000e+01 2.10000000e+01 2.40000000e+01 2.70000000e+01 3.00000000e+01]

As you can see, the values after setting the tensor from an inline operation are wrong, at least some of them.
hughperkins commented 8 years ago

Ok, so there's some difference in our systems or installations. Let's compare systems and so on first... mine is:

git log -n 5 --oneline
39bd2b1 update travis distro to 12.1
c8088bf updated readme with link to cifar.pytorch
7f7be66 Merge branch 'traceback'
31b0a0c update readme for stacktrace
65ffef8 improve stack trace a bit

You?

hughperkins commented 8 years ago

(hmmm.... I'm actually using a different version of torch too... let me double-check on a different torch distro...)

hughperkins commented 8 years ago

(using the mainstream torch distro, same results as above)

josemarcosrf commented 8 years ago

Mine is quite different.

josemarcosrf commented 8 years ago

I'll try on an Ubuntu 14.04 machine in a while too.

hughperkins commented 8 years ago

ok. The biggest difference here is: "Mac OS X". Why does that matter? Because on Mac, pytorch uses lua, and on linux it uses luajit. I will test using lua; I bet I can reproduce the problem like that.

josemarcosrf commented 8 years ago

Ohh ok, I noticed that.... Thanks for your quick responses and effort!

I'll let you know if I can reproduce the problem in the ubuntu machine.

hughperkins commented 8 years ago

(the switch is here by the way, just in case you were curious: https://github.com/hughperkins/pytorch/blob/master/src/nnWrapper.jinja2.cpp#L56-L60

    #ifdef USE_LUAJIT
        #define LUALIBNAME "libluajit"
    #else
        #define LUALIBNAME "libPyTorchLua"
    #endif

)

Looks like I should rebuild to switch. I'll do that. (though it's tempting to change it to use an environment variable, to switch at runtime, perhaps???)
hughperkins commented 8 years ago

(on linux, can switch at build time like this:

USE_LUAJIT=OFF ./build.sh 

)

josemarcosrf commented 8 years ago

Nice, I'll try both builds, one with luajit and the other with lua.

hughperkins commented 8 years ago

hmmm, on linux, with lua, results still ok.

After building with config set to use lua, I run like this:

python -i main.py

... which shows the correct results

... and then verify in /proc that it is not using luajit (123 below standing in for the pid of the python process):

sudo bash
cd /proc/123
grep luajit maps
# nothing
grep liblua maps
# bunch of stuff
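
(equivalently, as a quick sketch, the same check can be done from inside the python process itself, by reading its own maps file; linux-only:

with open('/proc/self/maps') as f:
    maps = f.read()
print('luajit mapped:', 'luajit' in maps)                # should be False
print('libPyTorchLua mapped:', 'libPyTorchLua' in maps)  # should be True

)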
hughperkins commented 8 years ago

Your numpy is older than mine. Any reason for using an older numpy? I can downgrade my numpy if necessary. I have to go now. Will look again tomorrow...

josemarcosrf commented 8 years ago

No reason for that numpy, I'll update mine and repeat the experiments.

josemarcosrf commented 8 years ago

I tried on an Ubuntu machine with numpy version 1.10 and it works properly. On Mac OS X with numpy 1.11 it still goes wrong.... :(

hughperkins commented 8 years ago

Hmmm. That makes debugging challenging :-P. Can you double-check that you are indeed linking with lua and not with luajit on Mac? For example, on mac, maybe build like:

USE_LUAJIT=OFF ./build.sh 

... and see if this changes anything?

hughperkins commented 8 years ago

(you can also remove libluajit.[something] from ~/torch/install/lib, and check stuff still runs. If it doesn't, you're probably linking to libluajit, and that has memory issues on Mac:

If you're building a 64 bit application on OSX which links directly or indirectly against LuaJIT, you need to link your main executable with these flags:

-pagezero_size 10000 -image_base 100000000

Also, it's recommended to rebase all (self-compiled) shared libraries which are loaded at runtime on OSX/x64 (e.g. C extension modules for Lua). See: man rebase

)

josemarcosrf commented 8 years ago

Building with USE_LUAJIT=OFF ./build.sh and running the example code again gives good results: no problems, and the expected output for all the calls to "write". This means the problem is with luajit and the memory issues on Mac?

The pop operation from the Torch class is still not working as expected, though: print (cl.v).asNumpyTensor() at the end of the python main.py file of the example code prints a list with only zeros.

hughperkins commented 8 years ago

This means the problem is with luajit and the memory issues on Mac?

Yes. It was never intended that luajit should be used on Mac. The bug is that the following line in build.sh is not detecting that you are using Mac:

if [[ $(uname -s) == 'Darwin' ]]; then { USE_LUAJIT=OFF; } fi

Can you type uname -s and report the result please? And then I can update this line.

print (cl.v).asNumpyTensor() at the end of the python main.py file of the example code prints a list with only zeros.

Yes, agreed. Will look into this. Can you post the answer to uname -s, to keep this thread in my notifications list, so it doesn't fall into the abyss? :-P

josemarcosrf commented 8 years ago

Sorry for the late response. Surprisingly, the output of uname -s is Darwin.... Not sure what could have happened...

hughperkins commented 8 years ago

Oh, I never fixed the (cl.v).asNumpyTensor() bit. Basically things fall off my notifications if I open them, and then they disappear into the abyss :-P Can you post something once every 24 hours, till I remember please? :-D

For the uname -s bit, can you save the following as /tmp/test.sh, and run it please:

#!/bin/bash

if [[ $(uname -s) == 'Darwin' ]]; then { echo Darwin detected; } else { echo no Darwin found; } fi

Run it like:

chmod +x /tmp/test.sh
/tmp/test.sh
hughperkins commented 8 years ago

well... for the memory thing, yeah, it is a memory issue, but I'm not quite sure what a good solution would be. We can wipe the array to zeros by doing:

from __future__ import print_function

import numpy as np

import PyTorch
import PyTorchHelpers

TorchClass = PyTorchHelpers.load_lua_class('TorchClass.lua', 'TorchClass')
cl = TorchClass()

v = [1,2,3,4,5,6,7,8,9,10]
npv = np.asarray(v, dtype=np.float32)

# case 1: pass a long-lived named array
cl.set(npv)
#cl.write()

# case 2: store the result of the operation in a named variable first
npv2 = npv * 2
cl.set(npv2)
#cl.write()

# case 3: pass the result of an inline operation; the temporary has no
# python-side reference once set() returns
cl.set(npv * 3)
#cl.write()

print('cl.v', cl.v)
print(np.zeros(10, dtype=np.float32))  # allocating fresh zeros reuses the freed buffer...
print('cl.v', cl.v)                    # ... and cl.v now reads as zeros

output:

<TorchClass>.__init initialization
.set()
<TorchClass>.write This is the value:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
[torch.FloatTensor of size 10]

.set()
.set()
cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
cl.v 0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10]

(env2) ubuntu@peach:~/Downloads/pytorch-10$ python main.py 
<TorchClass>.__init initialization
.set()
.set()
.set()
cl.v 3 6 9 12 15 18 21 24 27 30
[torch.FloatTensor of size 10]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
cl.v 0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10]

so, we basically allocate a new numpy array of zeros, and that wipes the existing array.

probably the torch tensors should add a reference to the passed-in numpy arrays somehow, so they don't get freed/reused.
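
Something like this, in plain python, just to illustrate the idea (Tensorish and _keepalive are hypothetical names, not actual pytorch code):

import numpy as np

class Tensorish(object):
    def __init__(self, nparray):
        # holding the array as an attribute keeps its refcount above zero for
        # the lifetime of the wrapper, so numpy can't free/reuse its buffer
        self._keepalive = nparray

npv = np.asarray([1, 2, 3], dtype=np.float32)
t = Tensorish(npv * 2)  # the inline temporary now lives as long as t does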

hughperkins commented 8 years ago

I guess the np array should probably be stored in the python torch tensor object somewhere, to hold a reference. However, I'm not quite sure how to do this for now...

What I've tried:

In PyTorch.jinja2.pyx, in _as{{Real}}Tensor(myarray):, I tried to store the passed-in np array:

{% if Real in ['Float', 'Double', 'Byte'] %}
def _as{{Real}}Tensor(myarray):
    cdef {{real}}[:] myarraymv
    cdef Storage._{{Real}}Storage storage
    cdef _{{Real}}Tensor tensor
    if str(type(myarray)) in ["<type 'numpy.ndarray'>", "<class 'numpy.ndarray'>"]:
        dims = len(myarray.shape)
        if dims >= 1:
            totalSize = 1
            size = Storage._LongStorage.newWithSize(dims)
            stride = Storage._LongStorage.newWithSize(dims)
            strideSoFar = 1
            for d in range(dims - 1, -1, -1):
                totalSize *= myarray.shape[d]
                size[d] = myarray.shape[d]
                stride[d] = strideSoFar
                strideSoFar *= size[d]
            myarraymv = myarray.reshape(totalSize)
            storage = Storage._{{Real}}Storage.newWithData(myarraymv)
            Storage.TH{{Real}}Storage_retain(storage.native) # since newWithData takes ownership

            tensor = _{{Real}}Tensor.newWithStorage(storage, 0, size, stride)
            tensor.nparray = myarray
            return tensor

This complained that nparray is not an attribute of tensor, so I modified Tensor.jinja2.pxd to add it, as an object:

cdef class _{{Real}}Tensor(object):
    cdef TH{{Real}}Tensor *native
    cdef object nparray

This then builds and runs, with a very aggressive clean.sh:

#!/bin/bash

export TORCH_INSTALL=$(dirname $(dirname $(which luajit) 2>/dev/null) 2>/dev/null)

rm -Rf build PyBuild.so dist *.egg-info cbuild src/*.so src/PyTorch.egg-info ${TORCH_INSTALL}/lib/libPyTorch*
pip uninstall -y PyTorch
rm src/Storage.cpp src/Storage.pxd src/Storage.pyx src/PyTorch.pxd src/PyTorch.pyx src/PyTorch.cpp
rm src/nnWrapper.cpp src/nnWrapper.h src/lua.pxd src/lua.pyx

However, it seems not to affect the lifetime: the above test code still prints zeros.

I also tried putting this into Tensor.jinja2.pyx:

            tensor = _{{Real}}Tensor.newWithStorage(storage, 0, size, stride)
            tensor.nparray = myarray
            __Pyx_INCREF(myarray)
            return tensor

... however this fails at runtime, with __Pyx_INCREF not being known.
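
(Side note: __Pyx_INCREF is a C-level macro, so presumably one would cimport Py_INCREF from cpython.ref instead. But at the python level an explicit incref shouldn't even be needed, since attribute assignment already holds a reference. A quick plain-python sanity check of that, outside the cython build:

import sys
import numpy as np

class Holder(object):
    pass

arr = np.zeros(10, dtype=np.float32)
print(sys.getrefcount(arr))  # baseline, e.g. 2: arr itself, plus the getrefcount argument

h = Holder()
h.nparray = arr              # plain attribute assignment increments the refcount
print(sys.getrefcount(arr))  # one higher than the baseline

So if tensor.nparray = myarray isn't keeping the array alive, perhaps the tensor object itself is being destroyed.)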

Thoughts?

hughperkins commented 8 years ago

Can I leave you to look at what options we can use for incrementing the reference? I'm trying to get a paper on cltorch published out to arxiv (perhaps 'dumped' onto arxiv is a better term). And there is only one of me :-D

josemarcosrf commented 8 years ago

After running the shell script I got Darwin detected. Anyway, it's not a big deal, as building with luajit disabled (USE_LUAJIT=OFF ./build.sh) worked perfectly.

Regarding the memory issue when pulling from torch to python: it is interesting, actually, that allocating a new numpy array wipes the tensor... Sure, I'll take a look at the above-mentioned snippets you wrote as soon as possible and let you know if I can get something.

Thanks once more for the help and time! Good luck with that paper!

hughperkins commented 8 years ago

Note to self: so that I have the code if I get a moment to check this issue, it is here: https://gist.github.com/hughperkins/fdf8e27daec983e69767ef6bbaf1db9f

Current output:

<TorchClass> initialization

before allocating a simple array of zeros: cl.v
 0 3 6 9 12 15 18 21 24 27
[torch.FloatTensor of size 10]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

after: cl.v
 0 0 0 0 0 0 0 0 0 0
[torch.FloatTensor of size 10]
hughperkins commented 8 years ago

Note to self: relevant similar issue, and commit/solution:

hughperkins commented 8 years ago

Note to self: the draft fix from earlier, that doesn't work...: https://github.com/hughperkins/pytorch/compare/trying-fix-inline-issue?expand=1

hughperkins commented 8 years ago

Unit test (failing) added in c76846d

hughperkins commented 8 years ago

Added a bunch of debugging in ea08d23

Output: https://gist.github.com/hughperkins/98243072eafe24b8b9f919a306fe9d07

hughperkins commented 8 years ago

Seems like the numpy array inside the PyTorchTensor is having its lifetime linked to that of the parent PyTorchTensor correctly, but the PyTorchTensor itself is being destroyed, i.e. it is not being incref'd based on it now being owned by the underlying lua object.

hughperkins commented 8 years ago

So, it looks like:

    if typestring in ["<class 'numpy.ndarray'>", "<type 'numpy.ndarray'>"]:
        dtypestr = str(something.dtype)
        if dtypestr == 'float32':
            pushSomething(lua, PyTorch._asFloatTensor(something))
            return
        if dtypestr == 'float64':
            pushSomething(lua, PyTorch._asDoubleTensor(something))
            return
        if dtypestr == 'uint8':
            pushSomething(lua, PyTorch._asByteTensor(something))
            return
        raise Exception('pushing numpy array with elements of type ' + dtypestr + ' is not currently implemented')

(via

    if type(something) in luaClassesReverse:
        pushObject(lua, something)
        return

)

def pushObject(lua, myobject):
    lua.pushNumber(myobject.__objectId)
    lua.getRegistry()

Obviously, simply storing an integer in the lua registry won't in itself lock the lifetime of the PyTorchTensor python object to any lua objects...
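
One possible direction, sketched in plain python with hypothetical names (mirroring the pushObject above, not actual pytorch code): keep a python-side dict from objectId to object, so the object can't be collected while lua still holds its id:

_pinned = {}  # objectId -> python object

def pushObjectPinned(lua, myobject):
    # hold a python-side reference for as long as lua knows about this id
    _pinned[myobject.__objectId] = myobject
    lua.pushNumber(myobject.__objectId)
    lua.getRegistry()

def releaseObject(objectId):
    # would need to be driven by a lua-side __gc hook when lua drops the id
    _pinned.pop(objectId, None)

... which is exactly where the gc worry below comes from: the entries have to be released when lua collects its side, or they leak.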

hughperkins commented 8 years ago

This is kind of non-trivial to fix (we'd probably need to start thinking about gc and stuff, if we start holding a bunch of python-side references to python objects that have been passed lua-side). So, since most stuff works without it, I think I shall leave it for now.