ibayer / fastFM

fastFM: A Library for Factorization Machines
http://ibayer.github.io/fastFM

cython-wrapper: modified setup.py to link against ffm2. Added cpp_ffm… #94

caos21 closed this 7 years ago

caos21 commented 7 years ago

It compiles, but I haven't verified it yet; I'm still thinking about how to do that. I got one warning, but it's not important.

caos21 commented 7 years ago

Even in that case, I don't follow the problem with Eigen. I will continue with this tomorrow morning.

ibayer commented 7 years ago

Thanks!

caos21 commented 7 years ago

It seems to be a memory problem... weird, because the C++ tests don't have these errors. I'm checking with valgrind: zero allocation errors for the C++ tests. On the other hand,

valgrind --log-file="vlog" nosetests

shows some errors. The tests also complete under valgrind, but with a mismatch:

AssertionError:
Arrays are not equal

(mismatch 100.0%)
 x: array([ 56.,   2.,   2.,   2.,   2.])
 y: array([ 298.,  266.,   29.,  298.,  848.])

----------------------------------------------------------------------
Ran 31 tests in 25.180s

FAILED (failures=1)  

Moreover, I modified the test:

def test_ffm2_predict_w0(): 
    w0, w, V, y, X = get_test_problem()
    w = 0*w
    V = 0*V
    y_pred = ffm2.ffm2_predict(w0, w, V, X)
    assert_equal(y_pred, w0)

because ffm2_predict expects array types for w and V; that's just a design issue that can be corrected afterwards.

ibayer commented 7 years ago

Can you post the valgrind errors? You managed to run the tests before, right? What has changed since then? Does the w0 test pass? If so, can you add a test for w only?

ibayer commented 7 years ago

It seems to be a memory problem... weird, because the C++ tests don't have these errors.

Well, now we pass preallocated (Python) memory to C++. I guess all kinds of nasty stuff can happen, like memory going out of scope etc. Maybe running it under valgrind covers up some dangling pointers or something similar. I would focus on locating and fixing the memory errors first, by simplifying the (Python) code and tests, and then try to get the correct results.
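To illustrate, here is a minimal sketch of the failure modes (presumably the wrapper views external buffers via Eigen::Map; the function name below is made up):

#include <Eigen/Dense>

// Eigen::Map views a buffer without copying it or taking ownership.
// If the Python side frees or reallocates that buffer while C++ still
// holds the map, every later access goes through a dangling pointer.
Eigen::Map<Eigen::VectorXd> view(double* data, Eigen::Index n) {
    return Eigen::Map<Eigen::VectorXd>(data, n);
}

// A map constructed with the wrong length is just as bad: reads and
// writes run past the real allocation, which is exactly the kind of
// heap corruption that valgrind flags and glibc aborts on.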

caos21 commented 7 years ago

The valgrind errors are somewhat lengthy; they are stored in the main directory of fastFM, with names starting with vlog.

None of the tests are working. I don't remember running these Python tests before; I started by checking whether importing the module worked.

I agree, memory management is critical. Look at this error when running nosetests:

.......*** Error in `/usr/bin/python': free(): invalid next size (fast): 0x0000000001c3da20 ***

It seems the problem is that we are freeing an unallocated address, an invalid delete...

The weird thing is that the C++ tests work without these memory problems...

caos21 commented 7 years ago

Where is GCC_MUSL_ROOT?

I found it.

ibayer commented 7 years ago

We should check if we have the same issues with the regular build (not musl gcc).

caos21 commented 7 years ago

I'm working on it, but I don't think it's the musl build.

caos21 commented 7 years ago

Do you think this topic is related?

caos21 commented 7 years ago

Now I think this is the relevant part: we use std::map

caos21 commented 7 years ago

Tried this.

caos21 commented 7 years ago

It is challenging, and I love that, but I'm running out of ideas. Maybe it's something sillier.

ibayer commented 7 years ago

We really need to narrow things down by commenting out every potential source of trouble: first reduce the use of the fastFM2 API to a minimum in the Python tests, then simplify the fastFM2 API code as well, commenting out the use of std::map etc. ...

ibayer commented 7 years ago

root@ubuntu-16:~/fastFM/fastFM (cython-wrapper-#10 *%)# pip install -e fastFM

I see that you have already changed a lot of things there (fastFM.cpp). How do you recompile? Does this mean we need to update the hunter path every time?

What's the smallest subset of predict(Model* m, Data* d) that still causes a memory error?

Does it still crash if you only keep:

void predict(Model* m, Data* d){
    Data::Impl* data = Internal::get_impl(d);
    Model::Impl* model = Internal::get_impl(m);

/*
    cd::impl::Predict(data->x_train,
                      model->coef_->getw3(),
                      model->coef_->getw2(),
                      model->coef_->getw1(),
                      model->coef_->getw0(),
                      data->y_pred);
*/
};

Assuming that

    cd::impl::Predict(data->x_train,
                      model->coef_->getw3(),
                      model->coef_->getw2(),
                      model->coef_->getw1(),
                      model->coef_->getw0(),
                      data->y_pred);

causes the crash, which argument causes it? Create local model and data instances and use them (this should not crash any more). Then replace the local instances step by step with the function arguments. Start with data->x_train, data->y_pred and model->coef_->getw0().
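A minimal sketch of that isolation step (the names are placeholders, not the real fastFM2 internals, and I'm assuming Eigen vectors underneath):

#include <Eigen/Dense>
#include <iostream>

// Run the same computation twice: once on memory Eigen owns (baseline),
// once on a map over the externally allocated buffer (suspect).
void isolate(double* extern_buf, Eigen::Index n) {
    Eigen::VectorXd owned = Eigen::VectorXd::Constant(n, 1.0);
    std::cout << owned.sum() << std::endl;   // local instance: should never crash

    Eigen::Map<Eigen::VectorXd> mapped(extern_buf, n);
    std::cout << mapped.sum() << std::endl;  // crash here => the mapping is at fault
}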

Hope this helps.

caos21 commented 7 years ago

Yes, I ran several tests; operations like res += x * w1; cause the trouble. It is either an Eigen memory-management problem or we are doing something wrong. I'm testing with custom vectors and matrices.

We don't need to update hunter as long as we don't change compilation flags. What I'm doing is:

root@ubuntu-16:~/fastFM/fastFM2 (api_add_fit *%)# cmake --build _regular/

root@ubuntu-16:~/fastFM (cython-wrapper-#10 *%)# pip uninstall fastFM --yes && make clean && make -j 4 && pip install . --user && nosetests

ibayer commented 7 years ago

Thanks for the commands, I was worried it might require more steps. I'll try out a few things too.

operations like res += x * w1; cause the trouble

That means if we use custom/dummy values for w1, w2, w3, it should work for w0 alone. Getting it to work for w0 would be a great way to exclude a range of other possible issues.

It is either an Eigen memory-management problem or we are doing something wrong

Using Eigen with externally allocated memory is more involved than I would have expected. Most of it is handled in fastfm_impl.h. The code looks somewhat brittle, and I wouldn't be surprised if the error comes from there. However, I don't know if / how to simplify the code, and as you have pointed out, the C++ tests have no memory issues (though I wouldn't put too much weight on that).

ibayer commented 7 years ago

 def test_ffm2_predict_w0():
     w0, w, V, y, X = get_test_problem()
-    w = 0
-    V = 0
+    w[:] = 0
+    V[:, :] = 0
     y_pred = ffm2.ffm_predict(w0, w, V, X)
     assert_equal(y_pred, w0)

Found the first nasty bug; now the test for w0 seems to pass. (w = 0 rebinds the Python name to a plain scalar instead of zeroing the array, so the wrong type was handed to the wrapper; w[:] = 0 writes zeros into the existing buffer.)

ibayer commented 7 years ago

rebuild cython wrapper

(cd ~/; rm fastFM/*.so; pip install -e fastFM)

rebuild fastFM2

(cd ~/fastFM/fastFM2/; cmake --build _builds)

caos21 commented 7 years ago

Really? It was that?

ibayer commented 7 years ago

Yes, but only for the simplest case w0. The other one still crashes.

caos21 commented 7 years ago

OK, thanks... did you do this on another branch/deployment/machine? I would like to reproduce the test on the server. Sometimes I get the error and sometimes not.

ibayer commented 7 years ago

I have launched an additional machine so that I don't interfere with your work. I can give you access in a few hours.

ibayer commented 7 years ago

Adding debug output.

void predict(Model* m, Data* d){
    Data::Impl* data = Internal::get_impl(d);
    Model::Impl* model = Internal::get_impl(m);

    std::cout << "---------w0----------" << std::endl;
    std::cout << model->coef_->getw0() << std::endl;
    std::cout << "----------w1---------" << std::endl;
    std::cout << model->coef_->getw1() << std::endl;
    std::cout << "-----------w2--------" << std::endl;
    std::cout << model->coef_->getw2() << std::endl;
    std::cout << "----------w3---------" << std::endl;
    std::cout << model->coef_->getw3() << std::endl;

    std::cout << "--------y-----------" << std::endl;
    std::cout << data->y_pred << std::endl;
    std::cout << "--------x-----------" << std::endl;
    std::cout << data->x_train << std::endl;

The sparse matrix mapping fails:

........hello from cpp
---------w0----------
2
----------w1---------
9
2
-----------w2--------
6 0
5 8
----------w3---------

--------y-----------
0
0
0
0
0
--------x-----------
python: /root/.hunter/_Base/033a6ff/14d0f80/75730c8/Install/include/eigen3/Eigen/src/Core/DenseCoeffsBase.h:408: operator[]: Assertion `index >= 0 && index < size()' failed.

Lucky guess:

+++ b/fastFM/ffm2.pyx
@@ -8,17 +8,18 @@ from libcpp.memory cimport nullptr
 cimport numpy as np
 import numpy as np

 def ffm_predict(double w_0, double[:] w,
                 np.ndarray[np.float64_t, ndim = 2] V, X):

     assert X.shape[1] == len(w)
     assert X.shape[1] == V.shape[1]

     # get attributes from csc scipy
     n_features = X.shape[1]
     n_samples = X.shape[0]
     nnz = X.count_nonzero()
-    cdef np.ndarray[int, ndim=1, mode='c'] outer = X.indices
-    cdef np.ndarray[int, ndim=1, mode='c'] inner = X.indptr
+
+    cdef np.ndarray[int, ndim=1, mode='c'] inner = X.indices
+    cdef np.ndarray[int, ndim=1, mode='c'] outer = X.indptr
     cdef np.ndarray[np.float64_t, ndim=1, mode='c'] data = X.data

Run the tests again:

........hello from cpp
---------w0----------
2
----------w1---------
9
2
-----------w2--------
6 0
5 8
----------w3---------

--------y-----------
0
0
0
0
0
--------x-----------
6 1 
2 3 
3 0 
6 1 
4 5 

..................
----------------------------------------------------------------------
Ran 31 tests in 0.481s

OK
root@ubuntu-fastFM2-tests:~/fastFM# 

Seems like it's working now! :stuck_out_tongue_winking_eye: :smiley: :smiley:

ibayer commented 7 years ago

I have pushed all changes to fastFM cython-wrapper and fastFM2 api.

You can reproduce my changes on:

ssh root@165.227.139.54

Please let me know if you want to take over this server, and whether I can decommission 207.154.217.43.

Next I'll

If you want to write the cython wrapper for the fit function please open a new PR.

caos21 commented 7 years ago

Excellent! :smile: :tada:

oh, I didn't think of that :sweat:

Please don't remove the server yet, I want to check and close sessions first.

I will do it. (fit)

ibayer commented 7 years ago

Please don't remove the server yet, I want to check and close sessions first.

I have restored the server from a snapshot that I took before decommissioning. I'm not sure tmux etc. sessions will survive this. :-/ Sorry...

You can reach the old 207.154.217.43 now at:

165.227.133.155

I will do it. (fit)

:smile: :+1:

I'm currently cleaning up setup.py and setting up Travis-Ci. I'll let you know if things are in a stable state again.

caos21 commented 7 years ago

It's OK! The tmux session didn't survive, but thanks.

ibayer commented 7 years ago

Looks like there is still a portability issue with glog. Building glog on one machine and using it on another results in the following error: https://travis-ci.org/ibayer/fastFM/jobs/248565887

g++ -pthread -shared -L/home/travis/miniconda/envs/test-environment/lib -Wl,-rpath=/home/travis/miniconda/envs/test-environment/lib,--no-as-needed build/temp.linux-x86_64-2.7/fastFM/ffm2.o -LfastFM-core2-binaries -LfastFM-core2-binaries -L/home/travis/miniconda/envs/test-environment/lib -lfastFM -lglog -lpython2.7 -o /home/travis/build/ibayer/fastFM/ffm2.so -std=c++11 -mstackrealign

/usr/bin/ld: fastFM-core2-binaries/libglog.a(logging.cc.o): unrecognized relocation (0x2a) in section `.text'

/usr/bin/ld: final link failed: Bad value

collect2: error: ld returned 1 exit status

error: command 'g++' failed with exit status 1

make: *** [all] Error 1

The glog docs (https://github.com/google/glog/blob/master/INSTALL) have some comments about compiling with -static. Update: this seems to be unrelated.

...

Also, if you link binaries statically, make sure that you add
-Wl,--eh-frame-hdr to your linker options. This is required so that
libunwind can find the information generated by the compiler required
for stack unwinding.

Using -static is rare, though, so unless you know this will affect you
it probably won't.
...

I'll play around with the compile flag later.

ibayer commented 7 years ago

The glog issue is investigated further in https://github.com/ibayer/fastFM-core2/issues/20. The Cython wrapper for fit can be discussed in its own PR.

ibayer commented 7 years ago

@caos21 One more thing ;-) please let me know when you are done with the servers so that I can shut them down.

caos21 commented 7 years ago

@ibayer Sorry. You can do it now!