caos21 closed this issue 7 years ago.
Even in that case, I don't follow the problem with Eigen. I will continue on this tomorrow morning.
Thanks!
Seems to be a memory problem... weird because in the C++ tests we don't have these errors.
I'm checking with valgrind. Valgrind reports 0 allocation errors for the C++ tests. On the other hand,
valgrind --log-file="vlog" nosetests
shows some errors. Also, under valgrind the tests complete, but with a mismatch:
AssertionError:
Arrays are not equal
(mismatch 100.0%)
x: array([ 56., 2., 2., 2., 2.])
y: array([ 298., 266., 29., 298., 848.])
----------------------------------------------------------------------
Ran 31 tests in 25.180s
FAILED (failures=1)
Moreover, I modified the test:
def test_ffm2_predict_w0():
    w0, w, V, y, X = get_test_problem()
    w = 0 * w
    V = 0 * V
    y_pred = ffm2.ffm2_predict(w0, w, V, X)
    assert_equal(y_pred, w0)
because ffm2_predict complains when w and V are not array types. That is just a design problem that could be corrected afterwards.
Can you post the valgrind errors? You managed to run the tests before, right? What changed since then? Does the w0 test pass? If so, can you add a test for only w?
Seems to be a memory problem... weird because in the C++ tests we don't have these errors.
Well, now we pass preallocated (Python) memory to C++. I guess all kinds of nasty stuff can happen, like memory going out of scope etc. Maybe running it with valgrind covers up some dangling pointers or something similar. I would focus on locating and fixing the memory errors first, by simplifying the (Python) code and tests, and then try to get the correct results.
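The failure mode, roughly, as a minimal sketch (a std::vector stands in for the NumPy buffer; this is not our actual code):

#include <Eigen/Dense>
#include <vector>

int main() {
    std::vector<double> owner(3, 1.0);               // stand-in for a NumPy buffer
    Eigen::Map<Eigen::VectorXd> w(owner.data(), 3);  // a view: no copy, no ownership
    w *= 2.0;                                        // fine, writes into owner's memory
    owner.resize(1000);                              // the "Python side" reallocates...
    // w.sum();  // ...and the map now dangles: use-after-free, undefined behavior
}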
The valgrind errors are somewhat lengthy; they are stored in the main dir of fastFM, in files with names starting with vlog.
No tests are working. I don't remember running these Python tests; I started by checking whether importing the module worked.
I agree, memory management is critical. Look at this error when running nosetests:
.......*** Error in `/usr/bin/python': free(): invalid next size (fast): 0x0000000001c3da20 ***
It seems the problem is that we are freeing an unallocated address, i.e. an invalid delete...
What is weird is that the C++ tests are working without these memory problems...
Where is GCC_MUSL_ROOT?
I found it.
We should check if we have the same issues with the regular build (not musl gcc).
I'm working on it. But I think it is not the musl build.
It is challenging, and I loved that, but I'm running out of ideas. Maybe it is something sillier.
We really need to narrow things down by commenting out every potential source of trouble: first by reducing the use of the fastFM2 api to a minimum in the python tests, then by simplifying the code in the fastFM2 api as well, commenting out the use of std::map etc. ...
root@ubuntu-16:~/fastFM/fastFM (cython-wrapper-#10 *%)# pip install -e fastFM
Dropping the call to cpp_ffm.predict(m, d) in ffm.pyx gets rid of the memory error -> the error is probably somewhere in predict in fastFM.cpp.
I see that you have already changed a lot of things there (fastFM.cpp); how do you recompile? Does this mean we need to update the hunter path every time?
What's the smallest subset of predict(Model* m, Data* d)
that still causes a memory error?
Does it still crash if you only keep:
void predict(Model* m, Data* d) {
    Data::Impl* data = Internal::get_impl(d);
    Model::Impl* model = Internal::get_impl(m);
    /*
    cd::impl::Predict(data->x_train,
                      model->coef_->getw3(),
                      model->coef_->getw2(),
                      model->coef_->getw1(),
                      model->coef_->getw0(),
                      data->y_pred);
    */
}
Assuming that
cd::impl::Predict(data->x_train,
                  model->coef_->getw3(),
                  model->coef_->getw2(),
                  model->coef_->getw1(),
                  model->coef_->getw0(),
                  data->y_pred);
causes the crash, which argument causes it? Create local model and data instances and use them (it should not crash any more). Then replace the local instances step by step with the function arguments, starting with data->x_train, data->y_pred and model->coef_->getw0(). A sketch of what I mean follows below.
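Something like this, as an untested sketch: the coef_ accessors are copied from the snippet above, the sizes and the rank of 2 are guesses, and it assumes Predict is templated (or takes const refs) so plain Eigen types can stand in for the mapped ones, and that x_train exposes rows()/cols():

void predict(Model* m, Data* d) {
    Data::Impl* data = Internal::get_impl(d);
    Model::Impl* model = Internal::get_impl(m);

    // Eigen-owned locals standing in for the externally mapped buffers.
    Eigen::VectorXd w1     = Eigen::VectorXd::Zero(data->x_train.cols());
    Eigen::MatrixXd w2     = Eigen::MatrixXd::Zero(2, data->x_train.cols());
    Eigen::VectorXd y_pred = Eigen::VectorXd::Zero(data->x_train.rows());

    // If this runs cleanly, swap the locals back one argument at a time.
    cd::impl::Predict(data->x_train,
                      model->coef_->getw3(),
                      w2,
                      w1,
                      model->coef_->getw0(),
                      y_pred);
}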
Hope this helps.
Yes, I did several tests; this kind of operation, res += x * w1;, causes the trouble. Either it is an Eigen memory management problem or we are doing something wrong. I'm testing with custom vectors and matrices.
We don't need to update hunter as long as we do not change compilation flags. What I'm doing is:
root@ubuntu-16:~/fastFM/fastFM2 (api_add_fit *%)# cmake --build _regular/
root@ubuntu-16:~/fastFM (cython-wrapper-#10 *%)# pip uninstall fastFM --yes && make clean && make -j 4 && pip install . --user && nosetests
Thanks for the commands, I was worried it might require more steps. I'll try out a few things too.
this kind of operation, res += x * w1;, causes the trouble
That means if we use custom/dummy values for w1, w2, w3, it should work for only w0. Getting it to work for w0 would be a great way to exclude a range of other possible issues.
Either it is an Eigen memory management problem or we are doing something wrong
Using Eigen with externally allocated memory is more involved than I would have expected. Most of it is handled in fastfm_impl.h. The code looks somewhat brittle, and I wouldn't be surprised if the error comes from there. However, I don't know if / how to simplify the code, and as you have pointed out, the C++ tests have no memory issues (I wouldn't put too much weight on this).
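For what it's worth, even the standard pattern for holding a Map over caller-owned memory is subtle: operator= on a Map assigns element-wise through the view instead of re-seating it, so Eigen's documented way to re-point a member Map is placement new. A minimal self-contained sketch (not the actual fastfm_impl.h code):

#include <Eigen/Dense>
#include <new>

// A coefficient holder that views caller-owned memory without copying.
struct Coef {
    Eigen::Map<Eigen::VectorXd> w1;

    Coef() : w1(nullptr, 0) {}  // empty view until a buffer is attached

    void attach_w1(double* buf, long n) {
        // Placement new re-points the Map; writing `w1 = ...` instead would
        // assign through the old (null) view and assert on the size mismatch.
        new (&w1) Eigen::Map<Eigen::VectorXd>(buf, n);
    }
};

int main() {
    double buf[3] = {1.0, 2.0, 3.0};
    Coef c;
    c.attach_w1(buf, 3);
    c.w1 *= 2.0;  // writes through to buf: {2, 4, 6}
}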
 def test_ffm2_predict_w0():
     w0, w, V, y, X = get_test_problem()
-    w = 0
-    V = 0
+    w[:] = 0
+    V[:, :] = 0
     y_pred = ffm2.ffm_predict(w0, w, V, X)
     assert_equal(y_pred, w0)
Found the first nasty bug; now the test for w0 seems to pass. (w = 0 rebinds the Python name to a plain integer, whereas w[:] = 0 zeroes the existing array in place, so ffm_predict still receives the array whose memory the C++ side maps.)
Rebuild the cython wrapper:
(cd ~/; rm fastFM/*.so; pip install -e fastFM)
Rebuild fastFM2:
(cd ~/fastFM/fastFM2/; cmake --build _builds)
Really? It was that?
Yes, but only for the simplest case, w0. The other one still crashes.
Ok, thanks... did you do this in another branch/deployment/machine? I would like to reproduce the test on the server. Sometimes I got the error and sometimes not.
I have launched an additional machine so that I don't interfere with your work. I can give you access in a few hours.
Adding debug output.
void predict(Model* m, Data* d) {
    Data::Impl* data = Internal::get_impl(d);
    Model::Impl* model = Internal::get_impl(m);
    std::cout << "---------w0----------" << std::endl;
    std::cout << model->coef_->getw0() << std::endl;
    std::cout << "----------w1---------" << std::endl;
    std::cout << model->coef_->getw1() << std::endl;
    std::cout << "-----------w2--------" << std::endl;
    std::cout << model->coef_->getw2() << std::endl;
    std::cout << "----------w3---------" << std::endl;
    std::cout << model->coef_->getw3() << std::endl;
    std::cout << "--------y-----------" << std::endl;
    std::cout << data->y_pred << std::endl;
    std::cout << "--------x-----------" << std::endl;
    std::cout << data->x_train << std::endl;
Sparse matrix mapping fails:
........hello from cpp
---------w0----------
2
----------w1---------
9
2
-----------w2--------
6 0
5 8
----------w3---------
--------y-----------
0
0
0
0
0
--------x-----------
python: /root/.hunter/_Base/033a6ff/14d0f80/75730c8/Install/include/eigen3/Eigen/src/Core/DenseCoeffsBase.h:408: operator[]: Assertion `index >= 0 && index < size()' failed.
lucky guess:
+++ b/fastFM/ffm2.pyx
@@ -8,17 +8,18 @@ from libcpp.memory cimport nullptr
 cimport numpy as np
 import numpy as np

 def ffm_predict(double w_0, double[:] w,
                 np.ndarray[np.float64_t, ndim = 2] V, X):
     assert X.shape[1] == len(w)
     assert X.shape[1] == V.shape[1]
     # get attributes from csc scipy
     n_features = X.shape[1]
     n_samples = X.shape[0]
     nnz = X.count_nonzero()
-    cdef np.ndarray[int, ndim=1, mode='c'] outer = X.indices
-    cdef np.ndarray[int, ndim=1, mode='c'] inner = X.indptr
+
+    cdef np.ndarray[int, ndim=1, mode='c'] inner = X.indices
+    cdef np.ndarray[int, ndim=1, mode='c'] outer = X.indptr
     cdef np.ndarray[np.float64_t, ndim=1, mode='c'] data = X.data
run tests again:
........hello from cpp
---------w0----------
2
----------w1---------
9
2
-----------w2--------
6 0
5 8
----------w3---------
--------y-----------
0
0
0
0
0
--------x-----------
6 1
2 3
3 0
6 1
4 5
..................
----------------------------------------------------------------------
Ran 31 tests in 0.481s
OK
root@ubuntu-fastFM2-tests:~/fastFM#
Seems like it's working now! :stuck_out_tongue_winking_eye: :smiley: :smiley:
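For the record, why swapping the two arrays matters: in scipy's CSC format, indptr is what Eigen calls the outer index (one entry per column, plus one), while indices is the inner (row) index (one entry per stored value). Handing them over swapped gives Eigen index arrays of the wrong lengths, which is exactly the kind of thing that trips the index >= 0 && index < size() assertion above. A standalone sketch, assuming an Eigen version with Map<SparseMatrix> (3.3+):

#include <Eigen/SparseCore>
#include <iostream>

int main() {
    // A 3x2 CSC matrix with 4 stored values:
    // column 0 holds 6 (row 0) and 2 (row 1); column 1 holds 1 (row 0) and 3 (row 2).
    double values[] = {6.0, 2.0, 1.0, 3.0};
    int    inner[]  = {0, 1, 0, 2};  // scipy X.indices: row of each stored value
    int    outer[]  = {0, 2, 4};     // scipy X.indptr: where each column starts

    // Maps the buffers without copying; outer and inner must not be swapped.
    Eigen::Map<Eigen::SparseMatrix<double>> X(3, 2, 4, outer, inner, values);
    std::cout << X << std::endl;
}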
I have pushed all changes to the fastFM cython-wrapper and the fastFM2 api.
You can reproduce my changes on:
ssh root@165.227.139.54
Please let me know if you want to take over this server and if I can decommission 207.154.217.43.
Next I'll
If you want to write the cython wrapper for the fit function, please open a new PR.
Excellent! :smile: :tada:
oh, I didn't think of that :sweat:
Please don't remove the server yet, I want to check and close sessions first.
I will do it. (fit)
Please don't remove the server yet, I want to check and close sessions first.
I have restored the server from a snapshot that I took before decommissioning. I'm not sure tmux etc. sessions will survive this. :-/ Sorry...
You can reach the old 207.154.217.43 now as:
165.227.133.155
I will do it. (fit)
:smile: :+1:
I'm currently cleaning up setup.py and setting up Travis CI. I'll let you know when things are in a stable state again.
It's ok! The tmux session didn't survive, but thanks.
Looks like there is still a portability issue with glog. Building glog on one machine and using it on another results in the following error. https://travis-ci.org/ibayer/fastFM/jobs/248565887
g++ -pthread -shared -L/home/travis/miniconda/envs/test-environment/lib -Wl,-rpath=/home/travis/miniconda/envs/test-environment/lib,--no-as-needed build/temp.linux-x86_64-2.7/fastFM/ffm2.o -LfastFM-core2-binaries -LfastFM-core2-binaries -L/home/travis/miniconda/envs/test-environment/lib -lfastFM -lglog -lpython2.7 -o /home/travis/build/ibayer/fastFM/ffm2.so -std=c++11 -mstackrealign
/usr/bin/ld: fastFM-core2-binaries/libglog.a(logging.cc.o): unrecognized relocation (0x2a) in section `.text'
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
error: command 'g++' failed with exit status 1
make: *** [all] Error 1
The glog doc https://github.com/google/glog/blob/master/INSTALL has some comments about compiling with -static. Update: seems to be unrelated. (For what it's worth, relocation type 0x2a is R_X86_64_REX_GOTPCRELX; an "unrecognized relocation" usually means the linker doing the final link is older than the binutils/gcc that built the object files, which would fit building glog on one machine and linking on another.)
...
Also, if you link binaries statically, make sure that you add
-Wl,--eh-frame-hdr to your linker options. This is required so that
libunwind can find the information generated by the compiler required
for stack unwinding.
Using -static is rare, though, so unless you know this will affect you
it probably won't.
...
I'll play around with the compile flag later.
The glog issue is further investigated in https://github.com/ibayer/fastFM-core2/issues/20.
The cython wrapper for fit can be discussed in its own PR.
@caos21 One more thing ;-) Please let me know when you are done with the servers so that I can shut them down.
@ibayer Sorry. You can do it now!
It compiles, but I did not check it yet; I am still thinking about how to do this. I got one warning, but it is not important.