Yes. Do you have SSH access to a ppc64 machine that I can use to fix these issues? @Jokeren's optimizations PR broke the VSX part because the interfaces were changed.
Nope, it's a university machine; I have only user-level access to it.
@soumith are you able to explain what dependency is broken? I can't find the PR you mentioned.
@soumith Looks like the issue is that it looks for THFloatVector_adds_VSX but the function is THFloatVector_add_VSX. Commit 5ca6516e adds an 's' onto a whole bunch of things; maybe this introduced the problem?
I'm running into this error when trying to install modules via luarocks (and so cannot run tests that require these modules). I would appreciate any help in getting this resolved! @soumith I'm happy to try things out on a ppc64le machine on your behalf, or you could use qemu, SuperVessel or osuosl to set up a ppc test environment.
I'm running into the same issue.
@soumith it is failing with a "not declared" error for THFloatVector_adds_VSX and THFloatVector_muls_VSX. Also, there is the minicloud, where you can create and use a PPC64le machine.
I've requested access to minicloud, and once I get access I shall fix the issue.
@gchanan is going to look into it, we got access to minicloud.
Hi guys, any news on the subject?
https://github.com/torch/torch7/pull/990 seems to fix the compile issue. At least with my combination of OS (CentOS 7), compiler (gcc 4.8.x), etc., tests were failing because ptrdiff_t is defined to be unsigned long (it definitely shouldn't be), and my attempts at quick workarounds didn't fix all the issues. You might want to try with an up-to-date compiler and see if it works for you.
pytorch should now be building with master.
But as @gchanan noted, when we used the minicloud systems, ptrdiff_t was defined as unsigned long, which makes no sense (it has to be signed). The OS was CentOS 7 and the compiler was gcc 4.8.3.
We are assuming that a more up-to-date compiler or a different compiler fixes this on PPC64, and that's how you guys have been using torch7 under the https://github.com/PPC64/ branch.
If you are going to try pytorch on an OpenPower system, please run unit tests and make sure they pass.
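As a quick sanity check on a given toolchain, a minimal compile-time probe like the one below (illustrative code only, not part of pytorch; the file name and typedef name are made up) refuses to compile whenever ptrdiff_t is unsigned:
/* check_ptrdiff.c - illustrative only, not part of the pytorch build.
 * The typedef gets a negative array size (a compile error) whenever
 * ptrdiff_t is an unsigned type. */
#include <stddef.h>

typedef char ptrdiff_t_must_be_signed[(ptrdiff_t)-1 < 0 ? 1 : -1];

int main(void) { return 0; }
Building it with the same gcc used for torch (gcc check_ptrdiff.c) should succeed on a conforming platform and fail on the minicloud systems described above.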
Hi,
pytorch is now building for me, but the unit tests give me a segmentation fault:
(gan) user@minsky31:~$ repositories/pytorch/test/run_test.sh
~/repositories/pytorch/test ~
Running torch tests
...repositories/pytorch/test/run_test.sh: line 24: 66034 Segmentation fault (core dumped) $PYCMD test_torch.py $@
I am using gcc (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
yes, see my comment above. Something is weird on IBM. @tomsercu if there's a chance you can ask one of the IBM engineers about https://github.com/pytorch/pytorch/issues/922#issuecomment-290814414 and why it deviates from the OpenPower spec, it would be great.
@soumith, I see. But are we sure this is true for all OSes and compilers? I am running Ubuntu 16.04 with gcc 4.8.5, and when I compile the following code with the -Wconversion flag it gives me no warning, while the same is not true if I use an unsigned long variable.
(pytorch) pedro@hal9000:~/repositories/pytorch$ cat testptr.c
#include "stdio.h"
#include "stddef.h"
int main(){
ptrdiff_t var = -3;
return 0;
}
(pytorch) pedro@hal9000:~/repositories/pytorch$ gcc -Wconversion testptr.c
(pytorch) pedro@hal9000:~/repositories/pytorch$
(pytorch) pedro@hal9000:~/repositories/pytorch$ cat testptr2.c
#include "stdio.h"
#include "stddef.h"
int main(){
unsigned long var = -3;
return 0;
}
(pytorch) pedro@hal9000:~/repositories/pytorch$ gcc -Wconversion testptr2.c
testptr2.c: In function ‘main’:
testptr2.c:4:5: warning: negative integer implicitly converted to unsigned type [-Wsign-conversion]
unsigned long var = -3;
^
Is this a valid test? Thanks
Just from the spec:
std::ptrdiff_t is the signed integer type of the result of subtracting two pointers.
I'm just saying that on the platform we saw, ptrdiff_t was unsigned long. We are not sure why. But that's the reason for the segfault and the incorrect behavior of the rest of pytorch under PPC64 in what we tested.
Your test is valid. We want ptrdiff_t to be signed, and in your test case it correctly gives no warning when using ptrdiff_t, which is a signed type.
When I compile @pedropgusmao's tests, same as for him, there is no problem with ptrdiff_t, i.e. for me it is signed as it should be. That is on RHEL 7 with gcc 4.8.5.
So this might be specific to minicloud? I'm checking whether we can give you access to an IBM machine.
If ptrdiff_t is unsigned on your test machine, does that mean it is incorrectly defined in the Linux header files? In accordance with the ELFv2 ABI, the definition of ptrdiff_t should be
• typedef long long ptrdiff_t;
Weird, for both my ppc64le machines - CentOS 7.2 with gcc 4.8.5 and Ubuntu 16.04 with gcc 4.5.x - ptrdiff_t was signed.
OK, then we'll chalk this up to some weirdness with the minicloud systems. If you guys compile pytorch on PPC64 and all unit tests pass, then you're good to go :)
At first sight there is no existing way to give you access to a Power machine at IBM, but the recommendation is that Power users obtain virtual machines on NIMBIX (a Power cloud with support for up-to-date OSes). Tomorrow I'll compile torch & pytorch with this recent fix and report back if there are any issues or segfaults.
I have compiled pytorch from source and, although compilation runs fine, I still get:
(gan) user@minsky31:~/repositories/pytorch/test$ ./run_test.sh
~/repositories/pytorch/test ~/repositories/pytorch/test
Running torch tests
..../run_test.sh: line 24: 88566 Segmentation fault (core dumped) $PYCMD test_torch.py $@
GDB gives me:
(gan) user@minsky31:~/repositories/pytorch/test$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) run test_torch.py
Starting program: /home/user/miniconda3/envs/gan/bin/python test_torch.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
[New Thread 0x3fff18eef1a0 (LWP 89871)]
...
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00003fff346d9ff4 in THDoubleVector_muls_VSX () from /home/user/miniconda3/envs/gan/lib/python3.6/site-packages/torch/lib/libTH.so.1
FYI, I also tried to run the tests in https://github.com/pytorch/pytorch/blob/master/torch/lib/TH/vector/VSX.c (where THDoubleVector_muls_VSX() is defined) and got some definition problems. It seems that standardDouble_mul() and standardFloat_mul() were defined, but the code actually required standardDouble_muls() and standardFloat_muls(), as noticed by @RashmicaG.
Modifying the code to use the new function names resulted in all tests passing:
standardDouble_fill() test took 0.33023 seconds
THDoubleVector_fill_VSX() test took 0.16157 seconds
All assertions PASSED for THDoubleVector_fill_VSX() test.
standardFloat_fill() test took 0.31468 seconds
THFloatVector_fill_VSX() test took 0.08073 seconds
All assertions PASSED for THFloatVector_fill_VSX() test.
standardDouble_adds() test took 0.45609 seconds
THDoubleVector_adds_VSX() test took 0.30265 seconds
All assertions PASSED for THDoubleVector_adds_VSX() test.
standardFloat_adds() test took 0.38353 seconds
THFloatVector_adds_VSX() test took 0.14310 seconds
All assertions PASSED for THFloatVector_adds_VSX() test.
standardDouble_diff() test took 0.42574 seconds
THDoubleVector_diff_VSX() test took 0.29341 seconds
All assertions PASSED for THDoubleVector_diff_VSX() test.
standardFloat_diff() test took 0.48570 seconds
THFloatVector_diff_VSX() test took 0.15327 seconds
All assertions PASSED for THFloatVector_diff_VSX() test.
standardDouble_scale() test took 0.31753 seconds
THDoubleVector_scale_VSX() test took 0.17237 seconds
All assertions PASSED for THDoubleVector_scale_VSX() test.
standardFloat_scale() test took 0.31478 seconds
THFloatVector_scale_VSX() test took 0.08644 seconds
All assertions PASSED for THFloatVector_scale_VSX() test.
standardDouble_muls() test took 0.45221 seconds
THDoubleVector_muls_VSX() test took 0.29897 seconds
All assertions PASSED for THDoubleVector_muls_VSX() test.
standardFloat_muls() test took 0.38739 seconds
THFloatVector_muls_VSX() test took 0.14117 seconds
All assertions PASSED for THFloatVector_muls_VSX() test.
Finished runnning all tests. All tests PASSED.
@pedropgusmao Hi, I see what you mean:
$ gcc ./torch/lib/TH/vector/VSX.c -o vsx -DRUN_VSX_TESTS
./torch/lib/TH/vector/VSX.c: In function ‘test_THDoubleVector_muls_VSX’:
./torch/lib/TH/vector/VSX.c:1739:5: warning: implicit declaration of function ‘standardDouble_muls’ [-Wimplicit-function-declaration]
standardDouble_muls(y_standard, x, VSX_PERF_NUM_TEST_ELEMENTS );
^
./torch/lib/TH/vector/VSX.c: In function ‘test_THFloatVector_muls_VSX’:
./torch/lib/TH/vector/VSX.c:1811:5: warning: implicit declaration of function ‘standardFloat_muls’ [-Wimplicit-function-declaration]
standardFloat_muls(y_standard, x, VSX_PERF_NUM_TEST_ELEMENTS );
^
/tmp/ccTWsGeZ.o: In function `test_THDoubleVector_muls_VSX':
VSX.c:(.text+0xbfcc): undefined reference to `standardDouble_muls'
VSX.c:(.text+0xbfe4): undefined reference to `standardDouble_muls'
VSX.c:(.text+0xbffc): undefined reference to `standardDouble_muls'
VSX.c:(.text+0xc014): undefined reference to `standardDouble_muls'
VSX.c:(.text+0xc134): undefined reference to `standardDouble_muls'
/tmp/ccTWsGeZ.o:VSX.c:(.text+0xc168): more undefined references to `standardDouble_muls' follow
/tmp/ccTWsGeZ.o: In function `test_THFloatVector_muls_VSX':
VSX.c:(.text+0xc534): undefined reference to `standardFloat_muls'
VSX.c:(.text+0xc54c): undefined reference to `standardFloat_muls'
VSX.c:(.text+0xc564): undefined reference to `standardFloat_muls'
VSX.c:(.text+0xc57c): undefined reference to `standardFloat_muls'
VSX.c:(.text+0xc69c): undefined reference to `standardFloat_muls'
/tmp/ccTWsGeZ.o:VSX.c:(.text+0xc6d0): more undefined references to `standardFloat_muls' follow
collect2: error: ld returned 1 exit status
But after adding your diff:
diff --git i/torch/lib/TH/vector/VSX.c w/torch/lib/TH/vector/VSX.c
index 796d3b8..fbc7d61 100644
--- i/torch/lib/TH/vector/VSX.c
+++ w/torch/lib/TH/vector/VSX.c
@@ -1105,13 +1105,13 @@ static void standardFloat_scale(float *y, const float c, const ptrdiff_t n)
y[i] *= c;
}
-static void standardDouble_mul(double *y, const double *x, const ptrdiff_t n)
+static void standardDouble_muls(double *y, const double *x, const ptrdiff_t n)
{
for (ptrdiff_t i = 0; i < n; i++)
y[i] *= x[i];
}
-static void standardFloat_mul(float *y, const float *x, const ptrdiff_t n)
+static void standardFloat_muls(float *y, const float *x, const ptrdiff_t n)
{
for (ptrdiff_t i = 0; i < n; i++)
y[i] *= x[i];
I could compile and run VSX's tests; however, I still couldn't get torch's tests to pass. They segfault as you said in https://github.com/pytorch/pytorch/issues/922#issuecomment-291429297
Has anybody made progress on this?
I'm still digging, but I believe there is a problem in the interface between torch.mul() and the vector multiplication functions in VSX.c. Also, I think the functions in VSX.c are only called when the multiplication does not use slices. I say this because the segmentation fault that emerges from torch's test suite is actually generated by test_abs and not by test_mul.
You can check this by running:
$ python path_to_pytorch/test/test_torch.py TestTorch.test_abs
Segmentation fault (core dumped)
and
$ python path_to_pytorch/test/test_torch.py TestTorch.test_mul
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
Now, if you check the code for each of these tests you will see that test_abs uses tensor.mul(value) while test_mul uses sliced_tensor[:,3].mul_(value).
This becomes clearer if you add the following tests to the same script:
def test_mul_slice_mat(self):
    m1 = torch.ones(10, 10)
    m = m1[:, 0].mul(5)

def test_mul_vec(self):
    m1 = torch.ones(10, 1)
    m = m1.mul(5)
and run them as before.
At least for me, the sliced version gives no error while the normal version gives me a segfault.
Now, compiling pytorch with a DEBUG flag (I think I just ran export DEBUG=1 before compiling it) and using gdb with a breakpoint at path_to_pytorch/torch/lib/TH/vector/VSX.c:408 (inside the function but outside the loop) lets me see that test_abs enters THDoubleVector_muls_VSX while test_mul doesn't.
(gan) user@minsky31:~/repositories/pytorch$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
....
(gdb) b /home/user/repositories/pytorch/torch/lib/TH/vector/VSX.c:408
No source file named /home/user/repositories/pytorch/torch/lib/TH/vector/VSX.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (/home/user/repositories/pytorch/torch/lib/TH/vector/VSX.c:408) pending.
(gdb) run test/test_torch.py TestTorch.test_abs
Starting program: /home/user/miniconda3/envs/gan/bin/python test/test_torch.py TestTorch.test_abs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
[New Thread 0x3fff156ef1a0 (LWP 3286)]
Thread 1 "python" hit Breakpoint 1, THDoubleVector_muls_VSX (y=0x10ba1e00, x=0x10b9fe00, n=70367515247288) at /home/user/repositories/pytorch/torch/lib/TH/vector/VSX.c:411
411 for (i = 0; i <= n-24; i += 24)
and
(gan) user@minsky31:~/repositories/pytorch$ gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
...
(gdb) b /home/user/repositories/pytorch/torch/lib/TH/vector/VSX.c:408
No source file named /home/user/repositories/pytorch/torch/lib/TH/vector/VSX.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (/home/user/repositories/pytorch/torch/lib/TH/vector/VSX.c:408) pending.
(gdb) run test/test_torch.py TestTorch.test_mul
Starting program: /home/user/miniconda3/envs/gan/bin/python test/test_torch.py TestTorch.test_mul
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
[New Thread 0x3fff156ef1a0 (LWP 3396)]
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
[Thread 0x3fff156ef1a0 (LWP 3396) exited]
[Inferior 1 (process 3393) exited normally]
At this point, if we check the values of n and y inside THDoubleVector_muls_VSX (for test_abs), we see some strange things:
(gdb) p n
$1 = 70367515247288
(gdb) p *x
$2 = 0.69646918727084994
(gdb) p *y
$3 = 0
(gdb) p *(y+9)
$4 = 0
(gdb) p *(y+10)
$5 = 0
(gdb) p *(x+9)
$6 = 0.49111893260851502
(gdb) p *(x+10)
$7 = 0.42310646059922874
(gdb) p *x
$8 = 0.69646918727084994
(gdb) p *(x+999)
$9 = 0.56434643105603755
(gdb) p *(x+1000)
$10 = 0
(gdb) p *(y+999)
$11 = 0
(gdb) p *(y+1000)
$12 = 0
(gdb) p *(y)
$13 = 0
It is as if x is correctly passed (its size should be 1000 in test_abs); y is not correctly passed; and n, which I believe should be the length of each vector (expanded in the case of multiplication by a constant), is just a huge number.
If someone could point me to where mul calls THDoubleVector_muls_VSX and the like, I could keep digging.
OK, I have made some progress...
$ python test/test_torch.py
......FF.F...F..............sss......F..F.........s.....terminate called after throwing an instance of 'THException'
what(): Invalid index in gather at /home/gut/pytorch/torch/lib/TH/generic/THTensorMath.c:441
Aborted
It turns out the interface changed and the code was not updated. THDoubleVector_muls calls THDoubleVector_muls_VSX with 4 parameters, but the interface in torch/lib/TH/vector/VSX.c only declares 3. Something goes bad and n is corrupted:
#1 0x00003fff90178ea0 in THDoubleVector_muls (y=0x20c3f8c0, x=0x20c3d900, c=1000, n=1000) at /home/gut/pytorch/torch/lib/TH/generic/THVectorDispatch.c:171
171 THVector_(muls_DISPATCHPTR)(y, x, c, n);
(gdb) down
#0 THDoubleVector_muls_VSX (y=0x20c3f8c0, x=0x20c3d900, n=70366857724600) at /home/gut/pytorch/torch/lib/TH/vector/VSX.c:412
So I had to apply the first diff referenced in https://github.com/pytorch/pytorch/issues/922#issuecomment-291830986 and then this one:
index fbc7d61..2cb80a4 100644
--- i/torch/lib/TH/vector/VSX.c
+++ w/torch/lib/TH/vector/VSX.c
@@ -399,7 +399,7 @@ static void THDoubleVector_scale_VSX(double *y, const double c, const ptrdiff_t
}
-static void THDoubleVector_muls_VSX(double *y, const double *x, const ptrdiff_t n)
+static void THDoubleVector_muls_VSX(double *y, const double *x, const double c, const ptrdiff_t n)
{
ptrdiff_t i;
Why is c not needed on VSX? Should it be? Well, the internal tests call test_THDoubleVector_muls, so the parameters need to be reviewed...
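To make the corrupted n concrete, here is a minimal standalone demo (hypothetical code, not from pytorch; all names are made up) of calling a three-parameter function through a four-parameter function-pointer type, the same cast pattern as the dispatch table. This is undefined behaviour, and in practice the callee's n is read from a location the caller never set:
/* mismatch_demo.c - hypothetical illustration only, not pytorch code. */
#include <stddef.h>
#include <stdio.h>

/* Callee declares three parameters, like the current VSX prototype. */
static void takes_three(double *y, const double *x, ptrdiff_t n)
{
    (void)y; (void)x;
    printf("callee sees n = %td\n", n);   /* typically garbage, not 4 */
}

/* Caller-side type has four parameters, like the muls dispatch above. */
typedef void (*four_arg_fn)(double *, const double *, double, ptrdiff_t);

int main(void)
{
    double y[4] = {0}, x[4] = {1, 2, 3, 4};
    four_arg_fn f = (four_arg_fn)takes_three;  /* undefined behaviour when called */
    f(y, x, 2.0, 4);                           /* caller passes 4 arguments */
    return 0;
}
Compiled and run, the printed n is usually an arbitrary value rather than 4, which matches the huge n seen in gdb above.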
There's a huge mistake here. @gut really triggered my memory. So, THXVector_mul(double *y, const double *x, const ptrdiff_t n) used to be the operation y = y * x.
See this old commit that has the non-SIMD versions: https://github.com/torch/torch7/blob/1e86025cfa78e13da16835cfc8d459eedfb71f15/lib/TH/generic/THVectorDefault.c
After some commits from Keren Zhou revamping our SIMD stuff, we removed that function and added THXVector_muls(double *y, const double *x, const double c, const ptrdiff_t n), which is y = x * c.
Clearly they are not compatible. I wonder if we should remove the VSX.c file, unless @gut or @tomsercu want to actually retain the VSX optimizations and fix the functions to implement this new prototype. I myself don't know the VSX instruction set well enough to fix this.
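In scalar form (a sketch only, to make the mismatch concrete; the names old_mul/new_muls are illustrative, not the real TH symbols), the two operations are:
#include <stddef.h>
#include <stdio.h>

/* Old interface: in-place elementwise product, y = y * x. */
static void old_mul(double *y, const double *x, const ptrdiff_t n)
{
    for (ptrdiff_t i = 0; i < n; i++)
        y[i] *= x[i];
}

/* New interface: multiply by a scalar constant, y = x * c. */
static void new_muls(double *y, const double *x, const double c, const ptrdiff_t n)
{
    for (ptrdiff_t i = 0; i < n; i++)
        y[i] = x[i] * c;
}

int main(void)
{
    double x[3] = {1, 2, 3}, y[3] = {10, 20, 30};
    old_mul(y, x, 3);        /* y becomes {10, 40, 90} */
    new_muls(y, x, 2.0, 3);  /* y becomes {2, 4, 6}; x is only read */
    printf("%g %g %g\n", y[0], y[1], y[2]);
    return 0;
}
So renaming the VSX functions is not enough; their bodies and loop kernels have to be rewritten for the new semantics.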
@soumith thanks for the explanation. That looks clear to me. I'll start implementing it and will reply to this issue soon to report my progress.
By applying this diff, the VSX test looks ok on my machine:
standardDouble_fill() test took 3.52016 seconds
THDoubleVector_fill_VSX() test took 0.61788 seconds
All assertions PASSED for THDoubleVector_fill_VSX() test.
standardFloat_fill() test took 3.28862 seconds
THFloatVector_fill_VSX() test took 0.20691 seconds
All assertions PASSED for THFloatVector_fill_VSX() test.
standardDouble_adds() test took 3.23111 seconds
THDoubleVector_adds_VSX() test took 1.26806 seconds
All assertions PASSED for THDoubleVector_adds_VSX() test.
standardFloat_adds() test took 3.09554 seconds
THFloatVector_adds_VSX() test took 0.61319 seconds
All assertions PASSED for THFloatVector_adds_VSX() test.
standardDouble_diff() test took 3.48718 seconds
THDoubleVector_diff_VSX() test took 1.20493 seconds
All assertions PASSED for THDoubleVector_diff_VSX() test.
standardFloat_diff() test took 3.23055 seconds
THFloatVector_diff_VSX() test took 0.56289 seconds
All assertions PASSED for THFloatVector_diff_VSX() test.
standardDouble_scale() test took 3.15807 seconds
THDoubleVector_scale_VSX() test took 0.83008 seconds
All assertions PASSED for THDoubleVector_scale_VSX() test.
standardFloat_scale() test took 3.13747 seconds
THFloatVector_scale_VSX() test took 0.42771 seconds
All assertions PASSED for THFloatVector_scale_VSX() test.
standardDouble_muls() test took 3.22186 seconds
THDoubleVector_muls_VSX() test took 0.86646 seconds
All assertions PASSED for THDoubleVector_muls_VSX() test.
standardFloat_muls() test took 3.15087 seconds
THFloatVector_muls_VSX() test took 0.41949 seconds
All assertions PASSED for THFloatVector_muls_VSX() test.
Finished runnning all tests. All tests PASSED.
However the test suite still doesn't work:
python test/test_torch.py
...F..F..terminate called after throwing an instance of 'THException'
what(): given function should return a number at /home/gut/pytorch/torch/csrc/generic/TensorMethods.cpp:3888
Aborted
When debugging, I see that:
#8 0x00003fff8fdac2ac in _THError (file=0x3fff9067ee00 "/home/gut/pytorch/torch/csrc/generic/TensorMethods.cpp", line=3888,
fmt=0x3fff9067f520 "given function should return a number") at /home/gut/pytorch/torch/lib/TH/THGeneral.c:54
54 (*defaultErrorHandler)(msg, defaultErrorHandlerData);
(gdb) up
#9 0x00003fff902c23c4 in THPDoubleTensor_apply (self=0x3fff8f23a7e8, arg=0x3fff8f1f08c0) at /home/gut/pytorch/torch/csrc/generic/TensorMethods.cpp:3877
3877 TH_TENSOR_APPLY(real, tensor,
(gdb ppc) l
3872 THPUtils_setError("apply requires a callable as it's first argument");
3873 return NULL;
3874 }
3875
3876 THTensor *tensor = self->cdata;
3877 TH_TENSOR_APPLY(real, tensor,
3878 PyObject *ret =
3879 PyObject_CallFunction(arg, (char*)BUILD_REAL_FMT, *tensor_data);
3880 if (!ret)
3881 return NULL;
(gdb)
3882 if (!THPUtils_(checkReal)(ret)) {
3883 Py_DECREF(ret);
3884 THError("given function should return a number");
3885 }
3886 *tensor_data = THPUtils_(unpackReal)(ret);
3887 Py_DECREF(ret);
3888 );
3889
3890 Py_INCREF(self);
3891 return (PyObject*)self;
Any hints? I'm analysing this, but I'm quite new to torch.
Hi @gut, I am currently working on it. I guess by tomorrow I will have a VSX.c file with tests etc...
The problem, I guess, is that not only muls and cmul changed but also adds. I am currently writing/modifying calls for https://github.com/pytorch/pytorch/blob/master/torch/lib/TH/generic/THVectorDefault.c
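For reference, and as far as I can tell from THVectorDefault.c (treat this as a sketch and double-check against the real file; the _ref names are illustrative), the scalar versions of the other ops that need matching signatures look roughly like this:
#include <stddef.h>

/* adds: add a scalar constant, y = x + c. */
static void adds_ref(double *y, const double *x, const double c, const ptrdiff_t n)
{
    for (ptrdiff_t i = 0; i < n; i++)
        y[i] = x[i] + c;
}

/* cmul: out-of-place elementwise product, z = x * y. */
static void cmul_ref(double *z, const double *x, const double *y, const ptrdiff_t n)
{
    for (ptrdiff_t i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}

int main(void)
{
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6}, z[3];
    adds_ref(z, x, 1.0, 3);  /* z = {2, 3, 4} */
    cmul_ref(z, x, y, 3);    /* z = {4, 10, 18} */
    return 0;
}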
@pedropgusmao alright. I hope the https://gist.github.com/gut/0bac048d1539b26d326c7eb231e92df9 helps you.
Hi @gut, indeed we were working on the same thing. Could you please check out https://gist.github.com/pedropgusmao/fe283d613a3f47ea57b3bf6f81f85fed and see if it works for you. Right now I get 147/157 passing tests. The ones that are not passing are: test_linspace, test_logspace, test_gather, test_scatter, test_scatterFill, test_masked_copy, test_view, test_unsqueeze, test_numpy_unresizable and test_apply. The exceptions are of type THArgException and THException.
Hi @soumith, the previous tests are failing during assertRaises. They are each throwing a specific TH exception which is not being accepted as a RuntimeError. Any idea why this is happening? Thanks
test_linspace:
throws : 'THArgException'
expects: RuntimeError
test_logspace:
throws : 'THArgException'
expects: RuntimeError
test_gather:
throws : 'THException'
expects: RuntimeError
test_scatter:
throws : 'THException'
expects: RuntimeError
test_scatterFill:
throws : 'THException'
expects: RuntimeError
test_masked_copy:
throws : 'THException'
expects: RuntimeError
test_view:
throws : 'THArgException'
expects: RuntimeError
test_unsqueeze:
throws : 'THArgException'
expects: RuntimeError
test_numpy_unresizable:
throws : 'THException'
expects: RuntimeError
test_apply:
throws : 'THException'
expects: RuntimeError
@pedropgusmao Hi. It looks like it solved the tests that I saw failing:
.........terminate called after throwing an instance of 'THException'
what(): given function should return a number at /home/gut/pytorch/torch/csrc/generic/TensorMethods.cpp:3888
Aborted
Please note that the execution is being aborted, so I don't know what happened for all tests and I'm curious why yours doesn't abort as well.
Hi @gut, I had to run one test at a time.
I first got a list containing the names for each test:
grep 'def test_' pytorch/test/test_torch.py | sed 's/^[^t]*//'| sed 's/(self)\://' > pytorch/list_tests.txt
then I ran each one of those individually:
while read in; do echo "$in" && python pytorch/test/test_torch.py TestTorch."$in"; done < pytorch/list_tests.txt &> test_passed.txt
If this also works for you and if the "exception" problem is not related to the VSX.c call, then I will open a PR for this file. Thanks
Hello @pedropgusmao, nice. Now I see that the one aborting the test suite looks like it's test_apply, as it's the only one with the exception "given function should return a number at /home/gut/pytorch/torch/csrc/generic/TensorMethods.cpp:3888".
And these are all the exceptions for the test_torch.py suite. I'll try to take a look at them. Whenever I find out something, I'll post it here:
$ grep pytorch test_passed.txt
what(): invalid number of points at /home/gut/pytorch/torch/lib/TH/generic/THTensorMath.c:2970
what(): invalid number of points at /home/gut/pytorch/torch/lib/TH/generic/THTensorMath.c:2993
what(): Invalid index in gather at /home/gut/pytorch/torch/lib/TH/generic/THTensorMath.c:441
what(): Invalid index in scatter at /home/gut/pytorch/torch/lib/TH/generic/THTensorMath.c:466
what(): Invalid index in scatter at /home/gut/pytorch/torch/lib/TH/generic/THTensorMath.c:489
what(): Number of elements of src < number of ones in mask at /home/gut/pytorch/torch/lib/TH/generic/THTensorMath.c:168
what(): size '[15 x 0]' is invalid for input of with 15 elements at /home/gut/pytorch/torch/lib/TH/THStorage.c:59
what(): cannot unsqueeze empty tensor at /home/gut/pytorch/torch/lib/TH/generic/THTensor.c:530
what(): Trying to resize storage that is not resizable at /home/gut/pytorch/torch/lib/TH/generic/THStorage.c:183
what(): given function should return a number at /home/gut/pytorch/torch/csrc/generic/TensorMethods.cpp:3888
okay, for some reason THArgException isn't being rewrapped as a RuntimeError on this platform. @apaszke do you have ideas on why?
Well, it builds now, so I guess this issue can be closed. However, the tests are still not passing, so we might want to open another one.
guys? @soumith @apaszke
I guess we can just rename this one. Any clues on what causes the tests to fail?
Or nvm, there are so many messages that it's probably better to move the discussion elsewhere -> #1297.
Hate to re-open such a long thread :P
I'm trying to install PyTorch on a PPC64 system. The currently available instructions for installing PyTorch (https://github.com/pytorch/pytorch#from-source) fail because the pip whl and conda channel do not have binaries for PPC64.
It seems like the old egg install files for version 0.1 worked on PPC, but is there a recommended way to install the recent version 0.2 on PPC systems?
Also, when building from source on PPC, the compiler seems to go into an infinite loop on the output line
[ 6%] Building C object CMakeFiles/TH.dir/THTensor.c.o
..and this turns out to be a pretty hard problem to fix.
Currently the only way to install pytorch on ppc64le is by building it from source. I am using Ubuntu 16.04. Download the 0.2.0 branch, then follow some of the instructions in the Dockerfile for installing conda, with a slight modification for ppc64: curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-ppc64le.sh (there should be a newer version). chmod the .sh file and install Anaconda, then proceed to install the required packages minus mkl. I also built magma-cuda version 2.2 (http://icl.cs.utk.edu/projectsfiles/magma/downloads/magma-2.2.0.tar.gz), and then ran /opt/conda/bin/python setup.py install. I am also using CUDA, but it should work with CPU only. There are some failing tests in test_cuda when using CharTensor; all the other functional tests seem to pass.
Unfortunately, I don't have docker to work with. All I have is a native Anaconda running on PPC64.
Still... you're telling me that setup.py ran with no modifications inside your PPC64le environment?
I've built it both in Docker and on bare metal. It built "out of the box", with no mods to setup.py. I've built it both with Anaconda (Python 3.5) and outside of Anaconda (Python 2.7 and 3.5).
I see. It may be that I'm having a compilation problem that is not specific to ppc64le. Perhaps this belongs in another thread, but thanks, avmgithub, for trying to help!
Here's the problem for anyone who sees this: the compiler stalls on the line
[ 6%] Building C object CMakeFiles/TH.dir/THTensor.c.o
The compiler runs forever without running out of memory. Seems to be in some sort of infinite loop. Stack Exchange seems to think this is a known problem with gcc when using the -O* flags. However, I tried hacking the CMakeLists file to remove the -O flags, and the problem persists.
I tried building pytorch with gcc 5.4.0 on POWER8 (ppc64le) with Ubuntu 16.04.3. It works fine. I did not test all the test suites, but at least the MNIST example runs quite well. I just ignored mkl and installed openblas instead, as you recommended. You can find the log here.
It looks like libTH tries to dispatch to VSX-specific versions of the functions, which are not defined.
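For anyone new to this code, the dispatch mechanism mentioned here works roughly like the sketch below (illustrative only, not the real THVectorDispatch.c; all names are made up): a function pointer starts at a scalar fallback and is swapped to a VSX-specific kernel when the CPU supports it, which is why a missing or mis-declared VSX symbol shows up as an undefined reference or a crash.
/* dispatch_sketch.c - illustrative only, not pytorch code. */
#include <stddef.h>
#include <stdio.h>

typedef void (*muls_fn)(double *, const double *, double, ptrdiff_t);

/* Scalar fallback, y = x * c. */
static void muls_default(double *y, const double *x, double c, ptrdiff_t n)
{
    for (ptrdiff_t i = 0; i < n; i++)
        y[i] = x[i] * c;
}

/* Stand-in for the kernel that would live in VSX.c. */
static void muls_vsx(double *y, const double *x, double c, ptrdiff_t n)
{
    for (ptrdiff_t i = 0; i < n; i++)   /* the real one would use VSX intrinsics */
        y[i] = x[i] * c;
}

static muls_fn muls_dispatch = muls_default;

static void muls_pick(int cpu_has_vsx)
{
    muls_dispatch = cpu_has_vsx ? muls_vsx : muls_default;
}

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4];
    muls_pick(1);                  /* pretend the CPU reports VSX support */
    muls_dispatch(y, x, 2.0, 4);   /* y = {2, 4, 6, 8} */
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
    return 0;
}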