Interrupted by signal 11:SIGSEGV

amagooda commented 7 years ago

I got the error "Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)" when trying to reproduce the results on the ukp dataset.

The problem appears while running exp_train_test.py using the arguments "ukp --method rnn-struct --model strict [--dynet-seed=42]"

The console output is as follows:

[dynet] random seed: 3694361057
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
2017-07-18 12:27:07,154 - root - INFO - rnn-struct strict on ukp ({'max_iter': 10, 'mlp_dropout': 0.15})
2017-07-18 12:27:13,659 - root - INFO - Setting node class weights Claim: 1.0, MajorClaim: 1.0, Premise: 1.0
2017-07-18 12:27:13,660 - root - INFO - Setting link class weights False: 1.0, True: 4.725530458590007
2017-07-18 12:27:13,660 - root - INFO - Overriding n_embeds to glove size 300
2017-07-18 12:27:13,671 - root - INFO - Initializing embeddings...
2017-07-18 12:27:13,799 - root - INFO - ...done

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Do you know what could be causing this problem? I am using dynet v1.1.

vene commented 7 years ago

Hi!

I have never seen this particular issue before, but we should be able to get to the bottom of it.

First: this probably has nothing to do with the error, but you should not include the brackets in [--dynet-seed=42]. In the usage string, the brackets are a convention denoting that the argument is optional. As evidence, the first line of output should not say [dynet] random seed: 3694361057 but [dynet] random seed: 42 if the seed is set correctly. Try removing the brackets.
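The seed check described above can be sketched in Python. This is only an illustration: the helper names `reported_seed` and `seed_was_applied` are hypothetical and not part of marseille or dynet, and the regular expression assumes the `[dynet] random seed: N` format shown in the console output earlier in this thread.

```python
import re

# Matches the seed line as it appears in the console output quoted above.
SEED_LINE = re.compile(r"\[dynet\] random seed: (\d+)")


def reported_seed(output):
    """Return the seed dynet printed, or None if no seed line is found."""
    match = SEED_LINE.search(output)
    return int(match.group(1)) if match else None


def seed_was_applied(output, requested_seed):
    """True only if the printed seed equals the one passed on the command line."""
    return reported_seed(output) == requested_seed


if __name__ == "__main__":
    first_line = "[dynet] random seed: 3694361057"
    # With the brackets left in, the requested seed (42) is not the one reported.
    print(seed_was_applied(first_line, 42))
```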

Second, have you tried other configurations, for instance --model=bare, --model=full, or --method=rnn --model=bare? Do those also trigger the issue?

Finally, to pinpoint what triggers the segfault, could you try turning on the Python debugger by adding the line import pdb; pdb.set_trace() to the exp_train_test.py file, and then, when running it, proceeding step-by-step using s until you encounter the error, and then let me know what line the error occurs at?
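As a complement to stepping through with pdb (this is a different technique, not the one suggested above), Python's standard-library `faulthandler` module can dump the Python traceback at the moment a SIGSEGV arrives. A minimal sketch, assuming it is placed near the top of exp_train_test.py; the function name `enable_crash_traceback` and the log file name are hypothetical.

```python
import faulthandler


def enable_crash_traceback(log_path="segfault_traceback.log"):
    """Ask the interpreter to write the Python traceback to log_path on a fatal signal.

    The file object is returned because it must stay open for as long as
    the handler is enabled.
    """
    log_file = open(log_path, "w")
    # all_threads=True dumps the stack of every running thread, not only the crashing one.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_traceback()
    # ... the rest of the script would run here; on SIGSEGV the traceback
    # of the Python frames that led to the crash is written to the log file.
```

The same handler can also be switched on without editing the script by starting the interpreter with `python -X faulthandler`, in which case the traceback goes to stderr.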

The error should come from either ad3, dynet, or (very unlikely) pystruct. Could you also tell me how you installed all these 3 libs?

Thanks!

amagooda commented 7 years ago

So, I ran it using --method=rnn --model=bare and it worked.

I traced the code to find the line that triggers the issue, and I think this is the one (line 500 in argrnn.py):

y_hat, status = self._inference(doc, potentials, relaxed=True, exact=self.exact_inference, constraints=self.constraints)

Regarding the installs:
dynet: I installed it following the manual installation process at "http://dynet.readthedocs.io/en/latest/python.html", after downloading version 1.1 instead of 2.
ad3: I installed this version: "http://www.cs.cmu.edu/~ark/AD3/"
pystruct: I installed it (using either pip or conda) on Anaconda.

vene commented 7 years ago

Thanks, your analysis is great!

Both signs point to the fact that the AD3 inference is the culprit. In particular, --method=rnn --model=bare does not use AD3 inference at all, which is why you don't see the error.

At the moment marseille requires a few changes in the ad3 python wrapper, so the current release from the website you linked does not work. Please uninstall your current version of ad3 and then install the one from my fork here. I am working on making a new release of ad3 more easily available and easier to install. If you are having issues installing the version from my fork, let me know. Thanks!

amagooda commented 7 years ago

I installed the AD3 version you sent me, but I am still facing the same issue when running the "strict" variant.

vene commented 7 years ago

Hmm, maybe there are some issues with your AD3 install. Can you try running the AD3 python examples and the python unit tests?

It might be worth trying to install all the dependencies in a fresh, empty virtualenv to make sure that old versions are not accidentally used.
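One way to check whether an old copy is being picked up is to ask Python where each module would be loaded from. The sketch below assumes the import names `ad3`, `dynet`, and `pystruct`; the helper `locate_modules` is hypothetical and only uses the standard library.

```python
import importlib.util


def locate_modules(names=("ad3", "dynet", "pystruct")):
    """Map each module name to the file it would be imported from (None if not found)."""
    locations = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised modules.
            spec = None
        locations[name] = spec.origin if spec is not None else None
    return locations


if __name__ == "__main__":
    for name, origin in locate_modules().items():
        print(f"{name}: {origin or 'not found'}")
```

If the path printed for `ad3` points outside the active virtualenv, or at the older install, that would suggest the fork is not the copy actually being imported.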

amagooda commented 7 years ago

I made sure that I am using the fresh installation of AD3, then I ran two examples (example.py & example_grid_diversity.py). I also ran the two test files (test_basic.py & test_pystruct.py).

And everything works just fine.

vene commented 7 years ago

Yet the error with Marseille is still there?

This is odd. It would be great if you could still try installing everything in a fresh virtualenv. What OS are you using?

amagooda commented 7 years ago

Linux, Ubuntu

vene commented 7 years ago

That is exactly the same as what I am using, so it is probably not about that. Let me know what the results are in a fresh virtualenv.

BTW, what happens if you use cdcp instead of ukp (but still with rnn-struct strict)? How about the linear-struct strict models?

amagooda commented 7 years ago

I haven't tried cdcp yet, but I did try the linear-struct strict model. It fails too; the output is as follows:

[dynet] random seed: 2656436439
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
2017-07-27 18:44:48,226 - root - INFO - linear-struct strict on ukp ({'C': 0.03})
2017-07-27 18:46:24,845 - root - INFO - Setting node class weights Claim: 1.0, MajorClaim: 1.0, Premise: 1.0
2017-07-27 18:46:24,845 - root - INFO - Setting link class weights False: 1.0, True: 4.801313628899836
2017-07-27 18:46:24,845 - root - INFO - Joint feature size: 29033
Iteration 0

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

vene commented 7 years ago

I just tried making an empty virtualenv and installing all the dependencies from scratch, and I still could not reproduce this problem.

What version of python are you using?

When you stepped through the code via the debugger, did it manage to get through any documents before crashing, or does it crash at the very first call to inference?

In any case, I am working on making AD3 a bit more robust against naked memory accesses, which might help pinpoint what's going on here. I plan to make a new release soon.

vene commented 7 years ago

I just released AD3 v2.1 which can be installed with pip install --upgrade ad3. Would you mind trying again using this release?

vene / marseille

Interrupted by signal 11:SIGSEGV #2
