tensorflow / models

Models and examples built with TensorFlow

[Syntaxnet]Bazel test failure -> ParseyMcParseface unusable due to error type_name: tensorflow::Tensor already registered #2355

Closed AyushiAggarwal closed 7 years ago

AyushiAggarwal commented 7 years ago

System information

Problem:

I have tried installing SyntaxNet to run ParseyMcParseface on CentOS 6/7 and Ubuntu 14.04 LTS and got stuck on the same error (described below) each time. I followed the instructions for manual installation of SyntaxNet given at https://github.com/tensorflow/models/tree/master/syntaxnet and ran the bazel test using the following command:

bazel test --linkopt=-lrt syntaxnet/... util/utf8/...

Console output: Executed 25 out of 25 tests: 19 tests pass and 6 fail locally

Screenshot of tests that failed is given below:

[image: screenshot of the failed tests]

Log File

For each of the failures, the test.log file shows the following error:

F external/org_tensorflow/tensorflow/core/framework/variant_op_registry.cc:79] Check failed: existing == nullptr (0x10a0f30 vs. nullptr)Unary VariantDecodeFn for type_name: tensorflow::Tensor already registered
Aborted

  1. Is there an issue with the installation process described in the README given at https://github.com/tensorflow/models/tree/master/syntaxnet?
  2. It appears that the "tensorflow" type_name is not generated as expected. Could this be an issue with generation of this type_name during the build process? If so, could you provide a fix since this is rendering ParseyMcParseface unusable as of now.
  3. If someone is able to install and successfully run ParseyMcParseface with the current master branch of SyntaxNet or in any other way, please let me know.

Struggled with this for days now and any help/fix would be appreciated. Nothing found on the bazel stackoverflow channel.

charlesjohannisen commented 7 years ago

Got the exact same failures. Found the commit where this change was made to tensorflow. Slight (very temporary and janky) workaround, after you've git cloned models:

cd models/syntaxnet/tensorflow/tensorflow
git checkout 712fcfc6e364e6ca39cee3d988089e51f73d1e65

which is before this commit. Then edit the header:

nano models/syntaxnet/tensorflow/tensorflow/core/platform/default/mutex.h

change this:

#include "nsync_cv.h"
#include "nsync_mu.h"

to this:

#include "../nsync/public/nsync_cv.h"
#include "../nsync/public/nsync_mu.h"

The workaround doesn't solve the problem completely, since the last test still fails for me. But ParseyMcParseface (/opt/tensorflow/syntaxnet/syntaxnet/demo.sh) works, at least for the purposes of my project.
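For anyone who would rather script the header edit than do it by hand in an editor, here is a minimal sketch in Python. It only performs the two include rewrites quoted in the comment above; the function names `patch_mutex_header` and `patch_file` are made up for this illustration, and the path you pass in is whatever your checkout uses.

```python
# Mapping taken from the comment above: bare nsync includes -> relative paths.
INCLUDE_REWRITES = {
    '#include "nsync_cv.h"': '#include "../nsync/public/nsync_cv.h"',
    '#include "nsync_mu.h"': '#include "../nsync/public/nsync_mu.h"',
}


def patch_mutex_header(source):
    """Return the header text with the two nsync includes rewritten."""
    patched_lines = []
    for line in source.splitlines(keepends=True):
        stripped = line.strip()
        if stripped in INCLUDE_REWRITES:
            # Keep whatever line ending the original line had.
            ending = line[len(line.rstrip("\r\n")):]
            line = INCLUDE_REWRITES[stripped] + ending
        patched_lines.append(line)
    return "".join(patched_lines)


def patch_file(path):
    """Rewrite the file in place; return True if anything changed."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    patched = patch_mutex_header(original)
    if patched != original:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(patched)
    return patched != original
```

Running `patch_file` a second time on the same file changes nothing, because the rewritten lines no longer match the bare includes.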

reedwm commented 7 years ago

/CC @ebrevdo @calberti

ebrevdo commented 7 years ago

This is caused by a new feature, the variant op registrar, being called more than once. Perhaps framework/tensor.cc is being linked in or compiled multiple times?

AyushiAggarwal commented 7 years ago

I followed @charlesjohannisen's instructions to set the head to the version before the errant commit (that you referenced above). This seems to have eliminated the error "type_name: tensorflow::Tensor already registered".

However, my bazel test now fails with a new error - ImportError: No module named autograd found.

Log file: [image: screenshot of the log file]

Console output: Executed 25 out of 25 tests: 17 tests pass and 8 fail locally

Screenshot of tests that failed is given below: [image: screenshot of the failed tests]

This is still an ops issue. Further ideas on how this can be handled?

@charlesjohannisen - Can you provide the list of commands that you executed after the fix to the models/syntaxnet/tensorflow/tensorflow/core/platform/default/mutex.h file? This is to compare my installation steps and try to understand the ImportError.

charlesjohannisen commented 7 years ago

This should fix the last issue:

sudo apt install gfortran
sudo python -m pip install autograd

Hopefully that gets you back on track.

AyushiAggarwal commented 7 years ago

@charlesjohannisen Yes, that solved the issue! Thanks!

I also had to install enum and enum34. After that, bazel test ran fine:

Executed 25 out of 25 tests: 25 tests pass

ParseyMcParseface demo.sh works as expected.

Awaiting bug fix and the corresponding modification to the official README.

rekcahd commented 7 years ago

Looks like this fix doesn't work after the restructure of the repo. Will it be fixed?

cernerae commented 7 years ago

Same issue. And like @rekcahd said, the file fix doesn't work after the directory structure was changed, which I guess happened within the last month.

rfeldercyc commented 7 years ago

I am also experiencing this issue.

JohanWu commented 7 years ago

same issue here.

MohammadMoradi commented 7 years ago

Same problem!

raguhari commented 7 years ago

Same problem. Any updates on issue status?

GelRa commented 7 years ago

I managed to build and test without errors. To do this I just commented out CHECK_EQ(existing, nullptr) in: RegisterShapeFn, RegisterDecodeFn, RegisterUnaryOpFn, RegisterBinaryOpFn

Furthermore, I had to install the Python package:

apt-get install graphviz libgraphviz-dev
pip install pygraphviz --install-option="--include-path=/usr/include/graphviz" --install-option="--library-path=/usr/lib/graphviz/"

Strangely, I had to modify syntaxnet/dragnn/python/component.py

In: def build_greedy_training(self, state, network_states):

from:

    with tf.control_dependencies([tf.assert_equal(self.training_beam_size, 1)]):
        stride = state.current_batch_size * self.training_beam_size

to:

    val = tf.Print(self.training_beam_size, [self.training_beam_size], "Fix for access bug. Correct value: ")
    with tf.control_dependencies([tf.assert_equal(val, 1)]):
        stride = state.current_batch_size * self.training_beam_size

Just adding the print command fixes the error. Without the print, the value of self.training_beam_size seems to be 8 but is actually 1. The print convinces the system to use the correct value. Very, very strange.

EDIT 1: Here is the diff file: models_diff.txt

EDIT 2: Just a hint for anyone wanting to really fix the problem with CHECK_EQ():

models/research/syntaxnet/tensorflow/tensorflow/core/framework/tensor.cc statically registers
REGISTER_UNARY_VARIANT_DECODE_FUNCTION(Tensor, "tensorflow::Tensor");

models/research/syntaxnet/tensorflow/tensorflow/core/framework/variant_op_registry.cc statically registers
REGISTER_VARIANT_SHAPE_TYPE(int);
... etc.

Both pieces of code are included in _pywrap_tensorflow_internal.so and, at least partly, in parser_ops.so. During the test, _pywrap_tensorflow_internal.so is initialized by _PyImport_LoadDynamicModule and by tensorflow::LoadLibrary, so the static code is run twice, causing the error. As far as I can see, commenting out the CHECK_EQ does not cause any harm in this case, due to the nature of this registration. I think the solution would be either moving the static code to a different location so that it is not executed twice, or changing the bazel build scripts so that the same code is not included in two different shared libraries.

The problem with syntaxnet/dragnn/python/component.py seems to me to be a serious problem inside the tensorflow core or in the Python-to-C++ connection. Happy to learn of any better explanation.
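The double registration described in EDIT 2 can be modelled in a few lines of Python. This is only a sketch of the mechanism, not TensorFlow's C++ code: the classes `StrictRegistry` and `LenientRegistry` and both helper functions are hypothetical names invented for the illustration. The strict variant behaves like a registry guarded by CHECK_EQ(existing, nullptr), and the lenient variant behaves like the same registry with that check commented out.

```python
class AlreadyRegisteredError(RuntimeError):
    """Raised when a type name is registered a second time."""


class StrictRegistry:
    """Registry that treats a duplicate registration as fatal."""

    def __init__(self):
        self._decoders = {}

    def register(self, type_name, decode_fn):
        if type_name in self._decoders:
            raise AlreadyRegisteredError(
                "decode function for type_name: %s already registered" % type_name)
        self._decoders[type_name] = decode_fn


class LenientRegistry(StrictRegistry):
    """Same registry with the duplicate check removed: the last writer wins."""

    def register(self, type_name, decode_fn):
        self._decoders[type_name] = decode_fn


def run_static_initializers(registry):
    """Stand-in for the static registration code compiled into one shared library."""
    registry.register("tensorflow::Tensor", lambda payload: payload)


def load_both_libraries(registry):
    """Both libraries carry the same static code, so it runs once per library."""
    run_static_initializers(registry)  # e.g. the Python extension module
    run_static_initializers(registry)  # e.g. the custom ops library
    return len(registry._decoders)
```

With `StrictRegistry` the second load raises, which mirrors the aborted tests. With `LenientRegistry` the second registration simply overwrites an identical entry, which is why removing the check appears harmless here as long as both copies register the same function.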

devnullnor commented 7 years ago

With @charlesjohannisen's and @GelRa's solutions, all bazel tests now succeed for me on Ubuntu 16.04.

asimshankar commented 7 years ago

@calberti @andorardo @bogatyy @markomernick - Mind taking a look to make the appropriate fix?

ebrevdo commented 7 years ago

I wonder if we can just remove the tensor decoder registration in TF proper. I'm not sure if it's used for anything other than testing right now. However it would be good to fix this in parsey as well.


bogatyy commented 7 years ago

Thank you for reporting the issue in detail and looking at the possible workarounds. We are working on this.

bogatyy commented 7 years ago

Could you try again? While a longer-term solution is not ready, we have a fix out.

More specifically, we synced the TF subrepo to a version from before the registration mechanism was introduced. Right now, SyntaxNet should build with Bazel 0.5.4 (this is mentioned in the README as well).

SaintNazaire commented 7 years ago

Sorry @bogatyy, but despite your updates I get 25 local failures, so I suggest reopening the issue. (Side note: I am installing with TensorFlow 1.4. Was that your case too?)

On inspection, the main source of errors is the dependency on the autograd Python package:

cat /root/.cache/bazel/_bazel_root/3b4c7ccb85580bc382ce4a52e9580003/execroot/__main__/bazel-out/local-opt/testlogs/syntaxnet/util/resources_test/test.log

from autograd import core as ag_core
ImportError: No module named autograd

Running pip install autograd installs the package successfully, but the test then fails with a different import error:

cat /root/.cache/bazel/_bazel_root/3b4c7ccb85580bc382ce4a52e9580003/execroot/__main__/bazel-out/local-opt/testlogs/syntaxnet/util/resources_test/test.log

from autograd import container_types
ImportError: cannot import name container_types

Would you have any idea when possibly an actual fix would be out?

bogatyy commented 7 years ago

@SaintNazaire you need to install a compatible version of autograd, as explained in the README:

pip install autograd==1.1.13

Let me know if that works (also, again, make sure you have Bazel 0.5.4)
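A quick way to confirm that the pinned version is actually the one installed is sketched below. This is an assumption-laden helper, not something from the README: it uses `importlib.metadata`, which needs Python 3.8 or newer (the environment in this thread appears to be older), and `find_mismatches` and `REQUIRED` are hypothetical names. Only the autograd pin itself comes from the comment above.

```python
from importlib import metadata

# The only pin taken from the thread; extend with your own as needed.
REQUIRED = {"autograd": "1.1.13"}


def find_mismatches(required, get_version=metadata.version):
    """Map each package whose installed version differs to (installed, wanted)."""
    mismatches = {}
    for package, wanted in required.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for package, (installed, wanted) in find_mismatches(REQUIRED).items():
        print("%s: installed %s, expected %s" % (package, installed or "nothing", wanted))
```

An empty result means every pin is satisfied. A missing package shows up with `None` as its installed version.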

SaintNazaire commented 7 years ago

@bogatyy it works, thank you very much.

Fully tested docker file for your reference.

Warning: this requires at least 3840 MB of RAM to build locally on 3 CPUs. For example, on a Windows 10 machine, click the taskbar's hidden icons > right-click the Docker icon > Settings. Building on Docker Hub is limited to 2 GB and will fail.

bogatyy commented 7 years ago

Great to hear, closing the issue then.

qiaohaijun commented 6 years ago

Same problem!

matthewstidham commented 5 years ago

Quoting @GelRa's comment above (commenting out CHECK_EQ(existing, nullptr) in the registration functions, installing pygraphviz, and adding the tf.Print call in syntaxnet/dragnn/python/component.py):

This fixed my problem, thanks!