tensorflow / lingvo

Cannot import py_camera_model_ops from waymo_open_dataset.camera.ops #298

Closed Rtakaha closed 2 years ago

Rtakaha commented 2 years ago

Hi, I am trying to reproduce the DeepFusion results on the Waymo Open Dataset. I fixed the issue mentioned here by downgrading tensorflow from 2.9 to 2.7, but the following command still fails: it crashes with a segmentation fault.

bazel-bin/lingvo/trainer --logtostderr --model=car.waymo_deepfusion.DeepFusionCenterPointPed --mode=sync --logdir=/tmp/deepfusion --run_locally=gpu

I found out that the code fails when it tries to import py_camera_model_ops from waymo_open_dataset.camera.ops.

from waymo_open_dataset.camera.ops import py_camera_model_ops

This problem is also reproducible in the Waymo tutorial with tensorflow==2.7.*; it does not happen with tensorflow==2.6.0.
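A minimal sketch for confirming the crash in isolation, assuming a POSIX system: it runs the import in a child interpreter so that a segmentation fault shows up as a return code instead of killing the calling process. The helper name `run_in_child` is hypothetical and not part of Lingvo or the Waymo Open Dataset package.

```python
import subprocess
import sys


def run_in_child(statement):
    """Run one Python statement in a fresh interpreter.

    Returns (ok, returncode). On POSIX, a negative returncode means the
    child was killed by a signal (for example -11 for SIGSEGV), while a
    returncode of 1 usually means an ordinary uncaught exception such as
    ImportError.
    """
    result = subprocess.run(
        [sys.executable, "-c", statement],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.returncode
```

Calling `run_in_child("from waymo_open_dataset.camera.ops import py_camera_model_ops")` under each tensorflow version separates a hard crash (negative return code) from a plain import error (return code 1).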

I tried building lingvo/trainer with tensorflow==2.6.0, but it fails with the following error:

ERROR: /tmp/lingvo/lingvo/core/ops/BUILD:182:18: Compiling lingvo/core/ops/input_common.cc failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 65 argument(s) skipped)

This issue looked similar, so I tried again after installing tensorstore, but it didn't work (same error).

I use docker/dev.dockerfile for my environment.

Do you know how I can fix this?

Rtakaha commented 2 years ago

Here is the full error message I get when building lingvo/trainer.py with tensorflow==2.6.0.

root@dc6418f0d953:/tmp/lingvo# bazel build -c opt --config=cuda --copt=-D_GLIBCXX_USE_CXX11_ABI=0 //lingvo:trainer                        
DEBUG: Rule 'subpar' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "35bb9f0092f71ea56b742a520602da9b3638a24f", shallow_since = "1557863961 -0400" and dropping ["tag"]
DEBUG: Repository subpar instantiated at:                                                                                                 
  /tmp/lingvo/WORKSPACE:12:15: in <toplevel>                                                                                              
Repository rule git_repository defined at:                                                                                                
  /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/external/bazel_tools/tools/build_defs/repo/git.bzl:199:33: in <toplevel>
INFO: Analyzed target //lingvo:trainer (0 packages loaded, 0 targets configured).                                                         
INFO: Found 1 target...                                                                                                                   
ERROR: /tmp/lingvo/lingvo/core/ops/BUILD:182:18: Compiling lingvo/core/ops/input_common.cc failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 66 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 66 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
In file included from lingvo/core/ops/input_common.cc:16:0:
./lingvo/core/ops/input_common.h:143:55: error: expected class-name before '{' token
 class InputResource : public tensorflow::ResourceBase {
                                                       ^
./lingvo/core/ops/input_common.h: In member function 'void tensorflow::lingvo::InputOpV2Create<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*)':
./lingvo/core/ops/input_common.h:228:25: error: 'MakeRefCountingHandle' is not a member of 'tensorflow::ResourceHandle'
         ResourceHandle::MakeRefCountingHandle(resource, ctx->device()->name(),
                         ^~~~~~~~~~~~~~~~~~~~~
./lingvo/core/ops/input_common.h: In member function 'void tensorflow::lingvo::InputOpV2GetNext<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*)':
./lingvo/core/ops/input_common.h:252:28: error: 'const class tensorflow::ResourceHandle' has no member named 'GetResource'
     auto statusor = handle.GetResource<resource_type>();
                            ^~~~~~~~~~~
./lingvo/core/ops/input_common.h:252:53: error: expected primary-expression before '>' token
     auto statusor = handle.GetResource<resource_type>();
                                                     ^
./lingvo/core/ops/input_common.h:252:55: error: expected primary-expression before ')' token
     auto statusor = handle.GetResource<resource_type>();
                                                       ^
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 8.990s, Critical Path: 8.69s
INFO: 21 processes: 5 internal, 16 processwrapper-sandbox.
FAILED: Build did NOT complete successfully

Rtakaha commented 2 years ago

Error message with --sandbox_debug, --verbose_failures.

root@dc6418f0d953:/tmp/lingvo# bazel build -c opt --config=cuda --copt=-D_GLIBCXX_USE_CXX11_ABI=0 //lingvo:trainer --sandbox_debug --verbose_failures                                                                                                                               
DEBUG: Rule 'subpar' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "35bb9f0092f71ea56b742a520602da9b3638a24f", shallow_since = "1557863961 -0400" and dropping ["tag"]
DEBUG: Repository subpar instantiated at:                                                                                                 
  /tmp/lingvo/WORKSPACE:12:15: in <toplevel>                                                                                              
Repository rule git_repository defined at:                                                                                                
  /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/external/bazel_tools/tools/build_defs/repo/git.bzl:199:33: in <toplevel>
INFO: Analyzed target //lingvo:trainer (0 packages loaded, 0 targets configured).                                                         
INFO: Found 1 target...                                                                                                                   
ERROR: /tmp/lingvo/lingvo/core/ops/BUILD:359:22: Compiling lingvo/core/ops/generic_input_op_kernels.cc failed: (Exit 1): process-wrapper failed: error executing command
  (cd /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/sandbox/processwrapper-sandbox/196/execroot/__main__ && \
  exec env - \
    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 \
    PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    PWD=/proc/self/cwd \
    TMPDIR=/tmp \
  /root/.cache/bazel/_bazel_root/install/1a4a2fac02d50c77031d44c0d91b8920/process-wrapper '--timeout=0' '--kill_delay=15' /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections '-std=c++0x' -MD -MF bazel-out/k8-opt/bin/lingvo/core/ops/_objs/generic_input_op_kernels/generic_input_op_kernels.pic.d '-frandom-seed=bazel-out/k8-opt/bin/lingvo/core/ops/_objs/generic_input_op_kernels/generic_input_op_kernels.pic.o' -fPIC -iquote . -iquote bazel-out/k8-opt/bin -iquote external/tensorflow_includes -iquote bazel-out/k8-opt/bin/external/tensorflow_includes -iquote external/absl_includes -iquote bazel-out/k8-opt/bin/external/absl_includes -iquote external/eigen_archive -iquote bazel-out/k8-opt/bin/external/eigen_archive -iquote external/protobuf_archive -iquote bazel-out/k8-opt/bin/external/protobuf_archive -iquote external/zlib_includes -iquote bazel-out/k8-opt/bin/external/zlib_includes -iquote external/tensorflow_solib -iquote bazel-out/k8-opt/bin/external/tensorflow_solib -isystem external/tensorflow_includes/tensorflow_includes -isystem bazel-out/k8-opt/bin/external/tensorflow_includes/tensorflow_includes -isystem external/absl_includes/absl -isystem bazel-out/k8-opt/bin/external/absl_includes/absl -isystem external/eigen_archive/tf_includes -isystem bazel-out/k8-opt/bin/external/eigen_archive/tf_includes -isystem external/protobuf_archive/tf_includes -isystem bazel-out/k8-opt/bin/external/protobuf_archive/tf_includes -isystem external/zlib_includes/zlib -isystem bazel-out/k8-opt/bin/external/zlib_includes/zlib '-D_GLIBCXX_USE_CXX11_ABI=0' '-D_GLIBCXX_USE_CXX11_ABI=0' '-std=c++14' -Wno-sign-compare -mavx '-DGOOGLE_CUDA=1' -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c lingvo/core/ops/generic_input_op_kernels.cc -o bazel-out/k8-opt/bin/lingvo/core/ops/_objs/generic_input_op_kernels/generic_input_op_kernels.pic.o) process-wrapper failed: error executing command
  (cd /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/sandbox/processwrapper-sandbox/196/execroot/__main__ && \
  exec env - \
    LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 \
    PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    PWD=/proc/self/cwd \
    TMPDIR=/tmp \
  /root/.cache/bazel/_bazel_root/install/1a4a2fac02d50c77031d44c0d91b8920/process-wrapper '--timeout=0' '--kill_delay=15' /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections '-std=c++0x' -MD -MF bazel-out/k8-opt/bin/lingvo/core/ops/_objs/generic_input_op_kernels/generic_input_op_kernels.pic.d '-frandom-seed=bazel-out/k8-opt/bin/lingvo/core/ops/_objs/generic_input_op_kernels/generic_input_op_kernels.pic.o' -fPIC -iquote . -iquote bazel-out/k8-opt/bin -iquote external/tensorflow_includes -iquote bazel-out/k8-opt/bin/external/tensorflow_includes -iquote external/absl_includes -iquote bazel-out/k8-opt/bin/external/absl_includes -iquote external/eigen_archive -iquote bazel-out/k8-opt/bin/external/eigen_archive -iquote external/protobuf_archive -iquote bazel-out/k8-opt/bin/external/protobuf_archive -iquote external/zlib_includes -iquote bazel-out/k8-opt/bin/external/zlib_includes -iquote external/tensorflow_solib -iquote bazel-out/k8-opt/bin/external/tensorflow_solib -isystem external/tensorflow_includes/tensorflow_includes -isystem bazel-out/k8-opt/bin/external/tensorflow_includes/tensorflow_includes -isystem external/absl_includes/absl -isystem bazel-out/k8-opt/bin/external/absl_includes/absl -isystem external/eigen_archive/tf_includes -isystem bazel-out/k8-opt/bin/external/eigen_archive/tf_includes -isystem external/protobuf_archive/tf_includes -isystem bazel-out/k8-opt/bin/external/protobuf_archive/tf_includes -isystem external/zlib_includes/zlib -isystem bazel-out/k8-opt/bin/external/zlib_includes/zlib '-D_GLIBCXX_USE_CXX11_ABI=0' '-D_GLIBCXX_USE_CXX11_ABI=0' '-std=c++14' -Wno-sign-compare -mavx '-DGOOGLE_CUDA=1' -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c lingvo/core/ops/generic_input_op_kernels.cc -o bazel-out/k8-opt/bin/lingvo/core/ops/_objs/generic_input_op_kernels/generic_input_op_kernels.pic.o)
In file included from lingvo/core/ops/generic_input_op_kernels.cc:20:0:
./lingvo/core/ops/input_common.h:143:55: error: expected class-name before '{' token
 class InputResource : public tensorflow::ResourceBase {
                                                       ^
./lingvo/core/ops/input_common.h: In member function 'void tensorflow::lingvo::InputOpV2Create<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*)':
./lingvo/core/ops/input_common.h:228:25: error: 'MakeRefCountingHandle' is not a member of 'tensorflow::ResourceHandle'
         ResourceHandle::MakeRefCountingHandle(resource, ctx->device()->name(),
                         ^~~~~~~~~~~~~~~~~~~~~
./lingvo/core/ops/input_common.h: In member function 'void tensorflow::lingvo::InputOpV2GetNext<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*)':
./lingvo/core/ops/input_common.h:252:28: error: 'const class tensorflow::ResourceHandle' has no member named 'GetResource'
     auto statusor = handle.GetResource<resource_type>();
                            ^~~~~~~~~~~
./lingvo/core/ops/input_common.h:252:53: error: expected primary-expression before '>' token
     auto statusor = handle.GetResource<resource_type>();
                                                     ^
./lingvo/core/ops/input_common.h:252:55: error: expected primary-expression before ')' token
     auto statusor = handle.GetResource<resource_type>();
                                                       ^
./lingvo/core/ops/input_common.h: In instantiation of 'class tensorflow::lingvo::InputResource<tensorflow::lingvo::{anonymous}::GenericInputProcessor>':
./lingvo/core/ops/input_common.h:259:15:   required from 'void tensorflow::lingvo::InputOpV2GetNext<RecordProcessorClass>::Compute(tensorflow::OpKernelContext*) [with RecordProcessorClass = tensorflow::lingvo::{anonymous}::GenericInputProcessor]'
lingvo/core/ops/generic_input_op_kernels.cc:369:1:   required from here
./lingvo/core/ops/input_common.h:165:15: error: 'std::string tensorflow::lingvo::InputResource<RecordProcessorClass>::DebugString() const [with RecordProcessorClass = tensorflow::lingvo::{anonymous}::GenericInputProcessor; std::string = std::basic_string<char>]' marked 'override', but does not override
   std::string DebugString() const override { return "lingvo InputResource"; }
               ^~~~~~~~~~~
./lingvo/core/ops/input_common.h:167:3: error: 'tensorflow::lingvo::InputResource<RecordProcessorClass>::~InputResource() [with RecordProcessorClass = tensorflow::lingvo::{anonymous}::GenericInputProcessor]' marked 'override', but does not override
   ~InputResource() override { delete batcher_; }
   ^
./lingvo/core/ops/input_common.h: At global scope:
./lingvo/core/ops/input_common.h:169:8: warning: 'void tensorflow::lingvo::InputResource<RecordProcessorClass>::GetNext(tensorflow::OpKernelContext*) [with RecordProcessorClass = tensorflow::lingvo::{anonymous}::GenericInputProcessor]' used but never defined
   void GetNext(OpKernelContext* ctx) {
        ^~~~~~~
Target //lingvo:trainer failed to build
INFO: Elapsed time: 11.250s, Critical Path: 11.06s
INFO: 17 processes: 5 internal, 12 processwrapper-sandbox.
FAILED: Build did NOT complete successfully

Rtakaha commented 2 years ago

I built waymo-open-dataset-tf-2-7-0 from source myself, and with it I was able to build lingvo/trainer with tensorflow==2.7.0.

Closing issue.

ref: https://github.com/waymo-research/waymo-open-dataset/issues/548
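Since the fix came down to pairing tensorflow with a Waymo package built for the same version, a quick consistency check may help catch the mismatch before running the trainer. The sketch below assumes that the package name encodes the tensorflow version it targets (as in `waymo-open-dataset-tf-2-7-0`) and that the major.minor versions need to match; both are assumptions drawn from this thread, not documented guarantees. It also assumes tensorflow is installed under the distribution name `tensorflow`. All function names are hypothetical.

```python
from importlib import metadata

# Prefix taken from the package name mentioned in this thread.
WOD_PREFIX = "waymo-open-dataset-tf-"


def tf_version_from_wod_name(dist_name):
    """Return e.g. '2.7.0' for 'waymo-open-dataset-tf-2-7-0', else None."""
    name = dist_name.lower().replace("_", "-")
    if not name.startswith(WOD_PREFIX):
        return None
    return name[len(WOD_PREFIX):].replace("-", ".")


def same_minor(a, b):
    """True when two version strings share the same major.minor prefix."""
    return a.split(".")[:2] == b.split(".")[:2]


def check_installed():
    """Return a list of human-readable mismatch messages (empty if none)."""
    try:
        tf_version = metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        return ["tensorflow is not installed under that distribution name"]
    problems = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or ""
        wod_tf = tf_version_from_wod_name(name)
        if wod_tf is not None and not same_minor(wod_tf, tf_version):
            problems.append(
                f"{name} targets tensorflow {wod_tf}, "
                f"but tensorflow {tf_version} is installed"
            )
    return problems
```

Running `check_installed()` inside the docker/dev.dockerfile environment and printing the returned list shows whether the installed pair lines up under these assumptions.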