protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev
Other
65.24k stars 15.45k forks source link

UBSAN "index out of bounds" for generated C++ code #2073

Closed jeroen closed 4 years ago

jeroen commented 8 years ago

UBSAN shows index out of bounds errors for generated C++ code when setting the repeated fields. Is this a bug or am I using it incorrectly? An example test.proto:

syntax = "proto2";
package mytest;

message MYTEST {
  repeated double realValue = 2;
  repeated sint32 intValue = 3;
}

Compile to C++ using protoc test.proto --cpp_out=. Then main.cc looks like this:

#include "test.pb.h"

int main(){
  mytest::MYTEST x;
  x.add_realvalue(123.123);
  x.add_intvalue(123L);
  return 0;
}

Compile everything with sanitizer flags:

g++ -fsanitize=address,undefined,bounds-strict main.cc test.pb.cc \
  $(pkg-config --cflags --libs protobuf)

And then when I run it:

root@8e50c80421c4:~/test# ./a.out
/usr/include/google/protobuf/repeated_field.h:1289:35: runtime error: index 4 out of bounds for type 'double [1]'
/usr/include/google/protobuf/repeated_field.h:1289:35: runtime error: index 4 out of bounds for type 'int [1]'
/usr/include/google/protobuf/repeated_field.h:282:38: runtime error: index 4 out of bounds for type 'int [1]'
/usr/include/google/protobuf/repeated_field.h:282:38: runtime error: index 4 out of bounds for type 'double [1]'

I am using stock gcc and protobuf from Debian Testing:

root@8e50c80421c4:~/test# g++ --version
g++ (Debian 6.1.1-11) 6.1.1 20160802
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@8e50c80421c4:~/test# protoc --version
libprotoc 3.0.0
xfxyjwf commented 8 years ago

In protobuf we defined the storage of a repeated field as:

  struct Rep {
    Arena* arena;
    Element elements[1];
  }

but actually used Rep* to point to a larger space and may access rep->elements[4]. I guess this triggers the UBSAN check.

No idea how to make UBSAN happy with that but the protobuf code is working as intended.

seishun commented 8 years ago

It seems UBSAN is being overzealous here. As far as I know, there is no section in the standard that explicitly disallows writing out of bounds if there is sufficient storage.

khingblue commented 8 years ago

@seishun I guess that's triggered by bounds-strict which passed to g++.

jwijffels commented 4 years ago

Hello there. I have the same issue with an R wrapper around the sentencepiece C++ library. https://github.com/google/sentencepiece That library also uses protobuf-lite and gives exactly the same UBSAN message (snippet of log at https://www.stats.ox.ac.uk/pub/bdr/memtests/gcc-UBSAN/sentencepiece/ shown below)

third_party/protobuf-lite/google/protobuf/repeated_field.h:1537:35: runtime error: index 1 out of bounds for type 'void *[1]'
    #0 0x7fa7d1e1b6d6 in google::protobuf::RepeatedPtrField<sentencepiece::ModelProto_SentencePiece>::TypeHandler::Type* google::protobuf::internal::RepeatedPtrFieldBase::Add<google::protobuf::RepeatedPtrField<sentencepiece::ModelProto_SentencePiece>::TypeHandler>(google::protobuf::RepeatedPtrField<sentencepiece::ModelProto_SentencePiece>::TypeHandler::Type*) third_party/protobuf-lite/google/protobuf/repeated_field.h:1537
    #1 0x7fa7d1e1b6d6 in google::protobuf::RepeatedPtrField<sentencepiece::ModelProto_SentencePiece>::Add() third_party/protobuf-lite/google/protobuf/repeated_field.h:1977
    #2 0x7fa7d1e1b6d6 in sentencepiece::ModelProto::add_pieces() sentencepiece/src/builtin_pb/sentencepiece_model.pb.h:3390
    #3 0x7fa7d1e1b6d6 in sentencepiece::TrainerInterface::Serialize(sentencepiece::ModelProto*) const sentencepiece/src/trainer_interface.cc:469
    #4 0x7fa7d1e5f72f in sentencepiece::TrainerInterface::SaveModel(absl::string_view) const sentencepiece/src/trainer_interface.cc:509
    #5 0x7fa7d1e6beee in sentencepiece::TrainerInterface::Save() const sentencepiece/src/trainer_interface.cc:547
    #6 0x7fa7d1c31230 in sentencepiece::character::Trainer::Train() sentencepiece/src/char_model_trainer.cc:57
    #7 0x7fa7d1ddb0d4 in sentencepiece::SentencePieceTrainer::Train(sentencepiece::TrainerSpec const&, sentencepiece::NormalizerSpec const&) sentencepiece/src/sentencepiece_trainer.cc:52
    #8 0x7fa7d1de55f7 in sentencepiece::SentencePieceTrainer::Train(sentencepiece::util::min_string_view) sentencepiece/src/sentencepiece_trainer.cc:120
    #9 0x7fa7d21475a9 in spc_train(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) /data/gannet/ripley/R/packages/tests-gcc-SAN/sentencepiece/src/rcpp_sentencepiece.cpp:11

@xfxyjwf / @seishun Is there going to be a fix for these address sanitizer issues?

jwijffels commented 4 years ago

@TeBoring Can we reopen this issue. I'm using protobuf in R package sentencepiece https://cran.r-project.org/package=sentencepiece which wraps https://github.com/google/sentencepiece from a google colleague of yours which uses protobuf. I'm not allowed not put an update of this R package to CRAN because of exactly this Address Sanitizer issue reported here.

acozzette commented 4 years ago

@jwijffels I think you are right that this is undefined behavior. It is called a "flexible array member" and is technically valid in C but not C++. However, I found from researching this that it appears to be a pretty common pattern and is supported by GCC, Clang, and MSVC. I'm not sure that we can justify spending the time to fix this just to be technically compliant with the standard, even though all major compilers already behave the expected way. The feature may end up in a future C++ standard anyway. Would the CRAN owners perhaps be receptive to making an exception if they knew more of the context around this?

jeroen commented 4 years ago

@acozzette the cran maintainers don't accept that. They specifically enforce that all code follows the standards, because it should also run on scientific supercomputers with very unusual architectures and compilers, such as intel icc, solaris studio, AIX, and so on. Therefore undefined behavior is not allowed, even if it still works in gcc/clang.

So if you believe this is valid behavior, it should be fixed in gcc-ubsan.

If you believe fixing this would negatively affect performance, you could condition it in a macro, such that we can compile with -D_STRICT_CXX_STANDARD or something like that so we can use that in the R bindings.

jwijffels commented 4 years ago

@acozzette I already explained cran about the expanding arrays but it was a clear 'no can't do any more'. Is this project already using OSS-Fuzz, this issue would be a great fit. Looking forward to seeing improvements on possible fixes for this address sanitizer issue.

acozzette commented 4 years ago

We now have a fix for this but it is just waiting to be synced from our internal repo to GitHub.

jwijffels commented 4 years ago

🥳

jeroen commented 4 years ago

@acozzette can you point us to the commit, and update this issue if it has been released?

jeroen commented 4 years ago

@acozzette has this landed yet?

acozzette commented 4 years ago

@jeroen This has now landed in commit 95e6c5b4746dd7474d540ce4fb375e3f79a086f8.

jeroen commented 4 years ago

Thanks. Will this only be part of the 4.0 release, or also get fixed in 3.x ?

acozzette commented 4 years ago

We were originally planning to release 4.0 but decided instead to just go with 3.13 for now (which we released at the end of last week). I'm not sure if our next release will be 3.14 or 4.0, but either way it will include this fix.