sakrejda / protostan

Thin protobuf interface wrapper for Stan
BSD 3-Clause "New" or "Revised" License

Binary protobuf writer #37

Closed sakrejda closed 8 years ago

sakrejda commented 8 years ago

To check on how much CPU is chewed up going from Eigen data types to Protocol Buffers in practice, do a simple writer which:

sakrejda commented 8 years ago

@ariddell I added the round-trip test I'm writing to the makefile manually but there'll be quite a few of them. Is there a way to make the makefile automatically do the same test compilation on a bunch of unit test files?

ariddell commented 8 years ago

Probably. make is generally flexible (if inscrutable).

sakrejda commented 8 years ago

Pfft, it looks like using a bespoke method for serializing to file (write a varint32 for the length, then the message, then another varint32, then the next message, and so on) means the same framing has to be duplicated on the interface end too. At least it should be straightforward with CmdStan/ServeStan, but I can't wait till this is merged into protobuf.
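
For anyone reading along, here is a minimal sketch of that framing, assuming the messages are already serialized to byte strings. It does not use the protobuf library, and the names (`append_varint32`, `read_varint32`, `append_delimited`, `split_delimited`) are hypothetical, not part of protostan or protobuf. It only illustrates the "varint32 length, then message" layout that both ends have to agree on.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append `value` as a base-128 varint: 7 bits per byte, low-order group
// first, with the high bit set on every byte except the last.
void append_varint32(std::string& out, std::uint32_t value) {
  while (value >= 0x80) {
    out.push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<char>(value));
}

// Read a varint32 starting at `pos`. On success, stores the result in
// `value`, advances `pos` past the varint, and returns true.
bool read_varint32(const std::string& in, std::size_t& pos,
                   std::uint32_t& value) {
  std::uint32_t result = 0;
  std::size_t p = pos;
  for (int shift = 0; shift < 35; shift += 7) {
    if (p >= in.size()) return false;  // truncated input
    const std::uint8_t byte = static_cast<std::uint8_t>(in[p++]);
    result |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      pos = p;
      value = result;
      return true;
    }
  }
  return false;  // longer than 5 bytes: not a valid varint32
}

// Frame one already-serialized message: length prefix, then payload.
void append_delimited(std::string& out, const std::string& payload) {
  append_varint32(out, static_cast<std::uint32_t>(payload.size()));
  out.append(payload);
}

// Split a buffer back into payloads. Stops at the first truncated or
// malformed frame and returns what was read up to that point.
std::vector<std::string> split_delimited(const std::string& in) {
  std::vector<std::string> payloads;
  std::size_t pos = 0;
  while (pos < in.size()) {
    std::uint32_t len = 0;
    if (!read_varint32(in, pos, len) || len > in.size() - pos) break;
    payloads.emplace_back(in, pos, len);
    pos += len;
  }
  return payloads;
}
```

Calling `append_delimited` once per message and `split_delimited` on the resulting bytes should round-trip the payloads. A real reader would parse each payload as a protobuf message instead of keeping it as a string.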

ariddell commented 8 years ago

We know for a fact that jsonlines will be doable soon. Could we just do raw protobuf stream dumps for now?

sakrejda commented 8 years ago

Yep, running tests now. Nothing makes me complain like writing tests. I do it, but not because I love it.

ariddell commented 8 years ago

Thanks! I'm looking forward to trying things out.

sakrejda commented 8 years ago

@ariddell Btw, in the round-trip (write to file, read from file, compare) tests for std::vector<std::string> and std::vector<double>, I ran the tests with 100k values and the timings are pretty good (all inclusive: generating random strings/doubles, writing to file, reading from file, comparing):

[ RUN      ] binaryProtoStreamWriter.roundTripVectorString
[       OK ] binaryProtoStreamWriter.roundTripVectorString (241 ms)
[ RUN      ] binaryProtoStreamWriter.roundTripVectorDouble
[       OK ] binaryProtoStreamWriter.roundTripVectorDouble (191 ms)
[----------] 7 tests from binaryProtoStreamWriter (433 ms total)

ariddell commented 8 years ago

This is great news.

ariddell commented 8 years ago

Are we going to drop the delimited pb for now (and just wait for jsonlines)?

sakrejda commented 8 years ago

This format is delimited, just not JSON/text. Check out the test for writer(std::vector<double>), which is how Stan writes parameters. The first part is just setting up the test:

TEST(binaryProtoStreamWriter,roundTripVectorDouble) {
  std::random_device random;
  std::mt19937 engine(random());
  std::uniform_real_distribution<> U(0, 1);
  std::vector<double> original_value;
  uint n_doubles = 100000;
  // Unsigned loop index, matching n_doubles and the read loop further down.
  for ( uint i=0; i < n_doubles; ++i ) {
    original_value.push_back(U(engine));
  }
  std::ofstream* ofs = new std::ofstream("/tmp/test-roundTripVectorDouble.pb", std::ios::trunc | std::ios::out | std::ios::binary);

This is where the writer takes ownership of the output file stream:

  stan::interface_callbacks::writer::binary_proto_stream_writer* writer = new stan::interface_callbacks::writer::binary_proto_stream_writer(ofs);                                    
  stan::proto::StanMessage pb;
  int fd;   
  fd = open("/tmp/test-roundTripVectorDouble.pb", O_RDONLY);                              
  google::protobuf::io::FileInputStream* pb_istream = new google::protobuf::io::FileInputStream(fd);
  bool success;                                                                           

This is where the writer actually writes to the file, taking a std::vector<double> as input and reusing a single protobuf message internally:

  (*writer)(original_value);
  delete writer;                                                                          

Below is where the single declared message ("pb") gets reused to read values from the delimited file. The format is one protobuf varint32 giving the size of the following message, followed by that one protobuf message, repeated for each value. "read_delimited_pb" is what a program would call to take the messages into Python or R.

  for ( uint i=0; i < n_doubles; ++i ) {                                                  
    success = stan::proto::read_delimited_pb(&pb, pb_istream);

The rest is just making sure the long format of the vector was read correctly (fixed key of "value", a single index for column, and the actual value).

    EXPECT_EQ(true, success);
    EXPECT_EQ("value", pb.stan_parameter_output().key());
    EXPECT_EQ(i, pb.stan_parameter_output().indexing(0));
    EXPECT_EQ(original_value[i], pb.stan_parameter_output().value());                     
  }                                                                                       
  delete pb_istream;                                                                      
  close(fd);
}       
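
For reference, the "long format" that the assertions above check can be sketched without protobuf. This is only an illustration under assumptions: `LongRecord` and `to_long_format` are hypothetical names standing in for the parameter-output message and the writer, not the actual protostan types.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one parameter-output message:
// a key, the index (or indices) of the entry, and its value.
struct LongRecord {
  std::string key;
  std::vector<std::size_t> indexing;
  double value;
};

// Flatten a vector into one record per element, using the fixed key
// "value" and a single index, as in the round-trip test above.
std::vector<LongRecord> to_long_format(const std::vector<double>& values) {
  std::vector<LongRecord> records;
  records.reserve(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    records.push_back(LongRecord{"value", {i}, values[i]});
  }
  return records;
}
```

So a vector of 100k doubles becomes 100k small records, each of which would be written as its own length-prefixed message.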
sakrejda commented 8 years ago

Obviously I need to write some doc :)

I think it's not worth it to write a whole JSON parser (or even plumb one in), since the JSON output format is coming down the pipe relatively soon. Maybe it would be worth figuring out what utilities they have, but I think this binary format should be pretty easy to attach to CmdStan for output and even to plug into Python for input using the Python protobuf library.