waymo-research / waymo-open-dataset

Waymo Open Dataset
https://www.waymo.com/open

Error using data-conversion/scenario_conversion parsing text-format waymo.open_dataset.Scenario: 1:2: Interpreting non-ascii codepoint 192 #653

Open jxmmy7777 opened 1 year ago

jxmmy7777 commented 1 year ago

I am encountering an error while using the Waymo Open Dataset conversion library to convert a Waymo scenario file to a TensorFlow Example and export it to a tfrecord file. When I run my code (built with bazel), I get the following error:

Error parsing text-format waymo.open_dataset.Scenario: 1:2: Interpreting non-ascii codepoint 192

I suspect the issue might be related to the encoding of the input file. Specifically, I am reading the scenario file in binary mode, but it might be encoded in a non-standard format. Any suggestions would be helpful.

Here's a code snippet that reproduces the error:


```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

#include "waymo_open_dataset/data_conversion/scenario_conversion.h"
#include "waymo_open_dataset/protos/conversion_config.pb.h"

#include "absl/strings/str_cat.h"
#include "tensorflow/core/example/example.pb.h"
#include "tensorflow/core/lib/io/record_writer.h"
#include "tensorflow/core/platform/env.h"

#include "google/protobuf/text_format.h"
#include "google/protobuf/io/zero_copy_stream_impl.h"

int main() {
    // Path to one shard of the motion dataset.
    std::string scenario_file_path = "path/to/waymo_open_dataset_motion_v_1_2_0/training_20s/training_20s.tfrecord-00667-of-01000";
    waymo::open_dataset::Scenario scenario;

    // Open the scenario file in binary mode.
    std::ifstream input(scenario_file_path, std::ios::in | std::ios::binary);
    if (!input) {
        std::cerr << "Failed to open " << scenario_file_path << std::endl;
        return 1;
    }

    // Read the entire file into a string.
    std::stringstream buffer;
    buffer << input.rdbuf();
    std::string contents = buffer.str();

    // Parse the contents as a text-format Scenario proto
    // (this is the call that reports the error above).
    if (!google::protobuf::TextFormat::ParseFromString(
        contents, &scenario)) {
        std::cerr << "Failed to parse " << scenario_file_path << std::endl;
        return 1;
    }

    // Use the default conversion configuration.
    waymo::open_dataset::MotionExampleConversionConfig config;

    // Convert the scenario to a TensorFlow Example.
    std::map<std::string, int> counters;

    absl::StatusOr<tensorflow::Example> status_or_example =
        waymo::open_dataset::ScenarioToExample(scenario, config, &counters);
    if (!status_or_example.ok()) {
        std::cerr << "Failed to convert scenario to Example: "
                << status_or_example.status().message() << std::endl;
        return 1;
    }
    tensorflow::Example example = status_or_example.value();

    // Write the Example to a tfrecord file.
    // Create a new writable file.
    tensorflow::Env* env = tensorflow::Env::Default();
    std::unique_ptr<tensorflow::WritableFile> file;
    std::string file_name = "example.tfrecord";
    env->NewWritableFile(file_name, &file);

    // Create a record writer and write the serialized Example to the file.
    tensorflow::io::RecordWriterOptions options = tensorflow::io::RecordWriterOptions::CreateRecordWriterOptions("");
    tensorflow::io::RecordWriter writer(file.get(), options);
    std::string example_string;
    example.SerializeToString(&example_string);
    writer.WriteRecord(example_string);

    // Close the file and print a success message.
    file->Close();
    std::cout << "Example exported to " << file_name << std::endl;
    return 0;
}
```
scott-ettinger commented 1 year ago

Hi, I think the issue is that the files are stored in the TensorFlow tfrecord format, while your code reads the whole file into a string and parses it as a text-format proto. You will need to read each scenario as a single record from the tfrecord input files, then process each one and write the results back out as your code currently does. I have not used it, but I think the TensorFlow RecordReader might be what you need.

Note that if you are not modifying the default configuration (or the conversion code), we provide the converted data (using the default configuration) already in the open dataset repository.

Please let me know if you have further questions.