google-research-datasets / richhf-18k

The RichHF-18K dataset contains the rich human feedback labels we collected for our CVPR'24 paper: https://arxiv.org/pdf/2312.10240, along with the file names of the associated labeled images (no URLs or images are included in this dataset).

Convert to human readable format #6

Open caiqi opened 3 months ago

caiqi commented 3 months ago

For anyone interested, here is a simple snippet to convert the TFRecord file to JSON format:


import base64
import json

import tensorflow as tf

file_path = "dev.tfrecord"

def parse_tfrecord(record):
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    return example

def read_tfrecord_file(file_path):
    raw_dataset = tf.data.TFRecordDataset(file_path)
    parsed_records = []

    for raw_record in raw_dataset:
        example = parse_tfrecord(raw_record)
        record = {}
        for key, value in example.features.feature.items():
            # Only the first value of each feature is kept; features with
            # no values are skipped.
            if value.bytes_list.value:
                try:
                    # Try to decode as UTF-8 string
                    record[key] = value.bytes_list.value[0].decode('utf-8')
                except UnicodeDecodeError:
                    # If decoding fails, base64-encode the raw bytes so they
                    # can be written to JSON
                    record[key] = base64.b64encode(value.bytes_list.value[0]).decode('utf-8')
            elif value.float_list.value:
                record[key] = value.float_list.value[0]
            elif value.int64_list.value:
                record[key] = value.int64_list.value[0]
        parsed_records.append(record)

    return parsed_records

records = read_tfrecord_file(file_path)
json_records = json.dumps(records, indent=4)

with open('output.json', 'w') as json_file:
    json_file.write(json_records)

print("TFRecord has been converted to JSON and saved as output.json")

leebird commented 3 months ago

Thanks for providing the code! We have added a simple script showing how to retrieve the labels from the dataset at https://github.com/google-research/google-research/blob/master/richhf_18k/parse_tfrecord_file.py. It can be used together with this snippet to convert the dataset to JSON or other formats.
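
As one illustration of "other formats", here is a minimal sketch that writes the parsed records as JSON Lines (one JSON object per line), which can be easier to stream than a single large JSON array. It assumes `records` is a list of plain dicts like the one returned by `read_tfrecord_file` in the snippet above. The helper names `write_jsonl` and `read_jsonl` are hypothetical and not part of either script.

```python
import json


def write_jsonl(records, path):
    """Write each record (a JSON-serializable dict) as one line of JSON."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With the snippet above, this could be called as `write_jsonl(read_tfrecord_file("dev.tfrecord"), "output.jsonl")`. Fields that were base64-encoded by the fallback branch stay base64 strings in the output.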