Closed NielsRogge closed 4 years ago
The data creation looks okay.
It's important to point out that only the files in the tf_examples directory are in TF example format. The files in the interaction directory are also TF records but they hold serialized interaction protos.
Which files are you trying to open?
I am wondering whether it's a TF 1 / 2 issue.
Can you try this:
```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

def iterate_examples(filepath):
    for value in tf.python_io.tf_record_iterator(filepath):
        i = tf.train.Example()
        i.ParseFromString(value)
        yield i
```
Actually, never mind, both code snippets should work fine.
TF examples are compressed with GZIP by default when using `run_task_main.py`:
```python
flags.DEFINE_string(
    'compression_type',
    'GZIP',
    "Compression to use when reading tfrecords. '' for no compression.",
)
```
I think you need to specify this when reading the data as well:
```python
tf.data.TFRecordDataset(
    filenames, compression_type="GZIP",
)
```
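(For context on why the two settings have to match: GZIP-compressed tfrecord files are ordinary gzip streams, so reading them without decompression yields bytes that cannot be parsed as records. A minimal stdlib sketch of the idea, where the payload is a made-up stand-in rather than a real serialized example:)

```python
import gzip

# Stand-in bytes; a real record would be a serialized tf.Example.
payload = b"serialized tf.Example bytes"

compressed = gzip.compress(payload)

# Reading the compressed stream as-is does not give back the record...
assert compressed != payload
# ...but decompressing (what compression_type="GZIP" does) recovers it.
assert gzip.decompress(compressed) == payload
```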
Awesome, setting the `compression_type` parameter to "GZIP" works and lets me read in the data.
I ran the data creation again without compression, because I'm using a package to read in tfrecords as PyTorch datasets, which currently does not support compression types. What it basically does is look at each of the features and yield a dictionary of keys (feature names) and values (NumPy arrays):
```python
description = {
    'aggregation_function_id': 'int',
    'answer': 'float',
    'classification_class_index': 'int',
    'column_ids': 'int',
    'column_ranks': 'int',
    'input_ids': 'int',
    'input_mask': 'int',
    'inv_column_ranks': 'int',
    'label_ids': 'int',
    'numeric_relations': 'int',
    'numeric_values': 'float',
    'numeric_values_scale': 'float',
    'prev_label_ids': 'int',
    'question_id': 'byte',
    'question_id_ints': 'int',
    'question_numeric_values': 'float',
    'row_ids': 'int',
    'segment_ids': 'int',
    'table_id': 'byte',
    'table_id_hash': 'int',
}
```
```python
features = {}
for key, typename in description.items():
    if key not in all_keys:
        raise KeyError(f"Key {key} doesn't exist (select from {all_keys})!")
    # NOTE: We assume that each key in the example has only one field
    # (either "bytes_list", "float_list", or "int64_list")!
    field = example.features.feature[key].ListFields()[0]
    inferred_typename, value = field[0].name, field[1].value
    if typename is not None:
        tf_typename = typename_mapping[typename]
        if tf_typename != inferred_typename:
            reversed_mapping = {v: k for k, v in typename_mapping.items()}
            raise TypeError(f"Incompatible type '{typename}' for `{key}` "
                            f"(should be '{reversed_mapping[inferred_typename]}').")
    # Decode raw bytes into respective data types
    if inferred_typename == "bytes_list":
        value = np.frombuffer(value[0], dtype=np.uint8)
    elif inferred_typename == "float_list":
        value = np.array(value, dtype=np.float32)
    elif inferred_typename == "int64_list":
        value = np.array(value, dtype=np.int32)
    features[key] = value
yield features
```
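(For what it's worth, the `bytes_list` branch in the snippet above returns the raw bytes as a uint8 array; round-tripping one back to a string, e.g. to inspect a `question_id`, is a matter of `.tobytes()`. A small sketch with a made-up id:)

```python
import numpy as np

raw = b"sqa-0-0_0"  # hypothetical question_id value

# What the bytes_list branch produces: a uint8 view of the raw bytes.
as_uint8 = np.frombuffer(raw, dtype=np.uint8)

# Recover the original string for inspection.
recovered = as_uint8.tobytes().decode("utf-8")
```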
However, for some reason, reading in the test set results in an overflow error:
```
    168     value = np.array(value, dtype=np.float32)
    169 elif inferred_typename == "int64_list":
--> 170     value = np.array(value, dtype=np.int32)
    171 features[key] = value
    172

OverflowError: Python int too large to convert to C long
```
This might be a Windows-specific issue, so first I'll try it out in Google Colab.
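(In case it helps: the `int64_list` features can hold values, e.g. the 64-bit `table_id_hash`, that do not fit in int32, and on Windows `np.array(..., dtype=np.int32)` presumably converts through a 32-bit C long, hence the OverflowError. Decoding with the feature's native 64-bit width would sidestep it; a minimal sketch with a made-up value:)

```python
import numpy as np

# Hypothetical int64 feature values, one beyond the int32 range.
value = [2**40, 7]

# dtype=np.int32 can overflow here on Windows; int64 matches the
# tfrecord feature's native width and always fits.
decoded = np.array(value, dtype=np.int64)
```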
Thank you for your help!
I ran the following command to create tfrecords from the SQA TSV files (I'm on Windows, Python version 3.6.4, installed the protobuf compiler and tapas package as explained in your README):
This printed the following:
This resulted in 2 directories being created in the "output" directory, namely "interactions" and "tf_examples". In the "tf_examples" directory, only the first random split of training + dev seems to be created:
However, parsing these tfrecord files as strings (as explained in the TensorFlow docs) results in an error:
Am I doing something wrong here?