Closed ericyue closed 6 years ago
I fixed it by replacing https://github.com/tensorflow/models/blob/master/textsum/data.py#L93-L99 with:
    for example_str in tensorflow.python_io.tf_record_iterator(f):
        yield example_pb2.Example.FromString(example_str)
I wrote some code for generating and checking the data: https://github.com/dsindex/textsum
$ python generate_data.py --input_dir=sample --data_path=sample-0
$ python check_data.py --data_path=sample-0 --crc=4
But there is a problem converting example_pb2.Example to JSON.
features {
  feature {
    key: "abstract"
    value {
      bytes_list {
        value: "a"
      }
    }
  }
  feature {
    key: "article"
    value {
      bytes_list {
        value: "a"
      }
    }
  }
}
{
  "features": {
    "feature": {
      "article": {
        "bytesList": {
          "value": [
            "YQ=="   <----- decoding ?
          ]
        }
      },
      "abstract": {
        "bytesList": {
          "value": [
            "YQ=="
          ]
        }
      }
    }
  }
}
Does anybody know how to convert the binary-like representation to a string?
@udibr yes, I did it this way too.
@dsindex the JSON value strings are base64-encoded, so you need to decode them.
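For anyone landing here later, a minimal sketch of that decoding step in Python, using only the standard library. The JSON literal copies the shape posted above; the helper name decode_bytes_lists is made up for illustration, and the UTF-8 decoding is an assumption about the stored bytes.

```python
import base64
import json

# JSON in the same shape as the output posted above in this thread.
json_text = """
{
  "features": {
    "feature": {
      "article": {"bytesList": {"value": ["YQ=="]}},
      "abstract": {"bytesList": {"value": ["YQ=="]}}
    }
  }
}
"""


def decode_bytes_lists(record):
    """Hypothetical helper: map each feature key to its decoded strings."""
    decoded = {}
    for key, feature in record["features"]["feature"].items():
        values = feature.get("bytesList", {}).get("value", [])
        # Each value is base64 text; decode to bytes, then to str.
        # Assumption: the original bytes were UTF-8 encoded text.
        decoded[key] = [base64.b64decode(v).decode("utf-8") for v in values]
    return decoded


decoded = decode_bytes_lists(json.loads(json_text))
print(decoded)
```

If the bytes are not valid UTF-8, keeping the result of base64.b64decode as raw bytes is probably the safer choice.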
@jamcar23 thank you so much! :)
@ericyue
I couldn't find a formal implementation of a RecordIO reader/writer when open-sourcing this model, so I updated the reader code to use a simple customized format. I also forgot to update the function comments. I will update them soon.
See https://github.com/tensorflow/models/pull/379/files for examples of making training data for the model.
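A rough sketch of what such a simple length-prefixed format can look like, assuming the layout described in this thread: an 8-byte length followed by the message body, with no CRC. The little-endian signed 64-bit length and the function names write_record and read_records are assumptions for illustration, not the repository's actual code; check data.py for the exact byte order it uses.

```python
import io
import struct

# Assumption: 8-byte little-endian signed length, no CRC field.
LENGTH_FORMAT = "<q"
LENGTH_SIZE = struct.calcsize(LENGTH_FORMAT)  # 8 bytes


def write_record(stream, payload):
    """Write one record: 8-byte length, then the raw payload bytes."""
    stream.write(struct.pack(LENGTH_FORMAT, len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield payloads until the stream is exhausted."""
    while True:
        header = stream.read(LENGTH_SIZE)
        if not header:
            return  # clean end of file
        if len(header) != LENGTH_SIZE:
            raise ValueError("truncated length header")
        (length,) = struct.unpack(LENGTH_FORMAT, header)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record body")
        yield payload


# Round trip through an in-memory buffer instead of a real file.
buffer = io.BytesIO()
for item in [b"first", b"second record"]:
    write_record(buffer, item)
buffer.seek(0)
records = list(read_records(buffer))
print(records)
```

In the real reader each payload would be a serialized message, so the last step would be something like example_pb2.Example.FromString(payload), as in the fix quoted earlier in the thread.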
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
Please let us know which model this issue is about (specify the top-level directory)
A standard RecordIO record has 8 bytes for the message length, then 4 bytes for the CRC, followed by the message body. But the toy dataset you provide uses only 8 bytes for the length and 0 bytes for the CRC.
Why?