@xtr33me
Note that data_convert_example.py simply converts your text file into a binary version of the tensorflow Example protocol buffer, which is defined here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto and https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto
Basically we end up with a map from feature name to value. In your example above, the feature names are article, abstract and publisher, and the feature values are the corresponding parts after the equals sign "=". Since this is a map, the ordering of entries doesn't matter.
Does the different ordering actually break anything?
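To illustrate the point about ordering, here is a minimal sketch in plain Python. It is not the code from data_convert_example.py: parse_features is a hypothetical helper that assumes the only field names are article, abstract and publisher, and it uses a dict as a stand-in for the Example feature map. Both orderings of the entry from this thread parse to the same mapping.

```python
import re

# Hypothetical helper, not part of data_convert_example.py.
# Assumes the only field names are article, abstract and publisher.
FIELD_RE = re.compile(r"(article|abstract|publisher)\s*=\s*")

def parse_features(line):
    """Split a line of name=value fields into a dict keyed by feature name."""
    parts = FIELD_RE.split(line.strip())
    # With one capture group, re.split returns
    # [prefix, name1, value1, name2, value2, ...].
    names = parts[1::2]
    values = [value.strip() for value in parts[2::2]]
    return dict(zip(names, values))

# Entry as written in the starting text file.
original = (
    "article = <d> <p> <s> Heres a collection of some of the best "
    "strawberry-eaters in action. </s> </d> </p> "
    "abstract = <d> <p> <s> Turtles Love Strawberries </s> </d> </p> "
    "publisher=BUZZ"
)

# Same entry after the text -> binary -> text round trip.
round_tripped = (
    "publisher=BUZZ "
    "article = <d> <p> <s> Heres a collection of some of the best "
    "strawberry-eaters in action. </s> </d> </p> "
    "abstract = <d> <p> <s> Turtles Love Strawberries </s> </d> </p>"
)

same_mapping = parse_features(original) == parse_features(round_tripped)
print(same_mapping)
```

Since the comparison is between mappings, the position of publisher in the line has no effect on the result.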
@tatatodd First off, thanks for the explanation! This now makes more sense to me. I believe I misunderstood what needed to occur. I used the data_convert_example.py _binary_to_text function to process the "data" file included in the textSum toy data, and then I wrote my data formatter to match that output, which is why you see the actions I took above. When I tried to train against the data, I received the error below. I was leaning towards the issue being that my binary data was not formatted correctly, but now I see that it is more likely that my text input files did not have the correct structure.
Truly thanks again for the help! I need to go back and look at the links you provided a bit more to see what I now need to do, but your explanation was great!
File "/home/daniel/Documents/Projects/headgen/textsum/batch_reader.py", line 263, in _GetExFeatureText
    return ex.features.feature[key].bytes_list.value[0]
File "/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/containers.py", line 204, in __getitem__
    return self._values[key]
IndexError: list index out of range
Closing this issue as I now realize my input data was incorrect and this had nothing to do with the converter.
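For anyone who lands here with the same traceback: an IndexError on value[0] means the value list for that feature was empty. One plausible cause, consistent with the "input data was incorrect" conclusion above, is that the requested key was never written to the Example; as far as I understand the Python protobuf map API, reading a missing key from a map of messages yields an empty default rather than a KeyError. The sketch below only mimics that behaviour with a defaultdict. It does not use protobuf or the textsum code, and make_features, get_feature_text and get_feature_text_checked are hypothetical names.

```python
from collections import defaultdict

def make_features(pairs):
    """Stand-in for ex.features.feature (assumption: a missing key reads as
    an empty value list, similar to a protobuf map of messages)."""
    features = defaultdict(list)
    for name, value in pairs.items():
        features[name].append(value)
    return features

def get_feature_text(features, key):
    # Same shape as the failing line in the traceback: take element 0
    # of the value list without checking that it exists.
    return features[key][0]

def get_feature_text_checked(features, key):
    # Check first so that a missing or empty feature gives a clear message.
    values = features[key] if key in features else []
    if not values:
        raise ValueError("feature %r is missing or empty" % key)
    return values[0]

# An entry that was written without an abstract feature.
features = make_features({"article": "some article text", "publisher": "BUZZ"})

try:
    get_feature_text(features, "abstract")
except IndexError as err:
    print("unchecked lookup failed:", err)

try:
    get_feature_text_checked(features, "abstract")
except ValueError as err:
    print("checked lookup failed:", err)
```

A check like this on the input side can make it easier to tell which feature name is missing before training starts.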
Has anyone else seen this issue? I am currently trying to get binary training/test data formatted for the TextSum model using the referenced data_convert_example.py. I have been able to get my data working with data_convert_example.py from the formatting perspective; however, when I run it through textToBinary, it seems to take the publisher entry and put it at the beginning.
So my starting text file contains an entry like the one below:
article = <d> <p> <s> Heres a collection of some of the best strawberry-eaters in action. </s> </d> </p> abstract = <d> <p> <s> Turtles Love Strawberries </s> </d> </p> publisher=BUZZ
Then I run textToBinary on it. I won't paste that here as I'm sure the formatting would be all messed up, but the order is already reversed at that point. Then when I run it through BinaryToText to convert it back to the original, I get the result below.
publisher=BUZZ article = <d> <p> <s> Heres a collection of some of the best strawberry-eaters in action. </s> </d> </p> abstract = <d> <p> <s> Turtles Love Strawberries </s> </d> </p>
A snippet of what I am using to convert the article data into the formatted data is this: