tensorflow / models

Models and examples built with TensorFlow

## TextSum - Binary data reversed when using data_convert_example.py #504

Closed xtr33me closed 8 years ago

xtr33me commented 8 years ago

Has anyone else seen this issue? I am trying to get binary training/test data formatted for the TextSum model using the referenced data_convert_example.py. My data works with data_convert_example.py from a formatting perspective; however, when I run it through the textToBinary, it seems to take the publisher entry and put it at the beginning.

My starting text file contains an entry like the one below:

article = <d> <p> <s> Heres a collection of some of the best strawberry-eaters in action. </s> </d> </p> abstract = <d> <p> <s> Turtles Love Strawberries </s> </d> </p> publisher=BUZZ

Then I run the textToBinary on it. I won't paste that here as I'm sure the formatting would be all messed up, but the entry is already reversed at that point. Then, when I run it through the BinaryToText to convert it back to the original, I get the result below:

publisher=BUZZ article = <d> <p> <s> Heres a collection of some of the best strawberry-eaters in action. </s> </d> </p> abstract = <d> <p> <s> Turtles Love Strawberries </s> </d> </p>

A snippet of what I am using to convert the article data into the formatted data is this:

sentences = sent_tokenize((content[1]).encode('ascii', "ignore").strip('\n'))
for sent in sentences:
    textSumFmt = self.textsumFmt
    finalRes = textSumFmt["artPref"] + textSumFmt["sentPref"] + sent.replace("=", "equals") + textSumFmt["sentPost"] + textSumFmt["postVal"]
finalRes += ('\t' + textSumFmt["absPref"] + textSumFmt["sentPref"] + (content[0]).strip('\n').replace("=", "equals") + textSumFmt["sentPost"] + textSumFmt["postVal"]) + '\t' + 'publisher=BUZZ' + os.linesep

tatatodd commented 8 years ago

@xtr33me

Note that data_convert_example.py simply converts your text file into a binary version of the TensorFlow Example protocol buffer, which is defined here:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto

Basically, we end up with a map from feature name to value. In your example above, the feature names are article, abstract and publisher, and the feature values are the corresponding parts after the equals sign "=". Since this is a map, the ordering of entries doesn't matter, so the entries may come back in a different order than they were written.

Does the different ordering actually break anything?
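The point about map ordering can be illustrated with a minimal sketch in plain Python. A dict stands in for the protocol buffer's feature map here. The helper name `parse_line`, the tab separator, and the sample strings are assumptions made for illustration, not code taken from data_convert_example.py.

```python
def parse_line(line, sep="\t"):
    """Split one text record into a feature-name -> value mapping.

    Hypothetical helper: assumes fields are separated by `sep` and that
    each field has the form key=value.
    """
    features = {}
    for field in line.strip().split(sep):
        key, value = field.split("=", 1)
        features[key.strip()] = value.strip()
    return features


# Two records with the same fields in a different order (made-up sample data).
original = (
    "article=<d> <p> <s> Turtles eat strawberries. </s> </p> </d>\t"
    "abstract=<d> <p> <s> Turtles Love Strawberries </s> </p> </d>\t"
    "publisher=BUZZ"
)
reordered = (
    "publisher=BUZZ\t"
    "article=<d> <p> <s> Turtles eat strawberries. </s> </p> </d>\t"
    "abstract=<d> <p> <s> Turtles Love Strawberries </s> </p> </d>"
)

# Lookups are by name, so field order has no effect on the result.
same = parse_line(original) == parse_line(reordered)
print(same)  # True
```

A consumer that looks features up by name gets the same value whichever field was written first.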

xtr33me commented 8 years ago

@tatatodd First off, thanks for the explanation! This now makes more sense to me. I believe I misunderstood what needed to occur. I used the data_convert_example.py _binary_to_text function to process the "data" file included in the textSum toy data, and then I wrote my data formatter to match that output, which is why you see the actions I took above. When I tried to train against the data, I received the error below. I had been leaning towards the issue being that my binary data was not formatted correctly, but now I see that it is more likely that my text input files do not have the correct structure.

Truly thanks again for the help! I need to go back and look at the links you provided a bit more to see what I now need to do, but your explanation was great!

File "/home/daniel/Documents/Projects/headgen/textsum/batch_reader.py", line 263, in _GetExFeatureText
    return ex.features.feature[key].bytes_list.value[0]
File "/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/containers.py", line 204, in __getitem__
    return self._values[key]
IndexError: list index out of range

xtr33me commented 8 years ago

Closing this issue as I now realize my input data was incorrect and this had nothing to do with the converter.
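For anyone hitting the same traceback: one way an IndexError like the one above can arise is when the feature being read has no values, for example because its key never made it into the record. Below is a minimal sketch of a check that could be run on the text file before conversion. The helper name `missing_features`, the tab separator, and the required key names `article` and `abstract` are assumptions made for illustration, not something confirmed in this thread.

```python
REQUIRED_KEYS = ("article", "abstract")  # assumed feature names


def missing_features(line, sep="\t", required=REQUIRED_KEYS):
    """Return the required feature names that a text record does not provide.

    Hypothetical helper: assumes fields are separated by `sep` and that
    each field has the form key=value with a non-empty value.
    """
    present = set()
    for field in line.strip().split(sep):
        key, eq, value = field.partition("=")
        if eq and value.strip():
            present.add(key.strip())
    return [name for name in required if name not in present]


# Made-up sample records.
tab_separated = "article=<d> <p> <s> a . </s> </p> </d>\tabstract=<d> <p> <s> b </s> </p> </d>"
space_separated = "article=<d> <p> <s> a . </s> </p> </d> abstract=<d> <p> <s> b </s> </p> </d>"

print(missing_features(tab_separated))    # []
print(missing_features(space_separated))  # ['abstract']
```

In the second record the fields are separated by a space rather than a tab, so the whole line is read as a single article field and no abstract is found.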