tensorflow / models

textsum: dataset format maybe wrong? #357

Closed ericyue closed 6 years ago

ericyue commented 8 years ago

A standard recordio record starts with 8 bytes for the message length, then 4 bytes for the CRC, followed by the message body. But the toy dataset you provide uses only 8 bytes for the length and 0 bytes for the CRC.

Why?

udibr commented 8 years ago

I fixed it by replacing https://github.com/tensorflow/models/blob/master/textsum/data.py#L93-L99 with

      for example_str in tensorflow.python_io.tf_record_iterator(f):
        yield example_pb2.Example.FromString(example_str)

dsindex commented 8 years ago

I wrote some code for generating and checking the data: https://github.com/dsindex/textsum

$ python generate_data.py --input_dir=sample --data_path=sample-0
$ python check_data.py --data_path=sample-0 --crc=4

But there is a problem when converting example_pb2.Example to JSON.

features {
  feature {
    key: "abstract"
    value {
      bytes_list {
        value: "a"
      }
    }
  }
  feature {
    key: "article"
    value {
      bytes_list {
        value: "a"
      }
    }
  }
}

{
  "features": {
    "feature": {
      "article": {
        "bytesList": {
          "value": [
            "YQ=="   <----- decoding ?
          ]
        }
      },
      "abstract": {
        "bytesList": {
          "value": [
            "YQ=="   
          ]
        }
      }
    }
  }
}

Does anybody know how to convert this binary-like representation to a string?

ericyue commented 8 years ago

@udibr Yes, I did it this way too.

jamcar23 commented 8 years ago

@dsindex The JSON value strings are base64-encoded, so you need to decode them.
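
For illustration, here is a minimal sketch of that decoding step using only the Python standard library. The helper name `decode_bytes_lists` is hypothetical, and the sketch assumes the decoded bytes are UTF-8 text, which holds for the `"a"` values in the example above but may not hold for arbitrary data.

```python
import base64
import json

# JSON shaped like the output shown earlier in the thread; bytes fields
# appear as base64 strings in protobuf's JSON representation.
example_json = """
{
  "features": {
    "feature": {
      "article": {"bytesList": {"value": ["YQ=="]}},
      "abstract": {"bytesList": {"value": ["YQ=="]}}
    }
  }
}
"""


def decode_bytes_lists(example_dict):
    """Return {feature_key: [decoded_text, ...]} for each bytesList feature.

    Hypothetical helper; features without a bytesList yield an empty list.
    """
    decoded = {}
    for key, feature in example_dict["features"]["feature"].items():
        values = feature.get("bytesList", {}).get("value", [])
        # Assumption: the raw bytes are UTF-8 encoded text.
        decoded[key] = [base64.b64decode(v).decode("utf-8") for v in values]
    return decoded


result = decode_bytes_lists(json.loads(example_json))
print(result)
```

If the feature holds non-text bytes, dropping the `.decode("utf-8")` call keeps the values as raw `bytes`.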

dsindex commented 8 years ago

@jamcar23 thank you so much! :)

panyx0718 commented 8 years ago

@ericyue

I didn't find a formal implementation of a recordio reader/writer when open-sourcing this model, so I updated the reader code to use a simple custom format. I forgot to update the function comments as well; I will update them soon.
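
As a rough sketch of what such a length-prefixed format can look like (8 bytes of length followed by the message body, with no CRC, as described at the top of this thread), the following reads and writes records on an in-memory stream. The function names and the little-endian signed 64-bit length encoding are assumptions for illustration; the byte order used by the actual textsum code is not stated in this thread.

```python
import io
import struct

# Assumption: the length is a signed 64-bit little-endian integer.
LEN_FORMAT = "<q"
LEN_SIZE = struct.calcsize(LEN_FORMAT)  # 8 bytes


def write_record(stream, payload):
    """Write one record: 8-byte length prefix, then the raw payload."""
    stream.write(struct.pack(LEN_FORMAT, len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield payloads until the stream is exhausted; no CRC is checked."""
    while True:
        len_bytes = stream.read(LEN_SIZE)
        if not len_bytes:
            break  # clean end of stream
        if len(len_bytes) < LEN_SIZE:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(LEN_FORMAT, len_bytes)
        if length < 0:
            raise ValueError("negative record length")
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record body")
        yield payload


buf = io.BytesIO()
for msg in (b"first", b"second record"):
    write_record(buf, msg)
buf.seek(0)
records = list(read_records(buf))
print(records)
```

In the model's reader, each payload would then be passed to `example_pb2.Example.FromString`, as in the snippet posted earlier in this thread.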

panyx0718 commented 8 years ago

See https://github.com/tensorflow/models/pull/379/files for examples of making training data for the model.

gunan commented 6 years ago

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!