google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Corrupted SQA tfrecords #52

Closed by NielsRogge 4 years ago

NielsRogge commented 4 years ago

I ran the following command to create tfrecords from the SQA TSV files (I'm on Windows, Python version 3.6.4, installed the protobuf compiler and tapas package as explained in your README):

python tapas/run_task_main.py \
  --task="SQA" \
  --input_dir="./SQA release 1.0" \
  --output_dir="./output" \
  --bert_vocab_file="vocab.txt" \
  --mode="create_data"

This printed the following:

WARNING:tensorflow:From C:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tapas\scripts\prediction_utils.py:41: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
W0813 12:00:32.413115  7056 deprecation.py:323] From C:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tapas\scripts\prediction_utils.py:41: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
I0813 12:00:32.430038  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Title"
) {0: [], 1: [float_value: 1.0
, float_value: 2.0
], 2: [float_value: 1.0
, float_value: 1945.0
, date {
  year: 1945
}
], 3: [float_value: 2.0
], 4: [float_value: 3.0
], 5: [], 6: [], 7: []} 4
I0813 12:00:37.874481  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Album"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [float_value: 5.0
], 6: [], 7: [], 8: [], 9: [float_value: 91.0
]} 2
I0813 12:00:43.282050  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Competition"
) {0: [float_value: 4.0
], 1: [], 2: [], 3: [], 4: [float_value: 1.0
], 5: [float_value: 2.0
], 6: [], 7: [float_value: 2.0
], 8: [float_value: 2.0
], 9: [], 10: [float_value: 2.0
], 11: [], 12: [], 13: []} 6
I0813 12:00:48.533008  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Roma"
) {0: [float_value: 6.5
], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [float_value: 2.5999999046325684
], 7: [], 8: [float_value: 6.0
], 9: [float_value: 1.7000000476837158
], 10: [float_value: 4.900000095367432
], 11: [float_value: 11.600000381469727
], 12: [float_value: 13.600000381469727
], 13: [float_value: 17.899999618530273
]} 8
I0813 12:00:54.263187  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Position"
) {0: [float_value: 1.0
], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [float_value: 3.0
], 8: [float_value: 3.0
], 9: [float_value: 1.0
], 10: [], 11: [], 12: [], 13: [], 14: [float_value: 1.0
], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: [], 25: [], 26: [], 27: [], 28: [], 29: []} 5
I0813 12:01:01.409110  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Notes"
) {0: [float_value: 1.0
, float_value: 2.0
], 1: [float_value: 1.0
], 2: [float_value: 2.0
], 3: [], 4: [], 5: [], 6: [], 7: [], 8: []} 3
I0813 12:01:07.108868  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Position"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [float_value: 1.0
], 7: [], 8: [], 9: [], 10: [float_value: 3.0
], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [float_value: 1.0
], 23: [], 24: [float_value: 2.0
], 25: []} 4
I0813 12:01:13.811882  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Club"
) {0: [], 1: [float_value: 1.0
], 2: [float_value: 2000.0
, date {
  year: 2000
}
], 3: [float_value: 60.0
], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: []} 3
I0813 12:01:19.658277  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Notes"
) {0: [float_value: 6.0
, float_value: 666.0
], 1: [float_value: 666.0
], 2: [float_value: 6.0
, float_value: 666.0
], 3: [float_value: 4.0
, float_value: 666.0
, float_value: 666.0
], 4: [], 5: [], 6: []} 4
I0813 12:01:24.497306  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Release date"
) {0: [date {
  year: 1943
  month: 7
  day: 2
}
], 1: [], 2: [date {
  year: 1943
  month: 1
  day: 25
}
, date {
  year: 1943
  month: 3
  day: 3
}
], 3: [date {
  year: 1943
  month: 1
  day: 16
}
], 4: [], 5: [date {
  year: 1943
  month: 9
  day: 22
}
], 6: [date {
  year: 1943
  month: 6
  day: 14
}
], 7: [], 8: [], 9: [date {
  year: 1943
  month: 4
  day: 10
}
], 10: [date {
  year: 1943
  month: 10
  day: 28
}
], 11: []} 7
I0813 12:01:31.469692  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Device"
) {0: [], 1: [float_value: 25.0
, float_value: 2.0
], 2: [float_value: 25.0
], 3: [], 4: [float_value: 3.0
], 5: [], 6: [float_value: 4.0
], 7: [float_value: 20.0
], 8: [], 9: [], 10: [float_value: 3.0
, float_value: 2.0
], 11: [], 12: [float_value: 3200.0
], 13: [], 14: [float_value: 1.0
], 15: [float_value: 3010.0
, float_value: 6010.0
], 16: [float_value: 720.0
], 17: [float_value: 4.0
], 18: [], 19: [float_value: 6000.0
], 20: [float_value: 630.0
], 21: [], 22: [float_value: 2.0
, float_value: 4.0
], 23: [float_value: 70.0
], 24: [float_value: 2.0
], 25: [float_value: 6240.0
], 26: [float_value: 2.0
]} 18
I0813 12:01:38.270512  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Mission result"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [float_value: 1977.0
, date {
  year: 1977
}
, float_value: 500.0
, float_value: 18.0
], 12: [], 13: [], 14: [], 15: [float_value: 200.0
, float_value: 7.099999904632568
], 16: [], 17: [float_value: 60.0
, float_value: 2.0
, float_value: 2.0999999046325684
, float_value: 4.400000095367432
], 18: [float_value: 2.0
, float_value: 4.400000095367432
], 19: [], 20: [float_value: 1.0
, float_value: 2.200000047683716
], 21: [], 22: [float_value: 200.0
, float_value: 7.099999904632568
], 23: [], 24: [float_value: 200.0
, float_value: 440.0
]} 7
I0813 12:01:45.194988  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Driver"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [float_value: 2.0
], 14: [], 15: [], 16: []} 1
I0813 12:01:50.528725  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Time/Retired"
) {0: [float_value: 1.0
, float_value: 36.0
, float_value: 24.226999282836914
], 1: [float_value: 16.4950008392334
], 2: [float_value: 23.351999282836914
], 3: [float_value: 42.62699890136719
], 4: [float_value: 43.933998107910156
], 5: [float_value: 47.775001525878906
], 6: [float_value: 53.59700012207031
], 7: [float_value: 54.119998931884766
], 8: [float_value: 54.43299865722656
], 9: [float_value: 54.749000549316406
], 10: [float_value: 1.0
, float_value: 7.539999961853027
], 11: [float_value: 1.0
, float_value: 11.298999786376953
], 12: [float_value: 1.0
], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: []} 13
I0813 12:01:55.352831  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "UK"
) {0: [], 1: [float_value: 31.0
], 2: [float_value: 31.0
], 3: [float_value: 168.0
], 4: [float_value: 108.0
], 5: [], 6: [], 7: [], 8: [], 9: []} 4
I0813 12:02:01.081506  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Storage"
) {0: [], 1: [float_value: 8.0
], 2: [float_value: 6.0
], 3: [], 4: [], 5: [float_value: 1600.0
, float_value: 9427.0
, float_value: 1.0
, float_value: 1.0
], 6: [], 7: [float_value: 9427.0
, float_value: 4.0
, float_value: 2.0
, float_value: 8000.0
], 8: [float_value: 32.0
, float_value: 18.0
, float_value: 8.0
, float_value: 32.0
, float_value: 1.0
, float_value: 9427.0
, float_value: 2.0
, float_value: 232.0
, float_value: 4.0
, float_value: 2.0
, float_value: 4.0
, float_value: 2.0
], 9: [float_value: 7.0
, float_value: 2.0
, float_value: 4.0
, float_value: 32.0
, float_value: 18.0
, float_value: 8.0
, float_value: 32.0
], 10: [float_value: 7.0
, float_value: 2.0
, float_value: 4.0
, float_value: 32.0
, float_value: 18.0
, float_value: 8.0
, float_value: 32.0
], 11: []} 7
I0813 12:02:10.043508  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Features"
) {0: [float_value: 2.5
, float_value: 720.0
, float_value: 7.300000190734863
], 1: [], 2: [float_value: 2.299999952316284
, float_value: 3.5
, float_value: 1080.0
], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [float_value: 1080.0
], 9: [float_value: 1080.0
], 10: [float_value: 9200.0
]} 5
I0813 12:02:14.576418  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "ARQB"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [float_value: 28.0
], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: []} 1
I0813 12:02:19.323728  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Club"
) {0: [], 1: [float_value: 1.0
], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: []} 1
I0813 12:02:27.257509  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Synopsis"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [float_value: 2.0
], 9: [], 10: []} 1
I0813 12:02:33.690273  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Film"
) {0: [], 1: [], 2: [], 3: [float_value: 3.0
], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [float_value: 7.0
], 11: [], 12: [], 13: [], 14: [], 15: [float_value: 101.0
], 16: []} 3
I0813 12:02:40.156911  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Record"
) {0: [float_value: 5.0
, float_value: 4.0
, float_value: 0.0
], 1: [], 2: [], 3: [float_value: 4.0
, float_value: 5.0
, float_value: 0.0
], 4: [], 5: [], 6: [], 7: [float_value: 21.0
, float_value: 23.0
, float_value: 0.0
], 8: [float_value: 90.0
, float_value: 76.0
, float_value: 2.0
], 9: [float_value: 17.0
, float_value: 8.0
, float_value: 0.0
], 10: [float_value: 59.0
, float_value: 37.0
, float_value: 1.0
], 11: [float_value: 515.0
, float_value: 376.0
, float_value: 0.0
], 12: [float_value: 634.0
, float_value: 328.0
, float_value: 0.0
], 13: [float_value: 145.0
, float_value: 92.0
, float_value: 0.0
], 14: [float_value: 115.0
, float_value: 113.0
, float_value: 0.0
], 15: [float_value: 167.0
, float_value: 126.0
], 16: [float_value: 17.0
, float_value: 8.0
, float_value: 0.0
], 17: []} 12
I0813 12:02:46.707394  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Laid down"
) {0: [], 1: [], 2: [], 3: [date {
  year: 1899
  month: 4
  day: 10
}
], 4: [date {
  year: 1899
  month: 4
  day: 10
}
], 5: [], 6: []} 2
I0813 12:02:49.988619  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Moved To"
) {0: [float_value: 1932.0
, date {
  year: 1932
}
], 1: [float_value: 2003.0
, date {
  year: 2003
}
], 2: [float_value: 1994.0
, date {
  year: 1994
}
], 3: [], 4: [float_value: 1987.0
, date {
  year: 1987
}
], 5: [float_value: 2008.0
, date {
  year: 2008
}
], 6: [float_value: 1987.0
, date {
  year: 1987
}
], 7: [], 8: []} 6
I0813 12:02:54.348928  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "TV"
) {0: [], 1: [], 2: [], 3: [], 4: [float_value: 12.0
], 5: [float_value: 12.0
], 6: [float_value: 12.0
], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: [], 25: [], 26: [], 27: [], 28: [], 29: []} 3
I0813 12:03:00.832618  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Motto"
) {0: [], 1: [], 2: [], 3: [float_value: 1.0
], 4: [], 5: [], 6: [], 7: []} 1
I0813 12:03:08.307635  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Player"
) {0: [date {
  month: 1
}
], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: []} 1
I0813 12:03:16.709165  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "College/Junior/Club Team"
) {0: [], 1: [], 2: [], 3: [], 4: [float_value: 67.0
], 5: [], 6: [], 7: [], 8: []} 1
I0813 12:03:23.995716  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Mouth coordinates"
) {0: [float_value: 33.0
, float_value: 30.0
, float_value: 38.0
, float_value: 117.0
, float_value: 45.0
, float_value: 12.0
, float_value: 33.510501861572266
, float_value: 117.7531967163086
], 1: [float_value: 33.0
, float_value: 32.0
, float_value: 28.0
, float_value: 117.0
, float_value: 44.0
, float_value: 13.0
, float_value: 33.541099548339844
, float_value: 117.73690032958984
], 2: [float_value: 33.0
, float_value: 32.0
, float_value: 32.0
, float_value: 117.0
, float_value: 42.0
, float_value: 16.0
, float_value: 33.542198181152344
, float_value: 117.70439910888672
], 3: [], 4: [], 5: [], 6: [], 7: [float_value: 33.0
, float_value: 37.0
, float_value: 42.0
, float_value: 117.0
, float_value: 40.0
, float_value: 52.0
, float_value: 33.628299713134766
, float_value: 117.68109893798828
], 8: []} 4
I0813 12:03:29.895905  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Points"
) {0: [float_value: 8.0
], 1: [float_value: 6.0
], 2: [float_value: 4.0
], 3: [float_value: 3.0
], 4: [float_value: 2.0
], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [float_value: 1.0
], 11: [], 12: [], 13: [], 14: []} 6
I0813 12:03:34.506606  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Main Functionality"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [float_value: 1.0
], 11: [], 12: [], 13: [], 14: [], 15: [], 16: []} 1
I0813 12:03:43.860592  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Developer(s)"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [float_value: 2.0
], 14: [], 15: [], 16: [], 17: []} 1
Processed: random-split-1-train.tfrecord
I0813 12:03:48.029445  7056 run_task_main.py:152] Processed: random-split-1-train.tfrecord
Num questions processed: 12276
I0813 12:03:48.030410  7056 run_task_main.py:152] Num questions processed: 12276
Num examples: 12276
I0813 12:03:48.030410  7056 run_task_main.py:152] Num examples: 12276
Num conversion errors: 0
I0813 12:03:48.031406  7056 run_task_main.py:152] Num conversion errors: 0
I0813 12:03:52.458602  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Title"
) {0: [], 1: [], 2: [float_value: 1.0
], 3: [], 4: [], 5: [float_value: 2.0
], 6: [float_value: 2.0
], 7: [], 8: [], 9: [float_value: 3.0
], 10: [], 11: [], 12: [], 13: [float_value: 4.0
], 14: [], 15: [], 16: [], 17: [], 18: [], 19: []} 5
I0813 12:03:59.948574  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Name"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [float_value: 21.0
, float_value: 2.0
], 14: [], 15: [], 16: [], 17: [float_value: 21.0
, float_value: 1.0
], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: [], 25: []} 2
I0813 12:04:09.289596  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Stamp set"
) {0: [float_value: 1873.0
, date {
  year: 1873
}
, float_value: 1973.0
, date {
  year: 1973
}
], 1: [float_value: 400.0
], 2: [float_value: 19.0
], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [float_value: 150.0
], 12: [], 13: [], 14: [], 15: []} 4
I0813 12:04:14.832739  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Nominated work"
) {0: [], 1: [], 2: [], 3: [float_value: 12.0
], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: []} 1
I0813 12:04:21.392229  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Name"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [float_value: 21.0
, float_value: 2.0
], 14: [], 15: [], 16: [], 17: [float_value: 21.0
, float_value: 1.0
], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: [], 25: []} 2
I0813 12:04:29.098621  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Club"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [float_value: 11.0
], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: []} 1
Processed: random-split-1-dev.tfrecord
I0813 12:04:30.313367  7056 run_task_main.py:152] Processed: random-split-1-dev.tfrecord
Num questions processed: 2265
I0813 12:04:30.317362  7056 run_task_main.py:152] Num questions processed: 2265
Num examples: 2265
I0813 12:04:30.318328  7056 run_task_main.py:152] Num examples: 2265
Num conversion errors: 0
I0813 12:04:30.318328  7056 run_task_main.py:152] Num conversion errors: 0
Padded with 7 examples.
I0813 12:04:30.327304  7056 run_task_main.py:152] Padded with 7 examples.
I0813 12:04:36.732208  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Comments"
) {0: [float_value: 202.60000610351562
, float_value: 126.0
, float_value: 201.1999969482422
, float_value: 125.0
, float_value: 9.0
], 1: [], 2: [float_value: 127.0999984741211
, float_value: 205.0
, float_value: 1905.0
], 3: [float_value: 2011.0
], 4: [float_value: 112.5
, float_value: 181.0
, float_value: 14.0
, float_value: 23.0
, float_value: 136.0
, float_value: 219.0
, float_value: 74.9000015258789
, float_value: 121.0
], 5: [float_value: 112.5
], 6: [float_value: 1.0
, float_value: 100.0
], 7: [float_value: 75.5
, float_value: 122.0
, float_value: 85.0
, float_value: 137.0
, float_value: 89.91999816894531
, float_value: 145.0
, float_value: 68.9000015258789
, float_value: 110.9000015258789
], 8: [float_value: 1.0
, float_value: 100.0
, float_value: 161.0
], 9: [float_value: 1934.0
, date {
  year: 1934
}
, float_value: 1.0
, float_value: 100.0
, float_value: 161.0
], 10: [], 11: [], 12: [float_value: 112.0
, float_value: 180.0
, float_value: 1.0
, float_value: 100.0
, float_value: 161.0
], 13: [float_value: 80.0
, float_value: 129.0
], 14: [float_value: 1.0
, float_value: 60.0
, float_value: 97.0
, float_value: 26.0
, float_value: 42.0
, float_value: 26.0
], 15: [], 16: [], 17: []} 12
I0813 12:04:45.114791  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Ora."
) {0: [float_value: 42.0
], 1: [float_value: 38.0
], 2: [float_value: 28.0
], 3: [float_value: 34.0
], 4: [float_value: 28.0
], 5: [float_value: 10.0
], 6: [float_value: 11.0
], 7: [float_value: 24.0
], 8: [], 9: [float_value: 32.0
], 10: [], 11: [float_value: 11.0
], 12: [], 13: [float_value: 28.0
], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [float_value: 10.0
], 23: []} 12
I0813 12:04:51.605434  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Invitational"
) {0: [float_value: 8.0
, float_value: 1999.0
, date {
  year: 1999
}
, float_value: 2000.0
, date {
  year: 2000
}
, float_value: 2001.0
, date {
  year: 2001
}
, float_value: 2005.0
, date {
  year: 2005
}
, float_value: 2006.0
, date {
  year: 2006
}
, float_value: 2007.0
, date {
  year: 2007
}
, float_value: 2009.0
, date {
  year: 2009
}
, float_value: 2013.0
, date {
  year: 2013
}
], 1: [], 2: [float_value: 1.0
, float_value: 2003.0
, date {
  year: 2003
}
], 3: [], 4: [float_value: 1.0
, float_value: 2010.0
, date {
  year: 2010
}
], 5: [], 6: []} 3
I0813 12:04:57.410878  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "National Cup"
) {0: [], 1: [float_value: 1.0
], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: [], 18: [], 19: [], 20: [], 21: [], 22: [], 23: [], 24: [], 25: [], 26: []} 1
I0813 12:05:02.878291  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Team"
) {0: [], 1: [], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [float_value: 2.0
], 8: [], 9: []} 1
I0813 12:05:09.434752  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "Time/Retired"
) {0: [float_value: 1.0
, float_value: 48.0
, float_value: 31.299999237060547
], 1: [float_value: 9.5
], 2: [float_value: 2.0
], 3: [float_value: 2.0
], 4: [float_value: 3.0
], 5: [float_value: 3.0
], 6: [float_value: 4.0
], 7: [float_value: 4.0
], 8: [], 9: [], 10: [], 11: [], 12: [], 13: [], 14: [], 15: [], 16: [], 17: []} 8
I0813 12:05:14.051411  7056 number_annotation_utils.py:150] Can't consolidate types: (None, text: "1941/42"
) {0: [], 1: [float_value: 116000.0
], 2: [float_value: 220000.0
], 3: [float_value: 71000.0
], 4: [], 5: [], 6: [float_value: 407000.0
]} 4
Processed: test.tfrecord
I0813 12:05:21.113520  7056 run_task_main.py:152] Processed: test.tfrecord
Num questions processed: 3012
I0813 12:05:21.117669  7056 run_task_main.py:152] Num questions processed: 3012
Num examples: 3012
I0813 12:05:21.118668  7056 run_task_main.py:152] Num examples: 3012
Num conversion errors: 0
I0813 12:05:21.119353  7056 run_task_main.py:152] Num conversion errors: 0
Padded with 28 examples.
I0813 12:05:21.151138  7056 run_task_main.py:152] Padded with 28 examples.

This created two directories in the "output" directory: "interactions" and "tf_examples". In the "tf_examples" directory, only the first random split of train + dev seems to have been created (screenshot: tapas_pic).

However, parsing these tfrecord files as strings (as explained in the TensorFlow docs) results in an error:

filenames = "./tf_examples/test.tfrecord"
raw_dataset = tf.data.TFRecordDataset(filenames)

for raw_record in raw_dataset.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(example)

DataLossError                             Traceback (most recent call last)
c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\eager\context.py in execution_mode(mode)
   1985       ctx.executor = executor_new
-> 1986       yield
   1987     finally:

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py in _next_internal(self)
    654             output_types=self._flat_output_types,
--> 655             output_shapes=self._flat_output_shapes)
    656 

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py in iterator_get_next(iterator, output_types, output_shapes, name)
   2362     except _core._NotOkStatusException as e:
-> 2363       _ops.raise_from_not_ok_status(e, name)
   2364   # Add nodes to the TensorFlow graph.

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\framework\ops.py in raise_from_not_ok_status(e, name)
   6652   # pylint: disable=protected-access
-> 6653   six.raise_from(core._status_to_exception(e.code, message), None)
   6654   # pylint: enable=protected-access

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\six.py in raise_from(value, from_value)

DataLossError: corrupted record at 0 [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

DataLossError                             Traceback (most recent call last)
 in 
      2 raw_dataset = tf.data.TFRecordDataset(filenames)
      3 
----> 4 for raw_record in raw_dataset.take(1):
      5   example = tf.train.Example()
      6   example.ParseFromString(raw_record.numpy())

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py in __next__(self)
    629 
    630   def __next__(self):  # For Python 3 compatibility
--> 631     return self.next()
    632 
    633   def _next_internal(self):

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py in next(self)
    668     """Returns a nested structure of `Tensor`s containing the next element."""
    669     try:
--> 670       return self._next_internal()
    671     except errors.OutOfRangeError:
    672       raise StopIteration

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py in _next_internal(self)
    659         return self._element_spec._from_compatible_tensor_list(ret)  # pylint: disable=protected-access
    660       except AttributeError:
--> 661         return structure.from_compatible_tensor_list(self._element_spec, ret)
    662 
    663   @property

~\AppData\Local\Programs\Python\Python36\lib\contextlib.py in __exit__(self, type, value, traceback)
     97                 value = type()
     98             try:
---> 99                 self.gen.throw(type, value, traceback)
    100             except StopIteration as exc:
    101                 # Suppress StopIteration *unless* it's the same exception that

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\eager\context.py in execution_mode(mode)
   1987     finally:
   1988       ctx.executor = executor_old
-> 1989       executor_new.wait()
   1990 
   1991 

c:\Users\niels.rogge\Documents\Python projecten\testing_tapas\env\lib\site-packages\tensorflow\python\eager\executor.py in wait(self)
     65   def wait(self):
     66     """Waits for ops dispatched in this executor to finish."""
---> 67     pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
     68 
     69   def clear_error(self):

DataLossError: corrupted record at 0

Am I doing something wrong here?

ghost commented 4 years ago

The data creation looks okay.

Note that only the files in the tf_examples directory are in TF Example format. The files in the interactions directory are also TFRecords, but they hold serialized interaction protos.

Which files are you trying to open?

I am wondering whether it's a TF 1 / 2 issue.

Can you try this:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

def iterate_examples(filepath):
  for value in tf.python_io.tf_record_iterator(filepath):
    i = tf.train.Example()
    i.ParseFromString(value)
    yield i
ghost commented 4 years ago

Actually, never mind: both code snippets should work fine.

TF examples are compressed with GZIP by default when using run_task_main.py:

flags.DEFINE_string(
    'compression_type',
    'GZIP',
    "Compression to use when reading tfrecords. '' for no compression.",
)

I think you need to specify this when reading the data as well:

tf.data.TFRecordDataset(
    filenames, compression_type="GZIP",
)
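A reader that does not know about the compression sees GZIP bytes where it expects TFRecord framing, which is consistent with the "corrupted record at 0" error. As a rough sketch using only the standard library, the file can be checked for the gzip magic bytes and its raw records iterated either way. The helper names `is_gzip` and `iter_raw_records` are hypothetical and not part of tapas or TensorFlow. The magic-byte check is a heuristic, and the CRC fields of the TFRecord framing are skipped rather than verified.

```python
import gzip
import struct

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def iter_raw_records(path):
    """Yield the serialized payload of each record in a TFRecord file.

    Handles both GZIP-compressed and uncompressed files. The CRC fields
    are read and discarded, not verified, to keep the sketch short.
    """
    opener = gzip.open if is_gzip(path) else open
    with opener(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 little-endian payload length
            if len(header) < 8:
                break  # end of file
            (length,) = struct.unpack("<Q", header)
            f.read(4)  # masked CRC32C of the length (ignored)
            payload = f.read(length)
            f.read(4)  # masked CRC32C of the payload (ignored)
            yield payload
```

Each yielded payload is the serialized proto, so for files in tf_examples it could be passed to `tf.train.Example().ParseFromString(payload)`.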
NielsRogge commented 4 years ago

Awesome, setting the compression_type parameter to "GZIP" works and lets me read in the data.

I ran the data creation again, without compression, because I'm using a package to read tfrecords as PyTorch datasets, and it currently does not support compression. It basically looks at each of the features and yields a dictionary mapping keys (feature names) to values (numpy arrays):

description = {
 'aggregation_function_id': 'int',
 'answer': 'float',
 'classification_class_index': 'int',
 'column_ids': 'int',
 'column_ranks': 'int',
 'input_ids': 'int',
 'input_mask': 'int',
 'inv_column_ranks': 'int',
 'label_ids': 'int',
 'numeric_relations': 'int',
 'numeric_values': 'float',
 'numeric_values_scale': 'float',
 'prev_label_ids': 'int',
 'question_id': 'byte',
 'question_id_ints': 'int',
 'question_numeric_values': 'float',
 'row_ids': 'int',
 'segment_ids': 'int',
 'table_id': 'byte',
 'table_id_hash': 'int'}

        features = {}
        for key, typename in description.items():
            if key not in all_keys:
                raise KeyError(f"Key {key} doesn't exist (select from {all_keys})!")
            # NOTE: We assume that each key in the example has only one field
            # (either "bytes_list", "float_list", or "int64_list")!
            field = example.features.feature[key].ListFields()[0]
            inferred_typename, value = field[0].name, field[1].value
            if typename is not None:
                tf_typename = typename_mapping[typename]
                if tf_typename != inferred_typename:
                    reversed_mapping = {v: k for k, v in typename_mapping.items()}
                    raise TypeError(f"Incompatible type '{typename}' for `{key}` "
                                    f"(should be '{reversed_mapping[inferred_typename]}').")

            # Decode raw bytes into respective data types
            if inferred_typename == "bytes_list":
                value = np.frombuffer(value[0], dtype=np.uint8)
            elif inferred_typename == "float_list":
                value = np.array(value, dtype=np.float32)
            elif inferred_typename == "int64_list":
                value = np.array(value, dtype=np.int32)
            features[key] = value

        yield features

However, for some reason, reading in the test set results in an overflow error:

    168                 value = np.array(value, dtype=np.float32)
    169             elif inferred_typename == "int64_list":
--> 170                 value = np.array(value, dtype=np.int32)
    171             features[key] = value
    172 

OverflowError: Python int too large to convert to C long

This might be a Windows-specific issue, so first I'll try it out in Google Colab.
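A plausible explanation, not confirmed in this thread: the int64_list field of a tf.train.Example holds 64-bit integers, while the decoder above casts them to np.int32, and on Windows the C long type is 32 bits wide, which matches the "C long" wording in the error message. Any value outside the 32-bit range would then fail to convert. The sketch below uses a hypothetical helper, `smallest_int_dtype`, to check whether a list of integers fits in 32 bits before a dtype is chosen. The sample values are made up for illustration.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def smallest_int_dtype(values):
    """Return 'int32' if every value fits in 32 bits, otherwise 'int64'."""
    if all(INT32_MIN <= v <= INT32_MAX for v in values):
        return "int32"
    return "int64"


# Hypothetical feature values: small ids versus a large hash-like value.
small_ids = [0, 1, 2]
large_value = [2**40]

print(smallest_int_dtype(small_ids))    # fits in 32 bits
print(smallest_int_dtype(large_value))  # needs 64 bits
```

With such a check, the decoder could fall back to np.int64 for features whose values do not fit, at the cost of a wider array.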

Thank you for your help!