google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

When mode="create_data", why are there some padded examples? #55

Closed lairikeqiA closed 4 years ago

lairikeqiA commented 4 years ago

To be specific, I took 10 examples from the test dataset to build a small test set. When I created the data, I got the following output: "Num questions processed:10", "Num examples:8", "Num conversion errors:2", "Padded with 24 examples." Why are there padded examples? How many examples are produced under different conditions?

In addition, I took the outputs of BERT's last layer and printed them. Why are there more outputs than there are samples in the test dataset? Why are there always some repeated outputs at the end?

eisenjulian commented 4 years ago

This happens because on TPUs the batch size has to be constant for all batches, including the last one. The examples at the end are therefore padded up to the next multiple of the batch size, and the padded entries are discarded after prediction. You can set the test_batch_size argument to the exact number of examples that you have (if it's small enough to fit) and create a single batch without any padding.
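The padding arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not the TAPAS implementation: the helper names `num_padding_examples` and `drop_padding` are hypothetical, and the batch size of 32 is an assumption inferred from the log in the question (8 examples + 24 padded = 32).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def num_padding_examples(num_examples: int, batch_size: int) -> int:
    """Return how many padding examples are needed to fill the last batch.

    Hypothetical helper for illustration; not part of the TAPAS code base.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Distance from num_examples up to the next multiple of batch_size
    # (zero when num_examples is already a multiple).
    return -num_examples % batch_size


def drop_padding(outputs: Sequence[T], num_examples: int) -> List[T]:
    """Keep only the outputs that correspond to real (non-padded) examples.

    Assumes the padding sits at the end, as described in the answer.
    """
    return list(outputs[:num_examples])


# Assumed batch size of 32: 8 real examples need 24 padding examples.
print(num_padding_examples(8, 32))
# Batch size equal to the number of examples: no padding at all.
print(num_padding_examples(8, 8))
```

Under these assumptions, setting the batch size equal to the number of examples gives a padding count of zero, which matches the suggestion of using test_batch_size to create a single unpadded batch. Any extra outputs at the end of a padded run can be removed with a slice such as `drop_padding(outputs, 8)`.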