google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0
1.15k stars 217 forks

About the pre-training data #2

Closed wenhuchen closed 4 years ago

wenhuchen commented 4 years ago

Hi, I'm quite curious about what the pre-training data looks like. Is there any chance you could share a small subsample in the repo?

muelletm commented 4 years ago

We are in the process of checking whether we can release all or a part of the data. I'll update in a few weeks.

In the meantime, I can give a concrete example.

For a Wikipedia page such as this one.

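For the Medal Table, we will extract the table and then the following snippets:

  • the page title "IHF World Women's Outdoor Handball Championship"
  • the page description "IHF World Women's Outdoor Handball Championship was ...-1960."
  • the segment title "Medal table"
  • caption and segment text (empty in this case)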

vibhavagarwal5 commented 4 years ago

@muelletm Is the wikitables data the same as this: http://websail-fe.cs.northwestern.edu/TabEL/index.html ?

muelletm commented 4 years ago

No, it's not the same data. TabEL, while also extracted from Wikipedia, is quite a bit smaller (I think it doesn't contain info-tables, for example). We actually tried pre-training on that data, but the results were quite a bit worse.

We hope we will be able to update on this in early June.

shashankMadan-designEsthetics commented 4 years ago

We are in the process of checking whether we can release all or a part of the data. I'll update in a few weeks.

In the meantime, I can give a concrete example.

For a Wikipedia page such as this one.

For the Medal Table, we will extract the table and then the following snippets:

  • the page title "IHF World Women's Outdoor Handball Championship"
  • the page description "IHF World Women's Outdoor Handball Championship was ...-1960."
  • the segment title "Medal table"
  • caption and segment text (empty in this case)

Hi @muelletm, thanks for the clarifications. But I don't see how or where the questions/queries are taken from. You have mentioned the textual data and the table data; please clarify the "query" part of the dataset, as its use was mentioned in the paper.

muelletm commented 4 years ago

The query part is what we call the text snippet in the paper.

It's text we find on the page of the table. I gave an example in my post above:

  • the page title "IHF World Women's Outdoor Handball Championship"
  • the page description "IHF World Women's Outdoor Handball Championship was ...-1960."
  • the segment title "Medal table"
  • caption and segment text (empty in this case)
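
The extraction described above can be sketched as a small data structure: one table paired with each non-empty text snippet found on the same page. This is only an illustration under stated assumptions. The names `Table`, `PageSnippets`, and `make_pairs` are hypothetical and are not taken from the TAPAS code base, and the table cells are placeholders rather than the real contents of the medal table.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Table:
    """A table as a header row plus data rows (hypothetical structure)."""
    header: List[str]
    rows: List[List[str]]


@dataclass
class PageSnippets:
    """Text snippets found on the page that contains the table.

    The five fields mirror the snippet types listed in the thread;
    the field names themselves are assumptions made for this sketch.
    """
    page_title: str = ""
    page_description: str = ""
    segment_title: str = ""
    caption: str = ""
    segment_text: str = ""

    def non_empty(self) -> List[Tuple[str, str]]:
        """Return (field_name, text) for every snippet that has content."""
        fields = [
            ("page_title", self.page_title),
            ("page_description", self.page_description),
            ("segment_title", self.segment_title),
            ("caption", self.caption),
            ("segment_text", self.segment_text),
        ]
        return [(name, text) for name, text in fields if text.strip()]


def make_pairs(table: Table, snippets: PageSnippets) -> List[Tuple[str, Table]]:
    """Pair each non-empty snippet with the table it was found next to."""
    return [(text, table) for _, text in snippets.non_empty()]


if __name__ == "__main__":
    # Snippet strings are the ones quoted in the thread; caption and
    # segment text are left empty, as in the example.
    snippets = PageSnippets(
        page_title="IHF World Women's Outdoor Handball Championship",
        page_description="IHF World Women's Outdoor Handball Championship was ...-1960.",
        segment_title="Medal table",
    )
    # Placeholder table: the header and cells are NOT the real table contents.
    table = Table(header=["Rank", "Nation", "Total"],
                  rows=[["1", "<nation>", "<total>"]])
    for text, _ in make_pairs(table, snippets):
        print(text)
```

In this sketch the empty caption and segment text are simply skipped, so the example page yields three snippet/table pairs. How the released data actually stores these pairs is described in the PRETRAIN_DATA.md file, not here.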

muelletm commented 4 years ago

Happy to announce that we released the data:

https://github.com/google-research/tapas/blob/master/PRETRAIN_DATA.md

vibhavagarwal5 commented 4 years ago

Thanks @muelletm