Closed wenhuchen closed 4 years ago
We are in the process of checking whether we can release all or a part of the data. I'll update in a few weeks.
In the meantime, I can give a concrete example.
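Take a Wikipedia page such as this one.
For its medal table, we extract the table itself and the following text snippets:
- the page title "IHF World Women's Outdoor Handball Championship"
- the page description "IHF World Women's Outdoor Handball Championship was ...-1960."
- the segment title "Medal table"
- caption and segment text (empty in this case)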
@muelletm Is the wikitables data the same as this: http://websail-fe.cs.northwestern.edu/TabEL/index.html ?
No, it's not the same data. TabEL, while also extracted from Wikipedia, is quite a bit smaller (I think it doesn't contain info-tables, for example). We actually tried pre-training on that data, but the results were quite a bit worse.
We hope we will be able to update on this in early June.
Hi @muelletm, thanks for the clarifications. But I don't see how or where the questions/queries are taken from. You have described the textual data and the table data; please clarify the "query" part of the dataset, since its use is mentioned in the paper.
The query part is what we call the text snippet in the paper.
It's text we find on the page that contains the table. I gave an example in my post above:
We are in the process of checking whether we can release all or a part of the data. I'll update in a few weeks.
In the meantime, I can give a concrete example.
Take a Wikipedia page such as this one.
For its medal table, we extract the table itself and the following text snippets:
- the page title "IHF World Women's Outdoor Handball Championship"
- the page description "IHF World Women's Outdoor Handball Championship was ...-1960."
- the segment title "Medal table"
- caption and segment text (empty in this case)
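To make the snippet/table pairing concrete, here is a minimal Python sketch. It is only an illustration of the description above, not the code or the file format used for the released data: the class `TableWithSnippets`, the function `to_pretraining_pairs`, and the placeholder table cells are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Table = List[List[str]]


@dataclass
class TableWithSnippets:
    """Hypothetical container for one table and the text found on its page."""
    table: Table
    page_title: str = ""
    page_description: str = ""
    segment_title: str = ""
    caption: str = ""
    segment_text: str = ""

    def snippets(self) -> List[str]:
        """Return the non-empty text snippets associated with the table."""
        candidates = [
            self.page_title,
            self.page_description,
            self.segment_title,
            self.caption,
            self.segment_text,
        ]
        return [s for s in candidates if s.strip()]


def to_pretraining_pairs(example: TableWithSnippets) -> List[Tuple[str, Table]]:
    """Pair every snippet with the table; the snippet plays the 'query' role."""
    return [(snippet, example.table) for snippet in example.snippets()]


# Placeholder cells only; the real medal table contents are not reproduced here.
example = TableWithSnippets(
    table=[["col_a", "col_b"], ["x", "y"]],
    page_title="IHF World Women's Outdoor Handball Championship",
    page_description="IHF World Women's Outdoor Handball Championship was ...-1960.",
    segment_title="Medal table",
    caption="",       # empty in this case
    segment_text="",  # empty in this case
)

pairs = to_pretraining_pairs(example)
for snippet, _ in pairs:
    print(snippet)
```

Empty fields (caption and segment text here) are simply skipped, so this example yields three snippet/table pairs.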
Happy to announce that we released the data:
https://github.com/google-research/tapas/blob/master/PRETRAIN_DATA.md
Thanks @muelletm
Hi, I'm quite curious about what the pre-training data looks like. Is there any chance you could include a small subsample in the repo?