google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Can't find code for extracting table-text examples from Wikipedia #99

Closed: rchanumo closed this issue 3 years ago

rchanumo commented 3 years ago

Hi,

I am unable to find the code for extracting table-text examples from Wikipedia in this repo. Could you kindly point me to it?

Thank you.

eisenjulian commented 3 years ago

Hi @rchanumo, sadly we do not have that code to share at this point, mainly because we use a pre-parsed internal Wikipedia dump as the source. However, we have shared the resulting pre-training dataset in https://github.com/google-research/tapas/blob/master/PRETRAIN_DATA.md
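As a minimal sketch of reading that released data, assuming (per PRETRAIN_DATA.md) that each TFRecord record is a serialized `Interaction` proto from `tapas.protos.interaction_pb2`; the file path below is a placeholder for a downloaded shard:

```python
# Minimal sketch: iterate a downloaded pre-training shard and inspect the
# table/text pairs. Assumes each record is a serialized Interaction proto
# (tapas.protos.interaction_pb2); the path below is a placeholder.
import tensorflow as tf
from tapas.protos import interaction_pb2


def iterate_interactions(tfrecord_path):
  """Yields Interaction protos from a TFRecord file."""
  for record in tf.data.TFRecordDataset(tfrecord_path):
    interaction = interaction_pb2.Interaction()
    interaction.ParseFromString(record.numpy())
    yield interaction


for interaction in iterate_interactions("interactions_sample.tfrecord"):
  table = interaction.table
  # Each interaction pairs one table with one or more text snippets,
  # stored as "questions" in the proto.
  texts = [q.original_text for q in interaction.questions]
  print(table.document_title, len(table.rows), texts[:1])
```

If the shards are gzip-compressed, pass `compression_type="GZIP"` to `TFRecordDataset`.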

That said, there's nothing special about the logic, and I believe there are other public repositories that parse tables directly from a Wikipedia dump. For example, there's a pipeline that works directly from HTML files here: https://github.com/wenhuchen/WikiTables-WithLinks/blob/master/preprocessing/pipeline.py, and there are a few other parsers for the custom Wikimedia markup format. Let me know if that helps!
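For completeness, here is a rough, self-contained illustration (not the pipeline linked above) of the kind of logic involved: fetch a rendered Wikipedia page, pull out each `wikitable`, and keep the nearest preceding paragraph as its accompanying text. The URL is just an example.

```python
# Rough illustration of extracting table-text pairs from a rendered
# Wikipedia page (not the linked pipeline): each "wikitable" is paired
# with the paragraph that precedes it.
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup


def extract_table_text_pairs(url):
  html = requests.get(url, timeout=30).text
  soup = BeautifulSoup(html, "html.parser")
  pairs = []
  for table_tag in soup.find_all("table", class_="wikitable"):
    # Use the nearest preceding paragraph as the accompanying text.
    context = table_tag.find_previous("p")
    text = context.get_text(" ", strip=True) if context else ""
    table = pd.read_html(StringIO(str(table_tag)))[0]
    pairs.append((text, table))
  return pairs


pairs = extract_table_text_pairs(
    "https://en.wikipedia.org/wiki/Python_(programming_language)")
for text, table in pairs[:1]:
  print(text[:200])
  print(table.head())
```

A production pipeline would also need to handle merged cells and infoboxes and filter out very small or malformed tables, which is where most of the real work is.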

rchanumo commented 3 years ago

Thank you for the pointer @eisenjulian. @wenhuchen's code is sufficient for my use case.