Closed rchanumo closed 3 years ago
Hi @rchanumo, sadly we do not have that code to share at this point, mainly because we are using a pre-parsed internal Wikipedia dump as a source. But we have shared the resulting pre-training dataset in https://github.com/google-research/tapas/blob/master/PRETRAIN_DATA.md
That said, there's nothing special about the logic, and I believe there are other public repositories that contain code to parse tables directly from a Wikipedia dump. For example, there's a pipeline that works directly from HTML files here: https://github.com/wenhuchen/WikiTables-WithLinks/blob/master/preprocessing/pipeline.py, and there are a few other parsers for the custom MediaWiki wikitext format. Let me know if that helps!
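For readers looking for a starting point, the kind of table extraction referenced above can be sketched with the standard library alone. This is not the TAPAS pipeline or the WikiTables code; it is a minimal, illustrative parser that assumes HTML input (the sample markup and class name are made up for the example) and ignores complications like nested tables, `colspan`, and links:

```python
# Minimal sketch: collect tables from Wikipedia-style HTML using only the
# standard library. Illustrative only; real dumps need more robust handling
# (nested tables, colspan/rowspan, cell markup, etc.).
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects each <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._rows = None  # rows of the table currently open, or None
        self._cells = None # cells of the row currently open, or None
        self._buf = None   # text fragments of the cell currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._buf = []

    def handle_data(self, data):
        # Only record text while inside a cell.
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buf is not None:
            self._cells.append("".join(self._buf).strip())
            self._buf = None
        elif tag == "tr" and self._cells is not None:
            self._rows.append(self._cells)
            self._cells = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


# Sample input standing in for a page from an HTML dump.
html = """
<p>Surrounding article text.</p>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Zurich</td><td>434008</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)
print(parser.tables)
# [[['City', 'Population'], ['Zurich', '434008']]]
```

Pairing each extracted table with the article text around it (for table–text examples) would then be a matter of also buffering text outside `<table>` elements.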
Thank you for the pointer @eisenjulian. @wenhuchen's code is sufficient for my use case.
Hi,
I am unable to find the code for extracting table–text examples from Wikipedia in this repo. Could you kindly point me to it?
Thank you.