We need a script to generate a dataset for our experiments. Our current dataset is the ALTA-2010 Shared Task data. In case we ever need more languages, different annotation, shorter texts, or anything else, we need to be able to generate a similar dataset.
Steps needed:
Download Wikipedia dumps. The dumps are named in the format xxwiki, where xx is the language code. All we need here are the "current versions only" dumps.
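A minimal download sketch, assuming the standard URL layout on dumps.wikimedia.org; pages-meta-current.xml.bz2 is the file labelled "current versions only" there, so swap in pages-articles if the original dataset used that variant instead:

```python
# Sketch: fetch one language's "current versions only" dump.
# ASSUMPTION: standard dump URL layout on dumps.wikimedia.org;
# pages-meta-current.xml.bz2 is the file labelled "current versions only".
import requests

def download_dump(lang_code, out_path=None):
    """Download the xxwiki current-versions dump for a language code like "en"."""
    name = f"{lang_code}wiki-latest-pages-meta-current.xml.bz2"
    url = f"https://dumps.wikimedia.org/{lang_code}wiki/latest/{name}"
    out_path = out_path or name
    # Stream to disk so multi-GB dumps never sit in memory at once.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return out_path
```

For example, download_dump("de") writes dewiki-latest-pages-meta-current.xml.bz2 to the working directory.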
Apply the methodology of the paper. You can easily get the interlanguage links from the corresponding wiki page: use the BeautifulSoup4 library and find tags with the class "interlanguage-link".
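A minimal scraping sketch, assuming the desktop article HTML where each language link sits in a tag with the "interlanguage-link" class; the function name and the example title are just placeholders:

```python
# Sketch: collect the interlanguage links of one article from the live site.
# ASSUMPTION: desktop article HTML, where each language link sits in a tag
# with the class "interlanguage-link"; the example title is a placeholder.
import requests
from bs4 import BeautifulSoup

def interlanguage_links(lang_code, title):
    """Return (language code, URL) pairs for an article's interlanguage links."""
    url = f"https://{lang_code}.wikipedia.org/wiki/{title}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    pairs = []
    for tag in soup.find_all(class_="interlanguage-link"):
        a = tag.find("a")
        if a is not None and a.get("href"):
            # hreflang holds the target language code, e.g. "de".
            pairs.append((a.get("hreflang"), a["href"]))
    return pairs

# e.g. interlanguage_links("en", "Question_answering")
```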
I committed utility code to help with downloading Wikipedia dump files. It can also be used for other purposes. You'll need BeautifulSoup4 and requests to run it.