summerlight / anlp

Applied Natural Language Processing project
Apache License 2.0

Collect dataset #6

Closed summerlight closed 8 years ago

summerlight commented 8 years ago

We need a script to generate datasets for experiments. Our current dataset is the one from the ALTA-2010 Shared Task. If we need more languages, different annotations, shorter texts, or anything else, we should be able to generate a similar dataset ourselves.

Steps needed:

  1. Download Wikipedia dumps. Each dump is named xxwiki, where xx is the language code. All we need here are the "current versions only" dumps.
  2. Extract only the text using wikiextractor.
  3. Apply the methodology of the paper. You can easily get the interlanguage links from the corresponding wiki page (use the BeautifulSoup4 library and find tags with the class "interlanguage-link").
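
For step 3, the interlanguage-link lookup could be sketched roughly as follows. This is only an illustration, not code from the repository: the function name `interlanguage_links` is made up, and it assumes each link is an `<li class="interlanguage-link">` wrapping an `<a>` that carries `hreflang` and `href` attributes, which should be checked against the current Wikipedia markup.

```python
from bs4 import BeautifulSoup


def interlanguage_links(html):
    """Map language code -> article URL for one rendered wiki page.

    Hypothetical helper. Assumes every interlanguage entry is an
    <li class="interlanguage-link"> containing an <a hreflang=... href=...>.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    for item in soup.find_all("li", class_="interlanguage-link"):
        anchor = item.find("a", href=True)
        if anchor is None:
            continue  # entry without a usable link, skip it
        lang = anchor.get("hreflang") or anchor.get("lang")
        if lang:
            links[lang] = anchor["href"]
    return links
```

The function takes the page HTML as a string, so fetching the page (for example with requests) stays separate and the parsing can be tried on a saved page first.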

summerlight commented 8 years ago

I committed a utility script to help download Wikipedia dump files. It can also be used for other purposes. You'll need BeautifulSoup4 and requests to run it.
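
The committed utility itself is not shown in this thread. For anyone who just wants a single dump, a minimal standard-library sketch might look like the one below. The names `dump_url` and `download` are hypothetical, the URL layout is the one used under dumps.wikimedia.org/xxwiki/latest/, and the default `pages-articles` file is an assumption (it is the one commonly fed to wikiextractor); pass `pages-meta-current` if the file literally labeled "current versions only" is wanted.

```python
import shutil
import urllib.request

DUMP_HOST = "https://dumps.wikimedia.org"


def dump_url(lang, kind="pages-articles"):
    """Build the URL of the latest dump for one language (hypothetical helper).

    Dump names use underscores where language codes use hyphens,
    e.g. "zh-min-nan" -> "zh_min_nanwiki".
    """
    wiki = lang.replace("-", "_") + "wiki"
    return f"{DUMP_HOST}/{wiki}/latest/{wiki}-latest-{kind}.xml.bz2"


def download(url, dest_path):
    """Stream a (potentially very large) file to disk without loading it into memory."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

For example, `download(dump_url("ko"), "kowiki.xml.bz2")` would fetch the Korean dump into the current directory.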

mytony commented 8 years ago

move on to step 3