kanerv / TuukkaSaanaKanerva

Course project for building NLP applications

Use some other data, e.g. from Finnish Wikipedia, as in the example. #13

Closed TuukkaOT closed 3 years ago

TuukkaOT commented 3 years ago
  1. If you like, you can use other data, for instance a Wikipedia dump for another language, such as Finnish: https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/. First, download a Wikipedia XML file and uncompress it with bunzip2. Then convert the XML to plain text with the Perl script xml2txt.pl, which is available for download on the same web page. Use the option -articles to preserve the article tags: perl xml2txt.pl -articles INPUT_FILE.xml OUTPUT_FILE.txt
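For anyone who prefers to drive these steps from Python, here is a minimal sketch. It assumes perl and xml2txt.pl are available in the working directory; the function names (decompress_bz2, xml2txt_command, convert_to_text) are made up for illustration and are not part of the project or of linguatools. Only the -articles option and the argument order come from the instructions above.

```python
import bz2
import shutil
import subprocess


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file in chunks (like bunzip2, but the
    compressed file is kept and nothing is loaded fully into memory)."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


def xml2txt_command(xml_path, txt_path, script="xml2txt.pl"):
    """Build the conversion command quoted in the instructions."""
    return ["perl", script, "-articles", xml_path, txt_path]


def convert_to_text(xml_path, txt_path, script="xml2txt.pl"):
    """Run the Perl script; raises CalledProcessError if it fails."""
    subprocess.run(xml2txt_command(xml_path, txt_path, script), check=True)
```

With placeholder file names, the calls would be decompress_bz2("INPUT_FILE.xml.bz2", "INPUT_FILE.xml") followed by convert_to_text("INPUT_FILE.xml", "OUTPUT_FILE.txt").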
TuukkaOT commented 3 years ago

My computer crashed when I ran the code with the Wikipedia dump, so for now I limited the app to iterate over only the first 1000 lines. Also, the dump was >900GB, so GitHub won't accept it. I changed the code so that the user inputs a path to a plain text file that they have locally. Also, when you convert the file with xml2txt.pl, remember to add -articles as a parameter, because otherwise the program cannot split the articles.
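A rough sketch of what such a line limit can look like, not necessarily how the app does it. The function name read_first_lines and the UTF-8 default are assumptions for illustration.

```python
from itertools import islice


def read_first_lines(path, max_lines=1000, encoding="utf-8"):
    """Return at most max_lines lines from the start of a text file.

    islice stops reading as soon as the limit is reached, so the rest
    of the file is never loaded into memory.
    """
    with open(path, encoding=encoding) as f:
        return list(islice(f, max_lines))
```

The path would come from the user, for example read_first_lines(input("Path to the text file: ")).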

kanerv commented 3 years ago

I tried to run our program with the Finnish Wikipedia corpus. It worked when the program accessed only the first 100000 lines. I removed that piece of code and tried with the whole file, but my computer wasn't able to handle it either. I was really hopeful since it ran for several minutes, but sadly it crashed at the end.

I don't know how to improve our program to handle bigger files with the capacity of our home laptops. I was just gonna ask this on Slack, but I saw you had gone there already Tuukka! 😄 Let's see if Mathias and Raul have any suggestions!
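One idea that might help, as a sketch only: read the file article by article with a generator instead of loading everything at once. This assumes that the -articles output closes each article with a </article> tag; the function name iter_articles is made up here. Whether it prevents the crash depends on what the program builds afterwards, since an index over every article can still be large.

```python
def iter_articles(path, end_tag="</article>", encoding="utf-8"):
    """Yield one article at a time from a plain text corpus file.

    Only the lines of the current article are held in memory.
    The end_tag default is an assumption about the xml2txt.pl output.
    """
    buffer = []
    with open(path, encoding=encoding) as f:
        for line in f:
            # A line may contain the end tag; split the line around it.
            while end_tag in line:
                before, _, line = line.partition(end_tag)
                buffer.append(before)
                text = "".join(buffer).strip()
                if text:
                    yield text
                buffer = []
            buffer.append(line)
    # Anything left after the last end tag is treated as a final article.
    tail = "".join(buffer).strip()
    if tail:
        yield tail
```

The program could then loop with for article in iter_articles(path) and process each article before moving on to the next one.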

TuukkaOT commented 3 years ago

I'm just glad the problem wasn't my computer :) I added a progress bar for when the file is loading, so at least it's less frustrating to wait knowing that the program is still running.
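For reference, a progress bar like this can be done with the standard library alone. This is a minimal sketch and not necessarily the code in the repository: the name iter_lines_with_progress is hypothetical, and it assumes the file is UTF-8. Progress is measured in bytes read against the file size, so the total number of lines does not need to be known in advance.

```python
import os
import sys


def iter_lines_with_progress(path, out=sys.stderr, bar_width=30):
    """Yield decoded lines from a file while drawing a text progress bar."""
    total = os.path.getsize(path)
    done = 0
    last_percent = -1
    with open(path, "rb") as f:
        for raw in f:
            done += len(raw)
            percent = 100 * done // total
            # Redraw only when the whole-number percentage changes.
            if percent != last_percent:
                filled = bar_width * done // total
                bar = "#" * filled + "-" * (bar_width - filled)
                out.write(f"\r[{bar}] {percent:3d}%")
                out.flush()
                last_percent = percent
            yield raw.decode("utf-8", errors="replace")
    out.write("\n")
```

Because it is a generator, it can wrap any loop over the file, for example for line in iter_lines_with_progress(path).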