hermitdave / FrequencyWords

Repository for Frequency Word List Generator and processed files
MIT License

Generate Dataset for OpenSubtitles 2018 #9

Closed hermitdave closed 5 years ago

hugolpz commented 5 years ago

See http://opus.nlpl.eu/OpenSubtitles2018.php :smile: So we have it at hand! :smiley:

Method:

I found the Subtlex-pl (2014) discussion interesting. I cross-checked it with Lison & Tiedemann (2016) below; they barely mention duplicates. So maybe Lison & Tiedemann already did the de-duplication work (very likely, given the extensive processing on the data).

Subtlex-pl (2014): article, non-free. The section on data clean-up is interesting. Some parts are within our reach, others are not. Three methods are mentioned (a rough code sketch of (1) and (2) follows the quoted passage below):

  1. Remove non-target-language files (~5% of files): check whether the corpus-wide top 30 word types cover at least 10% of the tokens in each file.
  2. Remove duplicates or near-variants of files (~80% of files).
  3. Remove non-words and proper names: check whether words are also accepted by a publicly available spellchecker.

Corpus compilation, cleaning, and processing

We processed about 105,000 documents containing film and television subtitles flagged as Polish by the contributors of http://opensubtitles.org. All subtitle-specific text formatting was removed before further processing.

(1) To detect documents containing large portions of text in languages other than Polish, we first calculated preliminary word frequencies on the basis of all documents and then removed from the corpus all files in which the 30 most frequent types did not cover at least 10 % of a total count of tokens in the file. Using this method, 5,365 files were removed from the corpus.

(2) Because many documents are available in multiple versions, it was necessary to remove duplicates from the corpus. To do so, we first performed a topic analysis using Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003), assigning each file to one of 600 clusters. If any pair of files within a cluster had an overlap of at least 10 % unique word-trigrams, the file with the highest number of hapax legomena (words occurring only once) was removed from the corpus, since more words occurring once would indicate more misspellings. After removing duplicates, 27,767 documents remained, containing about 146 million tokens (individual strings, including punctuation marks, numbers, etc.).

(3) From these, 101 million tokens (449,300 types) were accepted as correctly spelled Polish words by the Aspell spell-checker (http://aspell.net/; Polish dictionary available at ftp://ftp.gnu.org/gnu/aspell/dict/pl/) and consisted only of legal Polish, alphabetical characters. All words were converted to lowercase before spell-checking. Because Aspell rejects proper names spelled with lowercase, this number does not include proper names.
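For illustration, here is a rough Python sketch of checks (1) and (2) under simplifying assumptions of mine: plain-text subtitle files, a naive tokenizer, and no LDA clustering step (the paper only compares files within topic clusters). The paths and function names are illustrative, not part of FrequencyWords:

```python
import re
from collections import Counter
from pathlib import Path

TOKEN_RE = re.compile(r"\w+", re.UNICODE)

def tokenize(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]

def coverage_suspects(folder, top_n=30, min_coverage=0.10):
    """Check (1): return files whose tokens are covered by the corpus-wide
    top_n most frequent types for less than min_coverage of their length;
    these are candidates for removal as non-target-language files."""
    files = list(Path(folder).glob("*.txt"))
    corpus_counts = Counter()
    per_file_tokens = {}
    for f in files:
        tokens = tokenize(f.read_text(encoding="utf-8", errors="ignore"))
        per_file_tokens[f] = tokens
        corpus_counts.update(tokens)
    top_types = {w for w, _ in corpus_counts.most_common(top_n)}
    return [f for f, tokens in per_file_tokens.items()
            if tokens and sum(t in top_types for t in tokens) / len(tokens) < min_coverage]

def unique_trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def looks_like_duplicate(tokens_a, tokens_b, threshold=0.10):
    """Check (2), simplified: treat two files as probable duplicates when they
    share at least `threshold` of their unique word trigrams (relative to the
    smaller set); the paper then drops the file with more hapax legomena."""
    tri_a, tri_b = unique_trigrams(tokens_a), unique_trigrams(tokens_b)
    if not tri_a or not tri_b:
        return False
    return len(tri_a & tri_b) / min(len(tri_a), len(tri_b)) >= threshold
```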

P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf .

Furthermore, the administrators of OpenSubtitles have introduced over the last years various mechanisms to sanitise their database and remove duplicate, spurious or misclassified subtitles.

Also related to: #2

hermitdave commented 5 years ago

@hugolpz apologies for the delays.. I downloaded the tar.gz files on another machine and never got around to running it. Based on early user feedback, I implemented checks to ensure duplicates were reduced. The subtitle folders often contain multiple subtitles, so I have taken to only picking up a single file when multiple files are found. Programmatically identifying the language is harder.

I could do basic checks, like: this language should only have a Latin / Cyrillic character set.
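A minimal sketch of what such a character-set check could look like in Python; the script buckets and the 90% threshold are my assumptions, not anything the generator currently does:

```python
import unicodedata

def script_of(ch):
    """Rough script bucket based on the Unicode character name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "latin"
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    return "other"

def mostly_in_scripts(text, allowed=("latin",), min_ratio=0.9):
    """True if at least min_ratio of the alphabetic characters fall in the
    allowed scripts, e.g. allowed=("cyrillic",) for Russian subtitles."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return True
    ok = sum(1 for c in letters if script_of(c) in allowed)
    return ok / len(letters) >= min_ratio
```

A French file could then be required to pass `mostly_in_scripts(text, ("latin",))`, while a Russian one would use `("cyrillic",)`.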

hermitdave commented 5 years ago

I am going to use this project to do language detection https://github.com/TechnikEmpire/language-detection

It's a .NET port of https://code.google.com/archive/p/language-detection/

hugolpz commented 5 years ago

Hello Dave, cool to see you back. Thanks for considering my input.

I could do basic checks, like: this language should only have a Latin / Cyrillic character set.

I think "latin / cyrillic" here would mean Russian and similar languages, as subtitles often contain Latin words in all languages, at least "ok" and other basic ones.

As for:

I am going to use this project to do language detection

  • Java (original): over 99% precision for 53 languages
  • .NET port
  • Python port: supports 55 languages out of the box (see the sketch below)
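If it helps, a minimal usage sketch of the Python port (the langdetect package on PyPI, installed with `pip install langdetect`); the sample strings are just placeholders:

```python
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # make detection deterministic across runs

print(detect("Ceci est un sous-titre en français."))          # -> 'fr'
print(detect_langs("Well, ok, this one is mostly English."))  # -> e.g. [en:0.99...]
```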

Priority is to get the job done. Then, prefer more popular languages (Python, JS, Java) so the community can jump in if it wants to.

Or others:

hermitdave commented 5 years ago

Thanks @hugolpz, I have reworked the code and added language lookup using .NET for now. It's slowly chugging along generating the files as we speak. I am downloading the dataset again - they changed from tar.gz to zip, so I wanted to ensure I had the latest set.

I will start uploading soon

hugolpz commented 5 years ago

Ahahahahahahahahaha. I didn't expect it that fast :+1:

I tried to create my own list and bumped into a lot of pollution. Sharing my findings with you: lots of English names and basic English words in French subtitles.

Noise: lots of character names and a bunch of basic English words. I reviewed and cleaned up the 6,000th-to-8,000th slice of the list: out of 2,000 items, I had to make 206 edits (10%) and 82 deletions (4%) (for the diff, open view-source:https://lingualibre.fr/index.php?title=List%3AFra%2Fsubtlex-for-user-Penegal-06001-to-08000&type=revision&diff=83866&oldid=83864 and search for "diff-deletedline" and "diff-addedline").

On my side, I'm building a list of "words to preprocess before calculating stats". Some are to delete (individual names). Some are to edit, to force all-lowercase or to restore capitalization. This kind of thing?
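Something like this, perhaps? A hypothetical sketch of applying such a preprocessing list before counting; the mapping entries and names are only examples of mine:

```python
from collections import Counter

# Entries mapping to None are deleted (e.g. character names);
# other entries are rewritten (forced lowercase, restored capitalization, ...).
PREPROCESS = {
    "batman": None,      # individual/character name: drop
    "OK": "ok",          # normalize casing
    "paris": "Paris",    # restore capitalization
}

def apply_preprocess(tokens):
    for tok in tokens:
        repl = PREPROCESS.get(tok, tok)
        if repl is not None:
            yield repl

def count_frequencies(tokens):
    return Counter(apply_preprocess(tokens))

print(count_frequencies(["OK", "batman", "paris", "va", "bien"]))
# Counter({'ok': 1, 'Paris': 1, 'va': 1, 'bien': 1})
```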

hermitdave commented 5 years ago

@hugolpz I am at home (case of mild flu), which means I can do other things.

That's pretty neat 👍 I force lowercasing to ensure a small reduction in noise.

hermitdave commented 5 years ago

Maybe I should create another output with the words / characters that were filtered out?
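Something like this could work; a small sketch where accepted words go to one file and rejected tokens to a second one (the filter and file names here are placeholders, not what the generator actually does):

```python
from collections import Counter

def is_kept(token):
    # stand-in filter: keep purely alphabetic tokens
    return token.isalpha()

def write_outputs(tokens, kept_path="fr_full.txt", rejected_path="fr_filtered_out.txt"):
    kept, rejected = Counter(), Counter()
    for tok in tokens:
        (kept if is_kept(tok) else rejected)[tok.lower()] += 1
    for path, counts in ((kept_path, kept), (rejected_path, rejected)):
        with open(path, "w", encoding="utf-8") as fh:
            for word, count in counts.most_common():
                fh.write(f"{word} {count}\n")
```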

hugolpz commented 5 years ago

Yes, forcing lowercase is smart. For names, I had to inject the capitalization back manually. But I'm not sure it's really worth it, as my end goal is to record words: my recordings can be all lowercase as well.