Hyphens and spaces omitted in scraped words

raymelon / tagalog-dictionary-scraper

Builds a Tagalog dictionary by collecting Tagalog words from tagalog.pinoydictionary.com

GNU General Public License v3.0

26 stars, 15 forks

Hi @dkalantzi, I'm building an application like Duolingo, but for the native dialects here in the Philippines. I came across this Tagalog web scraper by @raymelon, so here is my response to your issue:

You can change the regular expression on line 161 of collect_tagalog.py

https://github.com/raymelon/tagalog-dictionary-scraper/blob/37b6c0befd023b93dc8b7eee4a866ae6d21a0f14/collect_tagalog.py#L161

with:

all_words.append(re.compile(r'[ ]?\(+?[\s\w\d\W]+\)').sub('', word.next.next).lower())

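The '[ ]' matches a single literal space; you can also write \s, which matches any whitespace character (tabs and newlines included).
The character classes used are: \w for word characters (letters, digits, and underscore), \d for decimal digits, and \W for non-word characters like '&'. Combined inside [\s\w\d\W], they match any character, so the pattern removes an optional leading space plus everything from the first '(' to the last ')', and leaves hyphens and spaces elsewhere in the word untouched.
Lastly, all words are converted to lowercase with the lower() method.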
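
To make the behaviour easier to check, here is a minimal standalone sketch of the same substitution. The helper name strip_parenthetical and the sample entries are made up for illustration (they are not taken from collect_tagalog.py or from the site); only the pattern itself comes from the line above.

```python
import re

# Same pattern as in the modified line, written as a raw string so the
# backslashes reach the regex engine unchanged.
PAREN_PATTERN = re.compile(r'[ ]?\(+?[\s\w\d\W]+\)')


def strip_parenthetical(raw_word):
    """Remove a parenthesised part (plus one leading space), then lowercase."""
    return PAREN_PATTERN.sub('', raw_word).lower()


# Hypothetical entries, assumed to resemble what the scraper sees.
samples = ["Bahay-kubo (noun)", "mag-aral", "Araw ng Kalayaan (holiday)"]
for sample in samples:
    print(repr(sample), "->", repr(strip_parenthetical(sample)))
```

Hyphens and inner spaces survive because the pattern only matches text that starts at a '(' (with an optional space before it) and ends at a ')'.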

Check RegExr's RegEx reference for more information about Regular Expressions.

Here is the result of the scraped words using the modified regex above: tagalog_dict.txt

You can adjust the regex to get a result that is to your liking.
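
As one possible adjustment (just a sketch, not something the scraper does today): the pattern above is greedy, so if an entry ever contains two parenthesised parts, everything between the first '(' and the last ')' is removed. A pattern such as \s*\([^)]*\) removes each parenthesised part separately instead. The names GREEDY, PER_GROUP and clean, and the sample string, are hypothetical.

```python
import re

# Pattern from the modified line: greedy, spans first '(' to last ')'.
GREEDY = re.compile(r'[ ]?\(+?[\s\w\d\W]+\)')

# Alternative (an assumption, adjust as needed): removes each "( ... )"
# group on its own, together with any whitespace directly before it.
PER_GROUP = re.compile(r'\s*\([^)]*\)')


def clean(raw_word, pattern):
    """Apply the given pattern and lowercase the result."""
    return pattern.sub('', raw_word).lower()


entry = "a (b) c (d)"  # made-up entry with two parenthesised parts
print(repr(clean(entry, GREEDY)))
print(repr(clean(entry, PER_GROUP)))
```

Which variant is better depends on whether text between two parenthesised parts should be kept in the dictionary.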

Hyphens and spaces omitted in scraped words #3