raymelon / tagalog-dictionary-scraper

Builds a Tagalog dictionary by collecting Tagalog words from tagalog.pinoydictionary.com
GNU General Public License v3.0
26 stars, 15 forks

Hyphens and spaces omitted in scraped words #3

Open dkalantzi opened 4 months ago

dkalantzi commented 4 months ago

Hello,

Thank you for this very useful resource. I've noticed two potential issues with the words in the scraped data: hyphens and spaces are omitted.

Kind regards, Dimitra

nikitimi commented 7 hours ago

Hi @dkalantzi, I'm building an application like Duolingo, but for the native dialects here in the Philippines. I came across this Tagalog web scraper by @raymelon, so here is a response regarding your issue:

You can change the regular expression on line 161 of collect_tagalog.py

https://github.com/raymelon/tagalog-dictionary-scraper/blob/37b6c0befd023b93dc8b7eee4a866ae6d21a0f14/collect_tagalog.py#L161

with:

all_words.append(re.compile(r'[ ]?\(+?[\s\w\d\W]+\)').sub('', word.next.next).lower())
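
To see what this substitution does on its own, here is a minimal sketch that applies the same pattern to a few made-up strings. The sample entries and the helper name `strip_parenthetical` are hypothetical and only for illustration; they are not taken from the scraper or from the site's actual markup.

```python
import re

# Same pattern as in the suggested line, written as a raw string so the
# backslashes reach the regex engine unchanged.
PAREN_NOTE = re.compile(r'[ ]?\(+?[\s\w\d\W]+\)')


def strip_parenthetical(raw):
    """Remove a parenthesised note (and one leading space), then lowercase."""
    return PAREN_NOTE.sub('', raw).lower()


# Hypothetical sample entries, assumed to resemble scraped headwords.
samples = ["Araw-araw (adverb)", "Bahay kubo", "Mag-aral (verb)"]
cleaned = [strip_parenthetical(s) for s in samples]
print(cleaned)  # hyphens and spaces inside the words are preserved
```

Note that `[\s\w\d\W]+` matches any character and is greedy, so if an entry contains several parenthesised notes, everything from the first `(` to the last `)` is removed in one go.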

Check RegExr's regex reference for more information about regular expressions.

Here is the result of scraping with the modified regex above: tagalog_dict.txt

You can adjust the regex to get a result to your liking.