Open dkalantzi opened 4 months ago
Hi @dkalantzi, I'm building an application like Duolingo, but for the native dialects here in the Philippines. I came across this Tagalog web scraper by @raymelon, so here is my response regarding your issue:
You can change the regular expression on line 161 of collect_tagalog.py to:
all_words.append(re.compile(r'[ ]?\(+?[\s\w\d\W]+\)').sub('', word.next.next).lower())
`\s` is for whitespace, `\w` is for word characters, `\d` is for decimal digits, and `\W` is for non-word characters like '&'. The `.lower()` method lowercases each scraped word.
Check RegExr's RegEx reference for more information about regular expressions.
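To see what the pattern does outside the scraper, here is a minimal sketch. The sample strings and the names `PAREN_NOTE` and `clean_word` are made up for illustration; they are not taken from collect_tagalog.py, and the real input there comes from `word.next.next`.

```python
import re

# Same pattern as in the line above, written as a raw string so the
# backslashes reach the regex engine unchanged.
PAREN_NOTE = re.compile(r'[ ]?\(+?[\s\w\d\W]+\)')


def clean_word(raw):
    """Remove a parenthesised note (and one leading space), then lowercase."""
    return PAREN_NOTE.sub('', raw).lower()


# Hypothetical sample entries, not real scraper output.
samples = ["Aba (interjection)", "Bahay", "Kain (root & verb)"]
cleaned = [clean_word(s) for s in samples]
print(cleaned)  # ['aba', 'bahay', 'kain']
```

Note that `[\s\w\d\W]` matches any character and `+` is greedy, so the match runs from the first opening parenthesis to the last closing one. An entry such as "a (x) b (y)" would come out as "a", which may or may not be what you want.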
Here are the words scraped with the modified regex above: tagalog_dict.txt
You can adjust the regex further to get a result that is to your liking.
Hello,
Thank you for this very useful resource. I've noticed two potential issues with the words in the scraped data:
Kind regards, Dimitra