Let us enrich our dictionary!
Collects Tagalog words from tagalog.pinoydictionary.com, a database of Tagalog words powered by Cyberspace.ph Web Hosting. This script uses a common web scraping technique known as HTML parsing.
See the word list at tagalog_dict.txt
Served through GitHub Pages, the scraped words are accessible as REST resources.
Host
https://raymelon.github.io/tagalog-dictionary-scraper/
Method
GET
Resources Available
Resource | Display | Endpoint |
---|---|---|
csv | default | /tagalog_dict.csv |
csv | with lines | /tagalog_dict_lines.csv |
json | default | /tagalog_dict.json |
json | with lines | /tagalog_dict_lines.json |
txt | default | /tagalog_dict.txt |
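
As a rough illustration of consuming these resources, the sketch below builds a resource URL from the host and endpoint listed above and downloads it with a plain GET. The helper names `resource_url` and `fetch_resource` are hypothetical and not part of this project, and decoding the response as UTF-8 is an assumption.

```python
from urllib.request import urlopen

HOST = "https://raymelon.github.io/tagalog-dictionary-scraper/"

# (resource, display) -> endpoint, copied from the table above.
ENDPOINTS = {
    ("csv", "default"): "/tagalog_dict.csv",
    ("csv", "with lines"): "/tagalog_dict_lines.csv",
    ("json", "default"): "/tagalog_dict.json",
    ("json", "with lines"): "/tagalog_dict_lines.json",
    ("txt", "default"): "/tagalog_dict.txt",
}


def resource_url(resource, display="default"):
    """Return the full URL for a resource (hypothetical helper)."""
    # Concatenate instead of using urljoin: an endpoint with a leading
    # slash would otherwise replace the repository path in the host URL.
    return HOST.rstrip("/") + ENDPOINTS[(resource, display)]


def fetch_resource(resource, display="default"):
    """GET a resource and return its body as text (UTF-8 is assumed)."""
    with urlopen(resource_url(resource, display), timeout=30) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access.
print(resource_url("json", "with lines"))
```

Calling `fetch_resource("txt")` would then return the word list as a single string, provided the host is reachable.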
Each web page is loaded and parsed, and the words enclosed in <h2 class='word-entry'> tags are extracted.
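
The extraction step can be sketched roughly as follows. This is not the project's own code: the script installs beautifulsoup4, whereas this sketch swaps in the standard library's `html.parser` so it runs without extra packages. The `WordEntryParser` class, the `extract_words` helper, and the sample markup are all hypothetical; the sample is made up and not copied from the site.

```python
from html.parser import HTMLParser


class WordEntryParser(HTMLParser):
    """Collect the text inside every <h2 class="word-entry"> element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_entry = False  # True while inside a matching <h2>
        self._buffer = []       # text fragments of the current entry

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "word-entry" in classes:
            self._in_entry = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_entry:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_entry:
            word = "".join(self._buffer).strip()
            if word:
                self.words.append(word)
            self._in_entry = False


def extract_words(html):
    """Return the words found in word-entry headings, in page order."""
    parser = WordEntryParser()
    parser.feed(html)
    parser.close()
    return parser.words


# Made-up sample markup for illustration only.
SAMPLE_HTML = """
<div><h2 class='word-entry'><a href="#">aba</a></h2><p>definition</p></div>
<div><h2 class='word-entry'><a href="#">abaka</a></h2></div>
<h2 class='other'>not a word</h2>
"""

print(extract_words(SAMPLE_HTML))  # ['aba', 'abaka']
```

Text nested inside child elements of the heading, such as a link, is still collected, while headings with a different class are skipped.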
Included is a tagalog.pinoydictionary.com html snippet containing the source of http://tagalog.pinoydictionary.com/list/a/, to serve as a point of reference for how dictionary words are extracted from the page.
Disclaimer:
I do not own the html code cited above; it is owned by tagalog.pinoydictionary.com.
The main purpose of this project is to build a Tagalog dictionary database for Scrabble®, though it may serve other uses.
python -m pip install -U pip beautifulsoup4
python -m pip install -U pip requests-futures
Run collect_tagalog.py. The collected words are written to tagalog_dict.txt.
Adjust the max_workers value to match the CPU and network capacity of the environment. See the comment for estimated values and expected download rates.
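
To show what the max_workers setting controls, here is a minimal sketch using the standard library's `ThreadPoolExecutor`, the same kind of thread pool that requests-futures builds on. It is not the project's script: `fetch` and `fetch_all` are hypothetical helpers, and the per-letter URL pattern is an assumption extrapolated from the single `/list/a/` URL mentioned in this README.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Assumed pattern, extrapolated from http://tagalog.pinoydictionary.com/list/a/
LIST_URL = "http://tagalog.pinoydictionary.com/list/{letter}/"


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, max_workers=4, fetcher=fetch):
    """Fetch every URL with at most max_workers requests in flight.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetcher, urls))


# Demonstration with a stub fetcher, so no network access is needed.
urls = [LIST_URL.format(letter=letter) for letter in "abk"]
pages = fetch_all(urls, max_workers=2, fetcher=lambda url: "<html>" + url + "</html>")
print(len(pages))  # 3
```

A larger max_workers raises the number of simultaneous downloads, which helps only until the CPU, the network link, or the remote server becomes the bottleneck.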