sooluh / kodepos

:postbox: Indonesian postal code search API by place name, village or city/regency.
https://kodepos.vercel.app
Apache License 2.0
89 stars 24 forks

Reporting data provenance #40

Open pandu-supriyono opened 2 weeks ago

pandu-supriyono commented 2 weeks ago

Assalamualaikum warahmatullahi wabarakatuh

Thank you for the library, I'm considering trying it out for a prototype.

One of the criteria that we validate prototypes on is data reliability, accuracy and maintenance.

Do you document anywhere where your data comes from?

I see it used to be scraped from Direktorikodepos. Am I right to assume that it is now statically scraped from this website?

I'm happy to help if you need any assistance developing a data provenance strategy.

sooluh commented 2 weeks ago

وعليكم السلام ورحمةاللّٰه وبركاته

First of all, thank you for your interest in this library project. This is an interesting question, and if you don't mind, we will convert this issue to a discussion and pin it so that this information can be easily accessed by many people.

Indeed, we initially used the concept of scraping to retrieve postal code data from existing websites such as carikodepos.com, nomorkodepos.com, and others. However, various issues started to arise, including returned data being an empty array (#19), returned data being null (#27), inconsistent responses (#29), and even internal server errors.

Because of these issues, we eventually decided to provide the data statically and update it periodically (whenever the government publishes data updates). We began looking for authoritative data sources.

We found kodepos.posindonesia.co.id, which provides all postal code data in PDF format. We manually parsed it whenever we had free time, and the turning point was when an issue (which has now become a discussion) arose in #33. The websites we used for data retrieval went down, and we couldn't find similar websites anymore. We then rushed to complete all the data and committed it in #de11b82.

For coordinates (latitude and longitude) and elevation, we retrieved and synchronized the data programmatically with data from www.opentopodata.org (the script for this is currently missing). For time zones, we matched them with provincial data that can be found on the internet, including:

That's all I can share. Thank you.
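
For readers who want to reproduce the elevation lookup mentioned above, the following is a minimal sketch only. It is not the project's original script (which is missing). It assumes the public Open Topo Data endpoint at `api.opentopodata.org/v1` and the `srtm30m` dataset; the function names are hypothetical, and the dataset actually used for kodepos is not stated in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public Open Topo Data API; kodepos may have used another dataset.
API_BASE = "https://api.opentopodata.org/v1"


def build_elevation_url(points, dataset="srtm30m"):
    """Build a query URL for a list of (lat, lng) pairs."""
    # The API expects "lat,lng" pairs separated by "|".
    locations = "|".join(f"{lat},{lng}" for lat, lng in points)
    return f"{API_BASE}/{dataset}?{urlencode({'locations': locations})}"


def parse_elevations(payload):
    """Extract elevations (in metres, or None) from a decoded API response."""
    if payload.get("status") != "OK":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    return [result["elevation"] for result in payload["results"]]


def fetch_elevations(points, dataset="srtm30m"):
    """Query the API over the network and return one elevation per point."""
    with urlopen(build_elevation_url(points, dataset)) as response:
        return parse_elevations(json.load(response))
```

Calling `fetch_elevations([(-6.2, 106.8)])` would perform one request for a single point. The public API is rate limited, so a full synchronization would need to batch points per request and pause between calls.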

pandu-supriyono commented 1 week ago

Thank you for your reply. I have continued this thread as a discussion. Would it make sense to close this issue and continue in #42?