samayo / country-json

A simple data of the world by country each in JSON format.
https://data.world/samayo
MIT License
1.06k stars 801 forks

Discussion #229

Closed samayo closed 2 months ago

samayo commented 2 years ago

Hello, this is to discuss a new major change to the repo.

I am trying to remove most countries not recognized by the UN. Currently, there are 248 countries in this repo, but the UN recognizes only 193 of them, so this will be a big change.

Other than that, I will fill in all data for each country (so no null or empty values).

All data will also be automated (updated each week whenever something changes in a source such as Wikipedia).

Let me know if you would like this repo to cover only the UN-recognized countries.

jezmck commented 1 year ago

I think that list is a valid requirement for some people, but not everyone.

I'd add that as a new list within this repo.

iamdoubz commented 1 year ago

Just add another file in src called "recognized-un-country.json" with a 1 or 0 value. This will keep the existing structure and push the responsibility to the person(s) creating their application. Hope this helps.

samayo commented 1 year ago

> Just add another file in src called "recognized-un-country.json" with a 1 or 0 value. This will keep the existing structure and push the responsibility to the person(s) creating their application. Hope this helps.

Thanks, but I don't think it would be nice to keep the existing structure. Some src files have more entries than others. The idea is for all files to contain all 193 countries in the same order, so if you want to get multiple data of one country from all files, it would be very convenient.
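To illustrate why a shared order is convenient, here is a minimal sketch that merges records from two files position by position. The two lists are hand-written stand-ins for src files, and the field names `city` and `currency_name` are assumptions about the file layout, not confirmed here.

```python
import json

# Hand-written stand-ins for two src files that list the same countries
# in the same order. Field names are assumed for illustration.
capitals = json.loads(
    '[{"country": "Austria", "city": "Vienna"},'
    ' {"country": "Belgium", "city": "Brussels"}]'
)
currencies = json.loads(
    '[{"country": "Austria", "currency_name": "Euro"},'
    ' {"country": "Belgium", "currency_name": "Euro"}]'
)


def merge_by_position(*datasets):
    """Merge per-country records from lists that share one country order."""
    lengths = {len(d) for d in datasets}
    if len(lengths) != 1:
        raise ValueError(f"files have different lengths: {sorted(lengths)}")
    merged = []
    for records in zip(*datasets):
        names = {r["country"] for r in records}
        if len(names) != 1:
            # The files disagree about which country sits at this position.
            raise ValueError(f"files are out of step: {sorted(names)}")
        combined = {}
        for record in records:
            combined.update(record)
        merged.append(combined)
    return merged


merged = merge_by_position(capitals, currencies)
print(json.dumps(merged[0]))
```

If the files did not share an order, the same merge would need a lookup keyed by country name instead of a plain `zip`.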

kennarddh commented 11 months ago

Is this data currently scraped from Wikipedia? If yes, is it automated?

samayo commented 11 months ago

Yes, it's mostly scraped from Wikipedia. Automating the process has been the goal for a long time, but I can't find much time, which is why it's not implemented.

kennarddh commented 11 months ago

I can implement the automation, but I still don't understand the Wikipedia data.

https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

On this Wikipedia page, what is the difference between a numbered country and a "-" country?


samayo commented 11 months ago

> Note: A numbered rank is assigned to the 193 member states of the United Nations, plus the two observer states to the United Nations General Assembly. Dependent territories and constituent countries that are parts of sovereign states are not assigned a numbered rank

So the numbered ones are officially recognised countries, and the unnumbered ones are dependencies or disputed, like Taiwan for example.

This repo should focus only on recognised countries.

kennarddh commented 11 months ago

@samayo should this repo include the two observer states?

samayo commented 11 months ago

Yes I think that would be ok

kennarddh commented 11 months ago

Proposed Changes

Update

Removed

Added

Changed

Source

Note

samayo commented 11 months ago

Great points, thanks for all the help so far. You are making this easy, even if I want to implement it myself.

Some notes:

> Maybe we can use object like { 3LetterCountryCode: data } instead of [{ country: name, data: data }] to reduce size.

Let's leave the above as is for now, because I don't see a reason to change that.

I'm OK with removing the alphabet letters but not the country names. Remember that there are many websites and games that need to display just country names for some reason.

Other than those remarks everything else is a great idea
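For reference, the two shapes being compared can be measured with a rough sketch like the one below. The records are illustrative only; "AUT" and "BEL" are ISO 3166-1 alpha-3 codes, and the `data` key is just the placeholder name used in the proposal.

```python
import json

# Shape currently used: a list of objects that carry the country name.
as_list = [
    {"country": "Austria", "data": "Vienna"},
    {"country": "Belgium", "data": "Brussels"},
]

# Proposed shape: one object keyed by 3-letter country code.
as_object = {"AUT": "Vienna", "BEL": "Brussels"}


def compact_size(value):
    """Length in characters of the minified JSON for `value`."""
    return len(json.dumps(value, separators=(",", ":")))


list_size = compact_size(as_list)
object_size = compact_size(as_object)
print(list_size, object_size)

# The list shape still gives display names for free, which the keyed
# shape does not unless a separate code-to-name file is shipped.
display_names = [record["country"] for record in as_list]
```

The keyed shape is smaller, but it drops the human-readable names that sites and games rely on, which is the trade-off raised in the notes above.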

samayo commented 11 months ago

Btw, I was recently thinking of giving ChatGPT the Wikipedia section that contains the data and asking it to generate the Python code to convert the data from HTML to JSON, then using that script every month to look for updates.

The script would be written in Python with Scrapy. I have an unfinished version of it on my local machine.

Once ChatGPT creates the script and it works, we upload the script to a server and run it with a cron job every month to scrape and send a PR.

That's what I thought initially. Feel free to build on the idea or provide your own.
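As a rough idea of the HTML-to-JSON step, here is a minimal sketch that uses only Python's standard `html.parser` instead of Scrapy. The `SAMPLE` table is hand-written for illustration, and `TableParser` and `table_to_records` are hypothetical names; a real Wikipedia table would need extra handling for footnotes, merged cells, and nested tables.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Turn a simple HTML table into a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


SAMPLE = """
<table>
  <tr><th>country</th><th>capital</th></tr>
  <tr><td><a href="/wiki/Austria">Austria</a></td><td>Vienna</td></tr>
  <tr><td>Belgium</td><td>Brussels</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
print(json.dumps(records, indent=2))
```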

kennarddh commented 11 months ago

I haven't used Python since 2020, so I can't implement it in Python.

Now I mostly use TypeScript with Node.js.

Instead of a VPS, we can use GitHub Actions.

For country names, can we just use an array? Like [ "A", "B" ].

Can we move this repo to a new organization so I can add an SDK if I can?

kennarddh commented 11 months ago

Where did you find the flag svg @samayo?

jezmck commented 11 months ago

Wherever they are, they 100% need to go through SVGO or a similar compressor.

kennarddh commented 11 months ago

I can't find the SVG source to scrape. The image HTML is always unstructured.

samayo commented 11 months ago

It's from Wikipedia. Check each country's flag page, it will have SVG format

kennarddh commented 11 months ago

@samayo https://en.wikipedia.org/wiki/File:Flag_of_the_United_States_(DoS_ECA_Color_Standard).svg Does this still need to go through SVGO?

samayo commented 11 months ago

I don't understand your question. You will find a .svg file on every Wikipedia page and that must be converted to base64 format. We store in this repo a base64 representation of the svg

kennarddh commented 11 months ago

Isn't SVGO for optimizing SVGs?

samayo commented 11 months ago

You still need to right click on the flag and select "open image in new tab..." Then you will see this URL

https://upload.wikimedia.org/wikipedia/commons/a/a9/Flag_of_the_United_States_%28DoS_ECA_Color_Standard%29.svg

That is the SVG. The one you linked is an HTML page.
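For a scraper, the right-click step could be replaced by building a link from the file name. The sketch below assumes the flag is hosted on Wikimedia Commons and relies on MediaWiki's `Special:FilePath` page, which redirects to the actual file; `direct_file_url` is a hypothetical helper name.

```python
from urllib.parse import quote, unquote, urlsplit


def direct_file_url(file_page_url):
    """Map a /wiki/File:<name> page URL to a Special:FilePath link.

    Assumes the file lives on Wikimedia Commons; Special:FilePath then
    redirects to the real upload.wikimedia.org address.
    """
    path = unquote(urlsplit(file_page_url).path)
    marker = "/wiki/File:"
    if not path.startswith(marker):
        raise ValueError(f"not a File: page URL: {file_page_url}")
    filename = path[len(marker):]
    return (
        "https://commons.wikimedia.org/wiki/Special:FilePath/"
        + quote(filename, safe="()")
    )


page = (
    "https://en.wikipedia.org/wiki/"
    "File:Flag_of_the_United_States_(DoS_ECA_Color_Standard).svg"
)
svg_url = direct_file_url(page)
print(svg_url)
```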

kennarddh commented 11 months ago

What is svgo for?

jezmck commented 11 months ago

That one is actually okay, but some flags are massive files. Try https://jakearchibald.github.io/svgomg/ on the more complex flags.

SVGOMG is just a GUI for SVGO.

kennarddh commented 11 months ago

Ok so wikipedia -> svgo optimize -> base64?
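The last step of that pipeline could look roughly like this. The sketch only covers the base64 encoding; the SVGO pass is a Node.js tool and would run before it. `SAMPLE_SVG` is a made-up minimal SVG, not a real flag, and the data URI form is an assumption about how a consumer might use the stored string.

```python
import base64

# A made-up minimal SVG standing in for an already optimized flag file.
SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="3" height="2">'
    '<rect width="3" height="2" fill="#fff"/></svg>'
)


def svg_to_base64(svg_text):
    """Return the base64 representation of an SVG document as ASCII text."""
    return base64.b64encode(svg_text.encode("utf-8")).decode("ascii")


encoded = svg_to_base64(SAMPLE_SVG)

# Assumption: consumers may want a data URI they can drop into <img src>.
data_uri = "data:image/svg+xml;base64," + encoded
print(data_uri[:40])
```

Because base64 grows the payload by about a third, running SVGO first matters most for the complex flags mentioned above.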

kennarddh commented 11 months ago

@samayo Can I make the scraper with TypeScript instead of Python?

samayo commented 11 months ago

I highly suggest Python so I can contribute too, but you decide. Where do you plan to host the script? Here or on your own GitHub page?

kennarddh commented 11 months ago

@samayo If you want to use python I can't contribute.

kennarddh commented 11 months ago

@samayo If you don't want TypeScript here, I can create a new repo so I can use it there instead.

samayo commented 11 months ago

I think I will give it a shot, and you can also go ahead and try. We can use one or the other, or both. I am happy to get a regular PR from anyone.

kennarddh commented 11 months ago

OK, I will create a PR later with TypeScript.

kennarddh commented 11 months ago

@samayo Where to get Geo Coordinates?

kennarddh commented 11 months ago

255

kennarddh commented 11 months ago

@samayo Should the data include the country even when its data is null?

Like [{ country: 'x', data: null }]. Do we need to include this?

kennarddh commented 11 months ago

@samayo any update for my previous question?

samayo commented 11 months ago

> @samayo Where to get Geo Coordinates?

All from Wikipedia.

samayo commented 11 months ago

> @samayo should the data include the country even though the data is null
>
> Like [{ country: 'x', data: null }] do we need to include this?

Yes, we should definitely add the country. We can use null, none, false or 0. You can pick any format as long as it is consistent. I prefer null, since 0 could be confused with real data.
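A minimal sketch of that rule: every file lists every country in the master order, and a country without a value gets null. The country list, the `city` field name, and the `fill_missing` helper are all illustrative assumptions.

```python
import json

# Hypothetical master order shared by every file.
ALL_COUNTRIES = ["Austria", "Belgium", "Chile"]

# A scraped result that is out of order and has no entry for Chile.
scraped = [
    {"country": "Belgium", "city": "Brussels"},
    {"country": "Austria", "city": "Vienna"},
]


def fill_missing(records, all_countries, value_key):
    """Return one record per country, in master order, using None for gaps."""
    by_name = {record["country"]: record for record in records}
    return [
        by_name.get(name, {"country": name, value_key: None})
        for name in all_countries
    ]


complete = fill_missing(scraped, ALL_COUNTRIES, "city")
# json.dumps writes Python's None as JSON null.
print(json.dumps(complete, indent=2))
```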

kennarddh commented 11 months ago

> > @samayo Where to get Geo Coordinates?
>
> all from wikipedia

Can you add the link to the Wikipedia page? I can't find it.

samayo commented 11 months ago

It seems I was wrong. It is not from Wikipedia, and the way the data is represented is not entirely optimal.

Can you use this instead? https://developers.google.com/public-data/docs/canonical/countries_csv

You can use another source. In any case, this data is unlikely to change, so you can even exclude it.

kennarddh commented 11 months ago

> It seems I was wrong, it is not from wikipedia and the way the data is represented is not entirely optimal.
>
> Can you use this instead? https://developers.google.com/public-data/docs/canonical/countries_csv
>
> You can use other source. In any case, this data is unlikely to change so you can even exclude it

That data is different from this: https://github.com/samayo/country-json/blob/0c522ea1e7ae88e9a2dd979322fbf8c2814b0de6/src/country-by-geo-coordinates.json

The Google data doesn't have west, south, north, and east values.

samayo commented 11 months ago

It's fine. We can use whatever is close enough, and if there is a need to improve it, we can do that later.

kennarddh commented 11 months ago

@samayo I have a problem: the same country has different names on different Wikipedia pages.

For example

We can compare the URLs, but I don't know whether they will still be different. I'll try.

Edit 1

Edit 2

samayo commented 11 months ago

I don't know about the redirect issue, but if you found the solution then it's good. About country names being different on different pages, I think we have to use custom code logic for that, e.g., if (countryName == "Netherlands, Kingdom of the") { countryName = "Netherlands"; }

kennarddh commented 11 months ago

> I don't know about the redirect issue, but if you found the solution then it's good. About country names being different on different pages, I think we have to use custom code logic for that, e.g., if (countryName == "Netherlands, Kingdom of the") { countryName = "Netherlands"; }

If we use an if like that, the automation will break when the Wikipedia page is updated, and there are so many aliases.
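One way to keep such special cases manageable is a single alias table instead of scattered if statements. This is only a sketch: the alias entries and the names `ALIASES`, `canonical_name`, and `unknown_names` are hypothetical, and the table would still need maintenance when pages change.

```python
# Lower-cased alias -> canonical name. Entries here are illustrative only.
ALIASES = {
    "netherlands, kingdom of the": "Netherlands",
    "kingdom of the netherlands": "Netherlands",
}


def canonical_name(raw, aliases=ALIASES):
    """Normalise whitespace and case, then map known aliases to one name."""
    cleaned = " ".join(raw.split())
    return aliases.get(cleaned.casefold(), cleaned)


def unknown_names(names, known):
    """List scraped names that match no known country after normalisation."""
    return [n for n in names if canonical_name(n) not in known]


KNOWN = {"Netherlands", "Austria"}
scraped = ["Netherlands,  Kingdom of the", "Austria", "Austria (Republic)"]

print([canonical_name(n) for n in scraped])
print(unknown_names(scraped, KNOWN))
```

Reporting unknown names makes a page change show up as a visible failure in the automated run rather than as silently wrong data.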

samayo commented 11 months ago

That's unlikely. I don't know what you are planning to use, but with Python and pandas you can find the table you are looking for very easily.

Take a look at this: https://medium.com/analytics-vidhya/web-scraping-a-wikipedia-table-into-a-dataframe-c52617e1f451

From step 5, it is very easy to get all the tables in the page and target the table you need.

So as far as I can tell, it's unlikely that any changes will break it.

kennarddh commented 11 months ago

I can resolve the redirect issue using this https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bredirects api.
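A rough sketch of how the linked `prop=redirects` module might feed an alias table. The request parameters follow the MediaWiki Action API; the `SAMPLE_RESPONSE` below is hand-written to show the expected shape, with a placeholder page key and illustrative redirect titles, and is not real API output. No network call is made.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def redirect_query_url(titles):
    """Build a query URL asking which pages redirect to the given titles."""
    params = {
        "action": "query",
        "prop": "redirects",
        "titles": "|".join(titles),
        "rdlimit": "max",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def alias_table(response):
    """Map each redirect title to the title of the page it points at."""
    table = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for redirect in page.get("redirects", []):
            table[redirect["title"]] = page["title"]
    return table


# Hand-written illustration of the response shape (placeholder values).
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "1": {
                "title": "Netherlands",
                "redirects": [
                    {"title": "The Netherlands"},
                    {"title": "Nederland"},
                ],
            }
        }
    }
}

url = redirect_query_url(["Netherlands"])
aliases = alias_table(SAMPLE_RESPONSE)
print(url)
print(aliases)
```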

kennarddh commented 11 months ago

@samayo In this wikipedia page https://en.wikipedia.org/wiki/List_of_country_calling_codes

Some countries have 2 or more codes


Which code should we include?

The old JSON just concatenates the 1 with 939 and ignores 787:

{
  "country": "Puerto Rico",
  "calling_code": 1939
},

samayo commented 10 months ago

@kennarddh We have to use both, separated by a comma. If you have better ideas, let me know. Thanks

kennarddh commented 10 months ago

> @kennarddh We have to use both separated by a comma, if you have better ideas let me know Thanks

We can use something like this

{
  "country": "example",
  "data": [1787, 1939]
}

samayo commented 10 months ago

@kennarddh looks good to me
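A sketch of how a table cell could be turned into that array. It assumes the cell text looks like "+1 (787, 939)" or "+44"; the exact Wikipedia cell format is an assumption, and `parse_calling_codes` is a hypothetical helper name.

```python
import re


def parse_calling_codes(cell):
    """Expand a cell such as "+1 (787, 939)" into a list of full codes."""
    text = cell.replace("+", "").strip()
    match = re.fullmatch(r"(\d+)\s*(?:\(([\d,\s]+)\))?", text)
    if match is None:
        raise ValueError(f"unrecognised calling code cell: {cell!r}")
    prefix, suffixes = match.groups()
    if suffixes is None:
        # A plain country code with no area-code list.
        return [int(prefix)]
    # Prepend the country code to each listed area code.
    return [int(prefix + suffix) for suffix in re.findall(r"\d+", suffixes)]


record = {
    "country": "Puerto Rico",
    "calling_code": parse_calling_codes("+1 (787, 939)"),
}
print(record)
```

Cells that do not match the assumed pattern raise an error, so unusual formats are flagged for review instead of being silently concatenated.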

kennarddh commented 10 months ago

@samayo Russia seems to have a range of codes?


{
  "country": "russia",
  "data": [71, 72, 73, 74, 75, 78, 79]
}

Is this right?