open-numbers / ddf--unpop--world_population_prospects


Split up datapoint files per country #1

Closed jheeffer closed 8 years ago

jheeffer commented 8 years ago

Right now the datapoint files are > 100MB, which is above GitHub's file size limit.

Oh yes, two datapoint files, 127MB and 293MB. GitHub doesn't allow pushing such big files. We can try git-lfs (https://git-lfs.github.com/), but it is a paid service for usage > 1GB: https://help.github.com/articles/billing-plans-for-git-large-file-storage/

https://help.github.com/articles/what-is-my-disk-quota/ https://help.github.com/articles/working-with-large-files/

Let's split the datapoints up into smaller files, one per geo. We need this anyway so the DDFcsv reader doesn't have to read a massive file covering all geos just to show a single geo.

File names should follow the pattern ddf--datapoints--<indicator>--by--geo-<geo>--<other dimensions>.csv

e.g.

ddf--datapoints--population--by--geo-usa--gender--age--year.csv
ddf--datapoints--population--by--geo-swe--gender--age--year.csv
ddf--datapoints--population--by--geo-chn--gender--age--year.csv
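For illustration, a minimal splitting sketch in Python with pandas; the input file name and output directory here are assumptions, not taken from this repo:

```python
import os
import pandas as pd

# Hypothetical input: one big datapoints file keyed by geo, gender, age, year
df = pd.read_csv("ddf--datapoints--population--by--geo--gender--age--year.csv")

os.makedirs("split", exist_ok=True)
for geo, group in df.groupby("geo"):
    # One file per geo, following the naming pattern above
    name = f"ddf--datapoints--population--by--geo-{geo}--gender--age--year.csv"
    group.to_csv(os.path.join("split", name), index=False)
```

Each per-geo file should then stay far below the 100MB limit.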

semio commented 8 years ago

But for this dataset we use a country code for the geo entity, see the entity file https://github.com/semio/ddf--unpop--wpp_pop/blob/master/ddf--entities--country_code.csv

Is it ok to name them based on the country code? Such as:

ddf--datapoints--population--by--country_code-926--year--gender--age.csv

jheeffer commented 8 years ago

Oh yeah, of course, sorry, I didn't think of the specifics of this dataset. Of course, use the entity domain and ids of this dataset, you're right : ).

Later on these might be moved to Systema Globalis, and at that point we will translate the domains and ids.

semio commented 8 years ago

OK, I will update the repo.

I am thinking about how to integrate this with recipes. If we split the datapoints, then the index will have many entries for the same datapoint, each with a different key.

key           value       file
geo-usa,time  population  ddf--datapoints--population--by--geo-usa--time.csv
geo-swe,time  population  ddf--datapoints--population--by--geo-swe--time.csv
...           ...         ...

Is this correct? If the index looks like this, how should we write the recipe to create an ingredient of population by geo, time? I can think of two options:

  1. Allow setting multiple key filters for a datapoint ingredient. I don't think this is acceptable, because there are a few hundred geos here, so there would be far too much to write in the recipe.
  2. Keep the current form of ingredient, use "geo,time" as the key filter, and add a function for chef to read all files that have geo and time in their keys. This way is better, but then the key in the ingredient doesn't exist in the index (see the sketch below).
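To make option 2 concrete, here is a rough sketch of how a reader could match the requested key against the split index. The column names ('key', 'value', 'file') follow the table above, but the load_datapoints function and the matching rule are illustrative assumptions, not the actual chef API:

```python
import pandas as pd

def load_datapoints(index: pd.DataFrame, key: str, value: str) -> pd.DataFrame:
    """Read and concatenate every file whose key columns cover the request.

    `index` is assumed to have 'key', 'value' and 'file' columns, where an
    index key like 'geo-usa,time' is treated as an instance of 'geo,time'.
    """
    wanted = set(key.split(","))

    def covers(index_key: str) -> bool:
        # Strip entity ids: 'geo-usa' -> 'geo', 'time' -> 'time'
        cols = {part.split("-")[0] for part in index_key.split(",")}
        return cols == wanted

    rows = index[(index["value"] == value) & index["key"].map(covers)]
    return pd.concat(
        (pd.read_csv(path) for path in rows["file"]), ignore_index=True
    )
```

Calling load_datapoints(index, "geo,time", "population") would then concatenate the usa, swe, ... files back into a single ingredient.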

What do you think?

semio commented 8 years ago

Another question:

The validate-ddf tool reports:

{"id":"WRONG_INDEX_KEY","type":"Wrong Index key","path":".","data":["country_code-100","country_code-100","country_code-104","country_code-104","country_code-108","country_code-108","country_code-112","country_code-112", ...

It is because it can't find these in ddf--concepts, but these are entity ids, not concepts, so I think we don't need them in the concepts file, right?

jheeffer commented 8 years ago

So the key columns of the table stay geo,time, and thus in the index file geo,time will still be the key as well.

My initial idea was to use DDFQL filters in the index to specify which more fine-grained subset of datapoints can be found in each file.

https://docs.google.com/document/d/1aynARjsrSgOKsO1dEqboTqANRD1O9u7J_xmxy8m5jW8/edit#heading=h.bfg8rjfkj6it

What do you think of that?

key       value       filter          file
geo,time  population  { geo: "usa" }  ddf--datapoints--population--by--geo-usa--time.csv
geo,time  population  { geo: "swe" }  ddf--datapoints--population--by--geo-swe--time.csv

This also allows for much more complex splitting than just per-geo, but it also makes reading the index more complex.
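To illustrate, a sketch of how a reader might use such a filter column, assuming the filters are stored as JSON objects of simple equality constraints; the files_for_query function and column names are hypothetical:

```python
import json
import pandas as pd

def files_for_query(index: pd.DataFrame, key: str, value: str, where: dict) -> list:
    """Select files whose filter does not contradict the query.

    Assumes the index's 'filter' column holds JSON objects of simple
    equality constraints, e.g. {"geo": "usa"}; a missing filter matches
    everything.
    """
    matches = []
    candidates = index[(index["key"] == key) & (index["value"] == value)]
    for _, row in candidates.iterrows():
        flt = json.loads(row["filter"]) if pd.notna(row["filter"]) else {}
        # Keep the file unless the query pins a dimension to a different value
        if all(where.get(dim, val) == val for dim, val in flt.items()):
            matches.append(row["file"])
    return matches
```

With where={"geo": "swe"} only the swe file is selected; with an empty where, every file for that key/value pair matches.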

What do you think?

semio commented 8 years ago

We use constraints in the datapackage for the splitting of datapoints. This issue should be ok to close for now.
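For reference, in a Table Schema-style datapackage such a constraint can live on a resource's field definitions; a hypothetical resource entry for one split file might look like this (the path, field names, and enum value are illustrative, not copied from the repo):

```json
{
  "path": "ddf--datapoints--population--by--country_code-926--year--gender--age.csv",
  "schema": {
    "fields": [
      {"name": "country_code", "constraints": {"enum": ["926"]}},
      {"name": "year"},
      {"name": "gender"},
      {"name": "age"},
      {"name": "population"}
    ],
    "primaryKey": ["country_code", "year", "gender", "age"]
  }
}
```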