But for this dataset we use country code as the geo entity; see the entity file: https://github.com/semio/ddf--unpop--wpp_pop/blob/master/ddf--entities--country_code.csv
Is it OK to rename based on the country code? Such as:
ddf--datapoints--population--by--country_code-926--year--gender--age.csv
Oh yeah, of course, sorry, I didn't think of the specifics of this dataset. Of course, use the entity domain and ids of this dataset, you're right : ).
Later on these might be moved to Systema Globalis, and at that point we will translate the domains and ids.
OK, I will update the repo.
I am thinking about how to integrate this with recipes. If we split the datapoints, then the index will have many entries for the same datapoint, each with a different key:
key | value | file |
---|---|---|
geo-usa,time | population | ddf--datapoints--population--by--geo-usa--time.csv |
geo-swe,time | population | ddf--datapoints--population--by--geo-swe--time.csv |
... | ... | ... |
Is this correct? If the index looks like this, how should we write the recipe to create an ingredient of population by geo and time?
What do you think?
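For illustration, here's a rough Python/pandas sketch of how a reader or recipe executor could resolve such an ingredient by globbing the split files back together. The helper name and layout are hypothetical, not part of any existing tool:

```python
import glob

import pandas as pd

def load_split_datapoints(indicator, other_dims):
    # match e.g. ddf--datapoints--population--by--geo-*--time.csv
    pattern = f"ddf--datapoints--{indicator}--by--geo-*--{'--'.join(other_dims)}.csv"
    # each split file still carries the geo key column, so a plain
    # concat restores the full "population by geo, time" table
    frames = [pd.read_csv(path) for path in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True)

population = load_split_datapoints("population", ["time"])
```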
Another question:
The validate-ddf tool reports:
{"id":"WRONG_INDEX_KEY","type":"Wrong Index key","path":".","data":["country_code-100","country_code-100","country_code-104","country_code-104","country_code-108","country_code-108","country_code-112","country_code-112", ...
It's because the tool can't find these concepts in ddf--concepts, but I think we don't need them in the concepts file, right?
So the key columns of the table stay geo,time, and thus in the index file that will still be the key as well.
My initial idea was to use DDFQL filters in the index to specify which finer-grained subset of datapoints can be found in each file.
What do you think of that?
key | value | filter | file |
---|---|---|---|
geo,time | population | { geo: "usa" } | ddf--datapoints--population--by--geo-usa--time.csv |
geo,time | population | { geo: "swe" } | ddf--datapoints--population--by--geo-swe--time.csv |
This also allows for far more complex splitting than just per-geo, but it also makes reading the index more complex.
What do you think?
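To make the trade-off concrete, here's a rough sketch of the lookup a reader would then have to do against such an index. The row layout and helper are invented for illustration:

```python
# Pick the files whose filter is consistent with a DDFQL-like query.
def files_for_query(index_rows, value, query):
    matches = []
    for key, val, flt, filename in index_rows:
        if val != value:
            continue
        # a file matches when the query doesn't contradict its filter;
        # an empty query selects every file for this value
        if all(query.get(col, v) == v for col, v in flt.items()):
            matches.append(filename)
    return matches

index = [
    ("geo,time", "population", {"geo": "usa"},
     "ddf--datapoints--population--by--geo-usa--time.csv"),
    ("geo,time", "population", {"geo": "swe"},
     "ddf--datapoints--population--by--geo-swe--time.csv"),
]

files_for_query(index, "population", {"geo": "swe"})
# -> ['ddf--datapoints--population--by--geo-swe--time.csv']
```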
We use constraints in the datapackage for splitting datapoints. This issue should be OK to close for now.
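Roughly, each split file gets its own resource entry in datapackage.json, and a constraint on the key field records which entity the file holds. A hypothetical sketch of one resource entry, using Table Schema-style field constraints (this may not be the exact shape used):

```json
{
  "name": "population-geo-usa",
  "path": "ddf--datapoints--population--by--geo-usa--time.csv",
  "schema": {
    "primaryKey": ["geo", "time"],
    "fields": [
      {"name": "geo", "constraints": {"enum": ["usa"]}},
      {"name": "time"},
      {"name": "population"}
    ]
  }
}
```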
Right now the datapoint files are > 100MB, which is above the GitHub limits:
https://help.github.com/articles/what-is-my-disk-quota/ https://help.github.com/articles/working-with-large-files/
Let's split the datapoints up into smaller files, one per geo. We need this anyway so that the DDFcsv reader doesn't have to read a massive file of all geos just to show a single geo.
File names should follow the pattern:
ddf--datapoints--<indicator>--by--geo-<geo>--<other dimensions>.csv
e.g.
ddf--datapoints--population--by--geo-usa--gender--age--year.csv
ddf--datapoints--population--by--geo-swe--gender--age--year.csv
ddf--datapoints--population--by--geo-chn--gender--age--year.csv
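For reference, a minimal pandas sketch of producing such a split, assuming the combined datapoints currently live in one big CSV (the input file name is an assumption):

```python
import pandas as pd

df = pd.read_csv("ddf--datapoints--population--by--geo--gender--age--year.csv")

# write one file per geo, following the naming pattern above;
# the geo column stays in each file, since key columns are kept
for geo, group in df.groupby("geo"):
    out = f"ddf--datapoints--population--by--geo-{geo}--gender--age--year.csv"
    group.to_csv(out, index=False)
```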