open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License

Allow multiple meta file inputs from different folders #170

Closed (brockfanning closed 4 years ago)

brockfanning commented 4 years ago

This is an attempt to allow, for example, a CSV meta input from one folder, and a YAML meta input from another folder.

This side-steps the meta.py and git.py include files, and in theory those could be deleted.

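This needs plenty of testing. In particular:

  • Does the git stuff still work? (metadata "last update date" fields getting populated automatically)
  • Does the multilingual subfolder approach still work for metadata?
  • Does the metadata mapping work?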

LucyGwilliamAdmin commented 4 years ago

@brockfanning does anything need to be changed in the data config file for this?

brockfanning commented 4 years ago

@LucyGwilliamAdmin Not that I know of.

LucyGwilliamAdmin commented 4 years ago

Do I need rows here for each language? https://github.com/LucyGwilliamAdmin/open-sdg-data-starter-1/blob/develop/open_sdg_config_sdmx.yml#L33

LucyGwilliamAdmin commented 4 years ago

I'm mainly asking because I'm trying to see whether it's possible for English xlsx meta input files to have English in the first column (the field name column) and Russian xlsx meta input files to have Russian in the first column, as long as all values are in the metadata-mapping file. I don't seem to be having much luck, though. Do you know if this should be possible or not? If not, no probs, just trying to get an idea.

brockfanning commented 4 years ago

I may be misunderstanding and/or misremembering how this works, but I believe that the non-default languages should be in subfolders. So for example if you are importing CSV metadata for a platform with Spanish as the default language, and English as a second language, you would put your Spanish metadata files in the folder specified with path_pattern, and then you would put an en subfolder in that folder. So if path_pattern is meta-csv then the Spanish files would go in meta-csv and the English files would go in meta-csv/en. Hopefully I'm not remembering that wrong.
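
To make that layout concrete, here is a minimal sketch of how the folder for a given language might be resolved. The helper name `meta_folder_for_language` and its arguments are hypothetical and not part of sdg-build; it only encodes the convention described above (default language in the base folder, other languages in a subfolder named after the language code).

```python
from pathlib import PurePosixPath


def meta_folder_for_language(base_folder, language, default_language):
    """Return the folder expected to hold metadata for `language`.

    Hypothetical helper: the default language lives in the base folder,
    every other language lives in a subfolder named after its code.
    """
    base = PurePosixPath(base_folder)
    if language == default_language:
        return str(base)
    return str(base / language)


# Spanish is the default language, English is a second language.
print(meta_folder_for_language("meta-csv", "es", "es"))  # meta-csv
print(meta_folder_for_language("meta-csv", "en", "es"))  # meta-csv/en
```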

brockfanning commented 4 years ago

Also here is the part of the new code related to the metadata mapping: https://github.com/open-sdg/sdg-build/pull/170/files#diff-34796376f53b09765921dffaa1834531R113-R136

LucyGwilliamAdmin commented 4 years ago

@brockfanning thanks, think that's the setup I've got, then for the Russian (default language) meta Excel input I have a column which contains human-readable field names in Russian and a column which contains field values in Russian. In the English (2nd language) meta Excel input I have a column which contains human-readable field names in English and a column which contains field values in English. I then have metadata mapping csv file which in the first column contains all human-readable field names (so Russian and English) and then the second column contains the machine readable field names (twice) so like a 2 to 1 mapping I guess.

Should this work, or do all human readable names need to be in one language?

LucyGwilliamAdmin commented 4 years ago

For example in meta I have 1-1-1.xlsx: image

In meta/en I have 1-1-1.xlsx: image

Then I have metadata-mapping.csv: image

brockfanning commented 4 years ago

Yep I think that should work (at least that's the intention).
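
As a rough illustration of the two-to-one mapping described above, the sketch below reads a small mapping and applies it to metadata keyed by human-readable names in either language. This is not the code from the PR: the function names, the field names, and the sample rows are all made up for the example.

```python
import csv
import io

# Hypothetical contents of metadata-mapping.csv: human-readable names in
# either language (first column) map to one machine-readable key (second).
MAPPING_CSV = """\
Название показателя,indicator_name
Indicator name,indicator_name
Единица измерения,computation_units
Unit of measurement,computation_units
"""


def load_mapping(csv_text):
    """Read two-column rows into a {human_name: machine_key} dict."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def apply_mapping(meta, mapping):
    """Rename known human-readable keys; leave unmapped keys unchanged."""
    return {mapping.get(key, key): value for key, value in meta.items()}


mapping = load_mapping(MAPPING_CSV)
russian = apply_mapping({"Название показателя": "Доля населения"}, mapping)
english = apply_mapping({"Indicator name": "Proportion of population"}, mapping)
print(sorted(russian) == sorted(english))  # True: both use indicator_name
```

Because the lookup is keyed on the human-readable name, it does not matter which language a given input file uses, as long as every name appears somewhere in the first column.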

LucyGwilliamAdmin commented 4 years ago

@brockfanning is the human_key supposed to be the index when the mapping is read in?

LucyGwilliamAdmin commented 4 years ago

> This is an attempt to allow, for example, a CSV meta input from one folder, and a YAML meta input from another folder.
>
> This side-steps the meta.py and git.py include files, and in theory those could be deleted.
>
> This needs plenty of testing. In particular:
>
>   • Does the git stuff still work? (metadata "last update date" fields getting populated automatically)
>   • Does the multilingual subfolder approach still work for metadata?
>   • Does the metadata mapping work?

@brockfanning I've just looked at latest changes and:

Re. multilingual subfolder approach - I'm not sure what's happening. It seems to be fine for Excel (all terms are translating), but with YAML input the fields aren't getting translated, e.g. Graph title, Units of measurement, National geographical coverage

https://lucygwilliamadmin.github.io/open-sdg-site-starter-1/en/1-1-1/

Looking at the Data/Metadata last updated fields on the Indicator information tab, something seems to be up with that: the metadata date isn't updating, and the data date isn't showing at all

brockfanning commented 4 years ago

@LucyGwilliamAdmin I added a commit to hopefully help with that translation issue.

About the last update dates, let's see if that commit also helps there. It probably won't though. One thorny problem: the last-updated-date for the data comes from looking at the last Git commit for the data file. But how can the code know whether that data file should be a .csv file or a .xml file? It may be tricky to support that last-update-date when the data is not always in one type of file.

brockfanning commented 4 years ago

As a possible way to address the issue mentioned above, my latest commit looks for a metadata field called data_filename which can have the filename of the data for that indicator. So for example if the data for 1.1.1 is in 1-1-1.csv, the 1.1.1 metadata could include:

data_filename: 1-1-1.csv

And if the data for 1.2.1 is in 1-2-1.xml, the 1.2.1 metadata could include:

data_filename: 1-2-1.xml
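
The lookup this implies could be sketched roughly as follows. This is a simplified illustration rather than the code from the commit: the helper name is hypothetical, and the fallback to a .csv file named after the indicator is an assumption made for the example.

```python
def resolve_data_filename(indicator_id, meta, default_extension=".csv"):
    """Pick the data file whose Git history supplies the last-updated date.

    Hypothetical helper: prefer an explicit data_filename from the metadata,
    otherwise fall back to <indicator id><default extension> (an assumption).
    """
    filename = meta.get("data_filename")
    if filename:
        return filename
    return indicator_id + default_extension


print(resolve_data_filename("1-1-1", {}))  # 1-1-1.csv
print(resolve_data_filename("1-2-1", {"data_filename": "1-2-1.xml"}))  # 1-2-1.xml
```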

LucyGwilliamAdmin commented 4 years ago

@brockfanning ok - just tried this:

https://lucygwilliamadmin.github.io/open-sdg-site-starter-1/en/1-1-1/

brockfanning commented 4 years ago

@LucyGwilliamAdmin My suspicion is that the Excel date is showing only because it's the last to run (see the order here).

I've added a check for meta_filename as well, which might help with this. However the order may still have an effect. I think the key will be to specify the meta_filename in the metadata file that you don't want to affect the last updated date.

For example, if you are using both 1-1-1.csv and 1-1-1.xlsx for metadata, and you don't want the 1-1-1.csv to affect the last updated date, then you would need to add meta_filename: 1-1-1.xlsx in the CSV file. My intention is that this will ensure that the CSV file doesn't use itself for the last updated date (though I haven't tested it).

LucyGwilliamAdmin commented 4 years ago

> @LucyGwilliamAdmin My suspicion is that the Excel date is showing only because it's the last to run (see the order here).
>
> I've added a check for meta_filename as well, which might help with this. However the order may still have an effect. I think the key will be to specify the meta_filename in the metadata file that you don't want to affect the last updated date.

Ah that's not so bad then - I think in this case at least, I would always want the Excel metadata date to show.

> For example, if you are using both 1-1-1.csv and 1-1-1.xlsx for metadata, and you don't want the 1-1-1.csv to affect the last updated date, then you would need to add meta_filename: 1-1-1.xlsx in the CSV file. My intention is that this will ensure that the CSV file doesn't use itself for the last updated date (though I haven't tested it).

I have just tested this new field though and I'm getting an error: https://github.com/LucyGwilliamAdmin/open-sdg-data-starter-1/runs/1154682637?check_suite_focus=true

brockfanning commented 4 years ago

@LucyGwilliamAdmin Ah, I see that my latest code can't handle if the other file is in a different folder.

We could try adding yet another field, meta_filefolder. It's starting to get a bit complex though - I wonder how you would feel about removing that meta_filename code and just relying on the input ordering for that last-update-date issue.

LucyGwilliamAdmin commented 4 years ago

I think we should remove the field and rely on ordering. For now, the main need I foresee for files being in separate folders is to have the 'settings' in one and the actual metadata in another.

brockfanning commented 4 years ago

Sounds good - I've just reverted that last change.
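
For anyone reading later, relying on input ordering amounts to "last input wins". A minimal sketch of that behaviour, assuming each input yields a plain dict for an indicator (the function name, field names, and dates here are hypothetical, not taken from sdg-build):

```python
def merge_meta_inputs(inputs):
    """Merge metadata dicts in order; later inputs overwrite earlier ones."""
    merged = {}
    for meta in inputs:
        merged.update(meta)
    return merged


# Hypothetical values: each input contributes its own last-updated date.
csv_meta = {"indicator_name": "Example", "last_updated": "2020-09-01"}
xlsx_meta = {"last_updated": "2020-09-20"}

# The Excel input runs last, so its date is the one that survives.
merged = merge_meta_inputs([csv_meta, xlsx_meta])
print(merged["last_updated"])  # 2020-09-20
```

So to have the Excel metadata date show, the Excel input would need to be listed after the other meta inputs.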

LucyGwilliamAdmin commented 4 years ago

@brockfanning ok - I'm happy with everything I've tested so far - do you think there's anything else that needs to be tested?

brockfanning commented 4 years ago

I think that covers it. One thing though: meta.py and git.py can - in theory - be deleted now. Do you think we should go ahead and delete them in this PR? Or follow up in a separate PR for that?

LucyGwilliamAdmin commented 4 years ago

I think they could be deleted in this PR

brockfanning commented 4 years ago

@LucyGwilliamAdmin Ok, I've deleted those. For good measure do you think you could make sure that your data build still works?

Update: I tried it locally and the build completed without errors.