healthyregions / chicago-environment-explorer

ChiVes harmonizes & standardizes environmental data across dozens of sources to make it accessible for full exploration, alongside a growing list of resources on the Chicago Environment, cultivated by a community of curators.

The data dictionary does not match the column names in shapefile [Bug] #158

Closed MStuhlmacher closed 3 months ago

MStuhlmacher commented 5 months ago

Describe the problem: The column headings defined in the data dictionary on this page (https://chichives.com/data) do not match the column headings in the shapefile that can be downloaded from the same page. For example, the data dictionary says that the tree crown density heading is called "treeCrDn", but in the shapefile this heading is "tree_crow" or maybe "tree_den".

Expected behavior: The data dictionary should match the headings in the shapefile and should include all columns in the shapefile.


mradamcox commented 3 months ago

Thanks for this ticket @MStuhlmacher. Looked into it further and we needed to update our script that generates the downloadable dataset. Should have the fix for this deployed tomorrow.

mradamcox commented 3 months ago

@bodom0015 the problem goes a little deeper than just the need to auto-generate the public download file. Ultimately, the data dictionary shown on the Data page comes directly from the Columns sheet in the main Data Dictionary and Variables file, but these column names may not actually exist in the data (and nothing breaks if they don't). Each entry in the Variables sheet does have a column name, which I thought had to link to the Columns sheet, but it actually links to the column names in the data sheets themselves. For example, trees_crown_density (shortened to trees_crow during shapefile export) is the name in the data sheet, while treeCrDn is what is listed in the Columns sheet.
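
To make the mismatch concrete, here is a minimal Python sketch, not the project's actual export script. It assumes the export simply cuts each name to the 10-character limit that the shapefile's DBF attribute table places on field names (real exporters may also de-duplicate truncated names). The helper names `shapefile_name` and `missing_from_data` are hypothetical.

```python
# Hypothetical helpers for illustration; not part of the ChiVes codebase.
DBF_FIELD_NAME_LIMIT = 10  # DBF field names in a shapefile hold at most 10 characters


def shapefile_name(column):
    """Return the name a column would get under plain truncation (an assumption)."""
    return column[:DBF_FIELD_NAME_LIMIT]


def missing_from_data(dictionary_names, data_columns):
    """List data dictionary names that match no exported shapefile column."""
    exported = {shapefile_name(c) for c in data_columns}
    return sorted(n for n in dictionary_names if n not in exported)


print(shapefile_name("trees_crown_density"))  # trees_crow
print(missing_from_data(["treeCrDn"], ["trees_crown_density"]))  # ['treeCrDn']
```

A check like this could flag dictionary entries such as treeCrDn that have no counterpart in the exported file, instead of letting them pass silently.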

So! The following things need to happen to bring this all together:

  1. Audit all column names in the Columns sheet, and where needed change them to:
    • camelCase
    • <= 10 characters long
  2. Update source data sheets to use these new names
    • This will require finding which source data sheet has each column in it, and then actually altering the header.
    • Very importantly, any metadata documents that explicitly mention column names will have to be updated as well
  3. Update column name as needed for each entry in the Variables sheet
    • I've added validation that pulls from the list of column names in the Columns sheet, which is why those names need to be set to match the actual data before doing this final step.
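
The audit in step 1 could be sketched roughly as below. This is an illustrative check only: the regular expression is a loose stand-in for "camelCase" (a lowercase first letter followed by letters and digits, no underscores), and the function names `audit_column_name` and `audit` are hypothetical, not part of the repository.

```python
import re

MAX_LEN = 10  # upper bound from step 1 (matches the shapefile field-name limit)
# Loose camelCase approximation (an assumption): starts lowercase, letters/digits only.
CAMEL_CASE = re.compile(r"^[a-z][a-zA-Z0-9]*$")


def audit_column_name(name):
    """Return a list of rule violations for one column name (empty if it passes)."""
    problems = []
    if len(name) > MAX_LEN:
        problems.append("longer than %d characters" % MAX_LEN)
    if not CAMEL_CASE.match(name):
        problems.append("not camelCase")
    return problems


def audit(names):
    """Map each failing column name to its list of problems."""
    report = {}
    for name in names:
        problems = audit_column_name(name)
        if problems:
            report[name] = problems
    return report


# Only names that break a rule appear in the report.
print(audit(["treeCrDn", "trees_crown_density", "asthmaAdj"]))
```

Note that a pattern check like this cannot catch spelling mistakes such as asthemaAdj, which satisfies both rules; those still need a manual review.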

After all of this is done, check out the branch from the linked PR above and run yarn build:data, then check the following:

Anything else? Ultimately this will greatly improve the coherence of the backend data content within the app.

mradamcox commented 3 months ago

Also, please fix the typo in the column name asthemaAdj (should be asthmaAdj).

mradamcox commented 3 months ago

This is now handled, though it required a bit of work behind the scenes. The solution I outlined above didn't end up being feasible (we aren't able to change the column headers in the original source data), so here is the solution I came up with: