@maxmalynowsky you raise a lot of valid points. I've got more questions than answers at this stage:
> A few thoughts on setting up these checks:
>
> - Libraries: I'd suggest `geopandas` over `pyshp`. I'm imagining the performance is going to be much better for spatial and table operations, particularly with `pyarrow` (`gpd.read_file(file, use_arrow=True)`). Complex table syntax is also probably going to be easier with something pandas-based rather than iterating through rows. I'm also seeing `pyshp` hasn't had a release in over 2 years, which makes it seem like the project might be abandoned now.
Do we understand yet who the user is (who is running the code) and where/how they'll be running it? It wasn't clear to me whether this is being run in a pipeline, in automated scripts periodically, or by users checking individual datasets before they start making maps with them. `geopandas` relies on the GDAL/PROJ/GEOS stack, which is a pain in the backside to install on many machine types (and if this is being used mainly by people making maps, it'll be run on old Windows machines). Also, I thought (though I'm no longer confident) that in the kick-off call Shapefiles were named as the format of choice? Because of this I went with `pyshp`, which is pure Python and simplifies a lot of things. But if this is mainly being run in pipelines that we control, I'd be happier with `geopandas`, as we can run it from Docker with fewer installation/maintenance headaches. @leonbaruah / @vshlemon / @ddebarros-mapaction can you shed any light?

Your points on performance are correct, but I'd guess the bigger performance bottleneck will be the network I/O for downloading the files rather than the C vs pure Python overhead. I'm not quite sure I buy your argument that the project is abandoned either - it looks alive to me.
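For concreteness, here's a minimal sketch of the `geopandas` + `pyarrow` read path being suggested. The file path and the `ADM1_EN` column are placeholders, and `use_arrow=True` needs a reasonably recent `geopandas` with the `pyogrio` engine installed:

```python
import geopandas as gpd

# Placeholder path to one downloaded COD admin boundary dataset.
path = "data/cod_ab/afg_adm1.gpkg"

# use_arrow=True hands deserialisation to pyarrow, which is typically much
# faster than the default row-by-row read for large attribute tables.
gdf = gpd.read_file(path, use_arrow=True)

# Table checks then become vectorised pandas expressions instead of manual
# iteration over records, e.g. counting features missing a name value
# (ADM1_EN is an assumed column name for illustration):
missing_names = gdf["ADM1_EN"].isna().sum()
print(f"{missing_names} features have no ADM1_EN value")
```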
> - File formats: I find GeoPackages easier to work with than Shapefiles. Since the original files are ESRI JSON from an ArcGIS Server, I think we can choose among ourselves what our preferred working format is. We should definitely use something with a spatial index (GPKG / FlatGeobuf / SHP) rather than GeoJSON.
I agree - Shapefiles always feel very legacy to me. However, this probably comes back to the users again. If the users are mainly running this on individual files before mapping, Shapefiles are probably the best format; if it's mostly being run in other ways, another format might be better. That would also feed into the library choice in point 1 above.
> - Sub-module structure: Knowing we're going to have a number of different kinds of checks, and to keep things modular and easy to work with in parallel, this is the workflow / structure I had in mind:
>   - Checks are broken up into sub-modules based on theme: spatial coverage, spatial nesting, table completeness, table formatting, P-Codes, attribute fields, etc. These are called in ways like `python -m src.spatial_nesting`, `python -m src.table_completeness`, etc. There can be a `__main__.py` which calls every sub-module in sequence, the way I suggested in my branch with `make run`.
I'm happy with this - I just hadn't decided on it yet, as I'm still not clear on how it'll be used 🙂 If you know more about the usage and the users then please shout - or feel free to point me towards docs if they exist. A sketch of what that `__main__.py` could look like is below.
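A minimal sketch of that `src/__main__.py` dispatcher, assuming each check sub-module exposes a `run()` function (both the module list and the `run()` interface are assumptions here, not agreed conventions):

```python
"""src/__main__.py - run every check sub-module in sequence."""
import importlib

# Assumed module names following the themes above; adjust to match the repo.
CHECK_MODULES = [
    "src.spatial_coverage",
    "src.spatial_nesting",
    "src.table_completeness",
    "src.table_formatting",
]


def main() -> None:
    for name in CHECK_MODULES:
        print(f"Running {name}...")
        # Each sub-module is assumed to expose a run() entry point.
        importlib.import_module(name).run()


if __name__ == "__main__":
    main()
```

`make run` could then simply wrap `python -m src`.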
> - Each sub-module uses the `metadata.csv` file to iterate through the list of 164 locations and outputs the result of its check to something like `data/spatial_coverage.csv`, `data/table_completeness.csv`, etc. The content of these files is a key (`iso3`) along with however many columns are needed to add information about the results of the check, whether they're numerical counts, decimal percentages, or boolean pass / fails.
> - Calculating the final composite will take the `metadata.csv` and join it together with all the individual check CSVs through the `iso3` column to make the final result, which we can think about formatting nicely for delivery to the people who will take action from it. I'm thinking something like using pandas to highlight cells in Excel showing issues to be fixed, or making automated PDF reports with Plotly / Jinja2 / WeasyPrint.
I see a bunch of extra file I/O with this plan that might be a little inefficient, but on the flip side it might make parallelising the jobs easier. Again, whether this architecture is what we want might depend on where we're running it. A sketch of the composite join is below.
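To illustrate the composite step, a sketch of the `iso3` join over the file layout described above. The file names and highlight rule are illustrative; writing Excel needs `openpyxl`, and `Styler.map` needs pandas >= 2.1 (older versions call it `applymap`):

```python
import glob
import pandas as pd

# metadata.csv holds the canonical list of ~164 locations, keyed on iso3.
composite = pd.read_csv("metadata.csv")

# Left-join each per-check CSV onto the metadata, so a location a check
# skipped still appears in the composite (with NaNs flagging the gap).
for check_csv in sorted(glob.glob("data/*.csv")):
    composite = composite.merge(pd.read_csv(check_csv), on="iso3", how="left")


def highlight_fail(v):
    # Highlight explicit boolean failures. Equality (not identity) is used
    # so numpy bools match too; NaN and non-bool values pass through.
    return "background-color: #f8d7da" if v == False else ""  # noqa: E712


# One delivery option: an Excel sheet with failing cells highlighted.
composite.style.map(highlight_fail).to_excel(
    "data/composite_report.xlsx", index=False
)
```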
@Murray2015, those are also all good points. Based on the talks I've had with @vshlemon, there seem to be 2 paths we can take with this project. It might be that both are beneficial and needed.
Live monitoring:
Workflow processing:
Comparing the two, there's much more capacity at the Centre for Humanitarian Data to host this project, and it fits in well with the many data scrapers and pipelines they already manage. There also isn't anything that comprehensively checks the entire catalogue of all CODs currently published, which I feel is a critical gap in the current system. We should have a meeting among ourselves and the two groups above to check in.
I don't want to merge this of my own accord because I think @leonbaruah & @maxmalynowsky would have a better idea of the subject matter and what tests are intended to be run & whether these meet the project requirements.
Also, since @maxmalynowsky has downloaded the data in `.gpkg` format, how is that going to bode for the tests, given these have mostly been written for `.shp` files, @Murray2015?
It seems to me we might want to decide on who the users will be, which file formats we'll support (`.gpkg`, `.shp`, etc.), and how we will direct the project to support them before moving much further?

Perhaps @ddebarros-mapaction & @jduarte-mapaction you could help us roadmap this, as it might be that the tool is intended for multiple use-cases and execution methods (by pipeline, by individual, with already-downloaded files, to download them, etc.) and users (OCHA, universities, country offices, etc.), and this might need us to be more concrete about the features of the tool and how to phase the development (e.g. some users/executions/file-formats/features might be more important, some nice-to-have but not critical, etc.)?
@vshlemon and @maxmalynowsky, I've tracked down answers to these questions. I've posted a summary in Slack, but I will repeat here the bits that are important for dev work:
- **Goal** - give a (subjective) score from 0 - 1 on the quality and reliability of COD Admin Boundary datasets, to allow users to choose where to spend time improving datasets.
- **Output** - tests should output a number between 0 and 1 representing their score, with the methodology based on the Confluence pages. The overall goal (relevant for running against all datasets) is to provide a traffic-light system, as illustrated at the bottom of the Confluence page (a sketch of that banding follows this list). `.csv` might be a good choice.
- **User** - the main user is ITOS. The user will run the code against one, a few, or all COD datasets at once.
- **How the code will be run** - this is not fully understood, and MapAction is holding a meeting in the next few weeks to gather requirements on this. The most likely way it will run is in pipelines or scripts. We should also expect some users to run the code manually against datasets in a local environment.
- **Code deployment, operating systems, etc.** - this is still unknown. @maxmalynowsky you might have the best idea of this, given the directive that ITOS should be considered the main user.
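A minimal sketch of mapping a 0 - 1 score onto that traffic-light output. The 0.5 / 0.8 cut-offs below are placeholders; the real thresholds come from the Confluence methodology:

```python
def traffic_light(score: float) -> str:
    """Map a 0-1 quality score onto a traffic-light band.

    The cut-offs here are placeholder values, not the Confluence methodology.
    """
    if score >= 0.8:
        return "green"
    if score >= 0.5:
        return "amber"
    return "red"


assert traffic_light(0.93) == "green"
assert traffic_light(0.61) == "amber"
assert traffic_light(0.20) == "red"
```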
My takeaways/suggestions
This should unblock us enough to move forward and accept most of @maxmalynowsky's initial suggestions in this MR. I'll change this MR to use `geopandas` over `pyshp`, and this clears us to prefer GPKG over other formats.
Closing this PR as there is a new branch with the refactor