@maxmalynowsky you raise a lot of valid points. I've got more questions than answers at this stage:
> A few thoughts on setting up these checks:
>
> - Libraries: I'd suggest `geopandas` over `pyshp`. I'm imagining the performance is going to be much better for spatial and table operations, particularly with `pyarrow` (`gpd.read_file(file, use_arrow=True)`). Complex table syntax is also probably going to be easier with something pandas-based rather than iterating through rows. I'm also seeing `pyshp` hasn't had a release in over 2 years, which makes it seem like the project might be abandoned now.
Do we understand yet who the user is (who is running the code) and where/how they'll be running it? It wasn't clear to me whether this is being run in a pipeline, in automated scripts periodically, or by users checking individual datasets before they start making maps with them. `geopandas` relies on the GDAL/PROJ/GEOS stack, which is a pain in the backside to install on many machine types (and if this is being used mainly by people making maps, it'll be run on old Windows machines). Also, I thought (though I'm no longer confident) that in the kick-off call Shapefiles were named as the format of choice? Because of this I went with `pyshp`, which is pure Python and simplifies a lot of things. But if this is mainly being run in pipelines that we control, I'd be happier with `geopandas`, as we can run it from Docker with fewer installation/maintenance headaches. @leonbaruah / @vshlemon / @ddebarros-mapaction can you shed any light?

Your points on performance are correct, but I'd guess the bigger performance bottleneck will be the network I/O for downloading the files rather than the C vs pure Python overhead. I'm not quite sure I buy your argument that the project is abandoned either - it looks alive to me.
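For concreteness, here's a minimal sketch of the `geopandas` + `pyarrow` read path being suggested. The file path and the `ADM1_EN` column are placeholders, and `use_arrow=True` needs a reasonably recent `geopandas` with the `pyogrio` engine installed:

```python
import geopandas as gpd

# Placeholder path to one downloaded COD admin boundary dataset.
path = "data/cod_ab/afg_adm1.gpkg"

# use_arrow=True hands deserialisation to pyarrow, which is typically much
# faster than the default row-by-row read for large attribute tables.
gdf = gpd.read_file(path, use_arrow=True)

# Table checks then become vectorised pandas expressions instead of manual
# iteration over records, e.g. counting features missing a name value
# (ADM1_EN is an assumed column name for illustration):
missing_names = gdf["ADM1_EN"].isna().sum()
print(f"{missing_names} features have no ADM1_EN value")
```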
> - File formats: I find GeoPackages easier to work with than Shapefiles. Since the original files are ESRI JSON from an ArcGIS Server, I think we can choose among ourselves what our preferred working format is. We should definitely use something with a spatial index (GPKG / FlatGeobuf / SHP) rather than GeoJSON.
I agree - Shapefiles always feel very legacy to me. However, this probably comes back to the users again. If the users are mainly running this on individual files before mapping, Shapefiles are probably the best format; if it's mostly being run in other ways, another format might be better. That would also feed into the library choice in point 1 above.
> - Sub-module structure: Knowing we're going to have a number of different kinds of checks, and to keep things modular and easy to work with in parallel, this is the workflow / structure I had in mind:
>   - Checks are broken up into sub-modules based on theme: spatial coverage, spatial nesting, table completeness, table formatting, P-Codes, attribute fields, etc. These are called in ways like `python -m src.spatial_nesting`, `python -m src.table_completeness`, etc. There can be a `__main__.py` which calls every sub-module in sequence, the way I suggested in my branch with `make run`.
I'm happy with this - I just hadn't decided on it yet, as I'm still not clear on how it'll be used 🙂 If you know more about the usage and the users then please shout - or feel free to point me towards docs if they exist. A sketch of what that `__main__.py` could look like is below.
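A minimal sketch of that `src/__main__.py` dispatcher, assuming each check sub-module exposes a `run()` function (both the module list and the `run()` interface are assumptions here, not agreed conventions):

```python
"""src/__main__.py - run every check sub-module in sequence."""
import importlib

# Assumed module names following the themes above; adjust to match the repo.
CHECK_MODULES = [
    "src.spatial_coverage",
    "src.spatial_nesting",
    "src.table_completeness",
    "src.table_formatting",
]


def main() -> None:
    for name in CHECK_MODULES:
        print(f"Running {name}...")
        # Each sub-module is assumed to expose a run() entry point.
        importlib.import_module(name).run()


if __name__ == "__main__":
    main()
```

`make run` could then simply wrap `python -m src`.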
> - Each sub-module uses the `metadata.csv` file to iterate through the list of 164 locations and outputs the result of its check to something like `data/spatial_coverage.csv`, `data/table_completeness.csv`, etc. The content of these files is a key (`iso3`) along with however many columns are needed to add information about the results of the check, whether they're numerical counts, decimal percentages, or boolean pass / fails.
> - Calculating the final composite will take the `metadata.csv` and join it together with all the individual check CSVs through the `iso3` column to make the final result, which we can think about formatting nicely for delivery to the people who will take action from it. I'm thinking something like using pandas to highlight cells in Excel showing issues to be fixed, or making automated PDF reports with Plotly / Jinja2 / WeasyPrint.
I see a bunch of extra file I/O with this plan that might be a little inefficient, but on the flip side it might make parallelising the jobs easier. Again, whether this architecture is what we want might depend on where we're running it. A sketch of the composite join is below.
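To illustrate the composite step, a sketch of the `iso3` join over the file layout described above. The file names and highlight rule are illustrative; writing Excel needs `openpyxl`, and `Styler.map` needs pandas >= 2.1 (older versions call it `applymap`):

```python
import glob
import pandas as pd

# metadata.csv holds the canonical list of ~164 locations, keyed on iso3.
composite = pd.read_csv("metadata.csv")

# Left-join each per-check CSV onto the metadata, so a location a check
# skipped still appears in the composite (with NaNs flagging the gap).
for check_csv in sorted(glob.glob("data/*.csv")):
    composite = composite.merge(pd.read_csv(check_csv), on="iso3", how="left")


def highlight_fail(v):
    # Highlight explicit boolean failures. Equality (not identity) is used
    # so numpy bools match too; NaN and non-bool values pass through.
    return "background-color: #f8d7da" if v == False else ""  # noqa: E712


# One delivery option: an Excel sheet with failing cells highlighted.
composite.style.map(highlight_fail).to_excel(
    "data/composite_report.xlsx", index=False
)
```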
@Murray2015, those are also all good points. Based on the talks I've had with @vshlemon, there seem to be 2 paths we can take with this project. It might be that both are beneficial and needed.
Live monitoring:
Workflow processing:
Comparing the two, there's much more capacity at the Centre for Humanitarian Data to host this project, and it fits in well with the many data scrapers and pipelines they already manage. There also isn't anything that comprehensively checks the entire catalogue of all CODs currently published, which I feel is a critical gap in the current system. We should have a meeting among ourselves and the two groups above to check in.
I don't want to merge this of my own accord because I think @leonbaruah & @maxmalynowsky would have a better idea of the subject matter and what tests are intended to be run & whether these meet the project requirements.
Also, since @maxmalynowsky has downloaded the data in `.gpkg` format, how is that going to bode for the tests, given these have mostly been written for `.shp` files, @Murray2015?
It seems to me we might want to decide on who the users will be, which file formats we'll support (`.gpkg`, `.shp`, etc.), and how we will direct the project to support them before moving much further?

Perhaps @ddebarros-mapaction & @jduarte-mapaction you could help us roadmap this, as it might be that the tool is intended for multiple use-cases and execution methods (by pipeline, by individual, with already-downloaded files, to download them, etc.) and users (OCHA, universities, country offices, etc.), and this might need us to be more concrete about the features of the tool and how to phase the development (e.g. some users/executions/file-formats/features might be more important, some nice-to-have but not critical, etc.)?
@vshlemon and @maxmalynowsky, I've tracked down answers to these questions. I've posted a summary in Slack, but I will repeat here the bits that are important for dev work:
- **Goal** - give a (subjective) score from 0 - 1 on the quality and reliability of COD Admin Boundary datasets, to allow users to choose where to spend time improving datasets.
- **Output** - tests should output a number between 0 and 1 representing their score, with the methodology based on the Confluence pages. The overall goal (relevant for running against all datasets) is to provide a traffic-light system, as illustrated at the bottom of the Confluence page (a sketch of that banding follows this list). `.csv` might be a good choice.
- **User** - the main user is ITOS. The user will run the code against one, a few, or all COD datasets at once.
- **How the code will be run** - this is not fully understood, and MapAction is holding a meeting in the next few weeks to gather requirements on this. The most likely way it will run is in pipelines or scripts. We should also expect some users to run the code manually against datasets in a local environment.
- **Code deployment, operating systems, etc.** - this is still unknown. @maxmalynowsky you might have the best idea of this, given the directive that ITOS should be considered the main user.
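A minimal sketch of mapping a 0 - 1 score onto that traffic-light output. The 0.5 / 0.8 cut-offs below are placeholders; the real thresholds come from the Confluence methodology:

```python
def traffic_light(score: float) -> str:
    """Map a 0-1 quality score onto a traffic-light band.

    The cut-offs here are placeholder values, not the Confluence methodology.
    """
    if score >= 0.8:
        return "green"
    if score >= 0.5:
        return "amber"
    return "red"


assert traffic_light(0.93) == "green"
assert traffic_light(0.61) == "amber"
assert traffic_light(0.20) == "red"
```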
My takeaways/suggestions
This should unblock us enough to move forward and accept most of @maxmalynowsky's initial suggestions in this MR. I'll change this MR to use `geopandas` over `pyshp`, and this clears us to prefer GPKG over other formats.
Closing this PR as there is a new branch with the refactor