urbanbigdatacentre / ideamaps-models

Models of deprivation sub-domains for the IDEAMAPS data ecosystem project. This repo contains the source code used to run models and the model outputs. It also contains logic to upload model outputs to the IDEAMAPS platform.
MIT License

Write Source Code #14

Closed · andymithamclarke closed this issue 2 months ago

andymithamclarke commented 6 months ago

@Gtregon to coordinate the reference data team (@Adenikemie + Alex) in creating the training dataset for the new morphological informality model, based on the reference data created in #9.

The task involves using 3-point reference data to generate training data from the following datasets:

The datasets marked with ** are new to the modelling process and will require some time to become familiar with.

This task can be closed when we have a training dataset for the morphological informality model, and this dataset is referenced from within the GitHub repo and stored in an accessible place such as CRIB.

Gtregon commented 5 months ago

UPDATE:

The population, building density and road datasets will be used to generate reference data for the deep learning (DL) model. We no longer need to do any additional preparation for the DL model, as the GW team have formalised a workflow that allows any team member to input reference data and generate outputs using their model.

This issue can be closed when the reference data has been forwarded to the GW team (Ryan et al.).

Gtregon commented 5 months ago

UPDATE:

The WP2/3 team have now agreed that a rule-based model would be the best approach to generate initial high, medium and low classifications of morphological informality within our pilot cities.

@Gtregon will therefore develop source code to ingest the reference data developed in #15 #16 #17 and deploy a rule-based model.

This issue can be closed when source code has been developed to run a rule-based model using the reference datasets.
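
As an illustrative sketch only (not from this thread), a rule-based classifier over a covariate grid might look like the following. The column names, thresholds and file paths are all hypothetical; the real rules would come from the reference data in #15 #16 #17.

```python
import geopandas as gpd

# Illustrative rules only: column names and thresholds are hypothetical,
# not the project's actual rule set.
def classify_mi(row):
    """Assign a high/medium/low morphological informality class."""
    if row["building_density"] > 0.6 and row["road_density"] < 2.0:
        return "high"
    if row["building_density"] > 0.3:
        return "medium"
    return "low"

cells = gpd.read_file("covariates.gpkg")  # hypothetical input path
cells["mi_class"] = cells.apply(classify_mi, axis=1)
cells.to_file("mi_rule_based.gpkg", driver="GPKG")
```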

gielinkg commented 4 months ago

✅ Definition of Done

Gtregon commented 3 months ago

Update: writing the source code and running the model will be combined within the same issue, as these tasks will be completed simultaneously, i.e. the analysis/model runs will be performed as the source code is written.

Updated set of tasks to be completed within this issue:

  1. @Gtregon to combine all covariate data into one complete dataset. At the moment, the covariates exist in separate GeoPackage (.gpkg) files. A CSV file will be generated that joins all the relevant datasets together, so that all covariates sit in one dataset and can be used within the ML model (see the first sketch after this list).
  2. A Jupyter notebook will be used to develop the Python source code: NumPy and pandas/geopandas will process the data, scikit-learn will be used to deploy the random forest (RF) model, and SHAP values will be calculated to explain covariate contributions (see the second sketch after this list).
  3. Reference datasets will be used to generate training and testing datasets for high, medium and low MI. Kano will likely be modelled first due to its higher number of unique samples in the reference data (829 unique samples vs 202 in Lagos). 80% of the data will be used for training and 20% for testing. A 70-30 split is usually recommended, but due to the limited amount of reference data, a higher share will be kept for training.
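
A minimal sketch of task 1, assuming each covariate layer shares a common grid-cell ID column (here called `cell_id`, which is hypothetical, as are the file paths):

```python
from functools import reduce

import geopandas as gpd

# Hypothetical layer paths; the real covariates live in separate .gpkg files.
layers = ["population.gpkg", "building_density.gpkg", "road_density.gpkg"]

# Read each layer, drop the geometry, and join the attribute tables
# on the shared grid-cell ID so every covariate sits in one table.
frames = [gpd.read_file(path).drop(columns="geometry") for path in layers]
covariates = reduce(lambda left, right: left.merge(right, on="cell_id"), frames)
covariates.to_csv("covariates_combined.csv", index=False)
```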
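
And a minimal sketch of tasks 2 and 3, assuming a hypothetical label column `mi_class` with high/medium/low classes, and using the standard shap package alongside scikit-learn (the thread does not name the exact SHAP library):

```python
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("covariates_combined.csv")   # output of the join step above
X = df.drop(columns=["cell_id", "mi_class"])  # hypothetical column names
y = df["mi_class"]                            # high / medium / low labels

# 80/20 split: a larger share is kept for training given the limited
# reference data (cf. the usual 70-30 recommendation).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))

# SHAP values quantify each covariate's contribution to the predictions.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
```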
andymithamclarke commented 2 months ago

Closing as the source code is written: