Description
This PR fixes:
#72: Prior to this PR, missing UPRN values were filled with BUILDING_REFERENCE_NUMBER, but only for England and Wales, so UPRN still had missing values for Scotland. That enhancement lived in asf_core_data/getter/epc/epc_data.py. I removed it from that script, so the raw data now keeps its missing UPRN values. Instead, we now use BUILDING_REFERENCE_NUMBER together with address information (ADDRESS1, ADDRESS2 and POSTCODE) to enhance UPRN during the processing pipeline; the changes are in asf_core_data/pipeline/preprocessing/feature_engineering.py. As a result, the processed and the processed-and-deduplicated EPC data should have no missing UPRN.
#86: I think this has already been fixed by Solomon in asf_daps, so there is no need to change anything on your side if loading Scotland data works for you. In asf_core_data, however, we had problems loading Scotland data because of column name changes and multiple column headers, so Julia implemented a fix in load_scotland_data() in asf_core_data/getter/epc/epc_data.py and in asf_core_data/config/base_config.py. These changes had not been committed before, so I'm committing them now.
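For reviewers who want a mental model of the #72 change, here is a minimal sketch of one way missing UPRN values could be filled from BUILDING_REFERENCE_NUMBER plus address fields. It is not the code in feature_engineering.py: the helper names `_normalise` and `fill_missing_uprn`, the `X` prefix, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib

import pandas as pd

# Columns named in the PR description.
KEY_COLS = ["BUILDING_REFERENCE_NUMBER", "ADDRESS1", "ADDRESS2", "POSTCODE"]


def _normalise(col: pd.Series) -> pd.Series:
    """Upper-case and strip whitespace so trivial differences do not change the key."""
    as_text = col.astype("object").where(col.notna(), "").astype(str)
    return as_text.str.upper().str.replace(r"\s+", "", regex=True)


def fill_missing_uprn(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill missing UPRN with a stable surrogate ID."""
    df = df.copy()

    # Build one key per row from the building reference and the address parts.
    key = _normalise(df[KEY_COLS[0]])
    for name in KEY_COLS[1:]:
        key = key + "|" + _normalise(df[name])

    # Assumed scheme: a short hash with a prefix so surrogates are
    # distinguishable from genuine UPRNs.
    surrogate = key.map(
        lambda k: "X" + hashlib.sha1(k.encode("utf-8")).hexdigest()[:12]
    )

    # Keep existing UPRNs and only fill the gaps.
    df["UPRN"] = df["UPRN"].astype("object").where(df["UPRN"].notna(), surrogate)
    return df
```

With a scheme like this, two records for the same property that differ only in case or spacing get the same surrogate ID, which matters for the deduplication step.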
closes #72
closes #86
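For #86, the kind of problem described (renamed columns plus more than one header row) can be sketched as below. This is only an illustration under assumptions, not the actual load_scotland_data() fix: the function `read_scotland_csv`, the `SCOTLAND_RENAMES` mapping, the sample column names, and the idea that the extra header is the second line of the file are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical mapping from changed Scottish column names to the names
# expected by the rest of the pipeline.
SCOTLAND_RENAMES = {"Property_UPRN": "UPRN", "Postcode": "POSTCODE"}


def read_scotland_csv(source) -> pd.DataFrame:
    """Hypothetical reader: skip an assumed second header row, then rename columns."""
    # skiprows=[1] drops the second physical line of the file (0-indexed),
    # so only the first line is used as the header.
    df = pd.read_csv(source, skiprows=[1], dtype=str)
    return df.rename(columns=SCOTLAND_RENAMES)


# Tiny in-memory sample with two header lines followed by one data row.
sample = io.StringIO(
    "Property_UPRN,Postcode,ADDRESS1\n"
    "Unique property reference,Postcode of the property,First address line\n"
    "1234,EH1 1AA,1 Example Street\n"
)
scotland_df = read_scotland_csv(sample)
```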
Instructions for Reviewer(s)
Review
Hey @Jack-Vines and @sqr00t, thanks for reviewing these scripts. Could one of you please double-check that the logic looks okay for the changes in asf_core_data/pipeline/preprocessing/feature_engineering.py? Thanks a lot!
Setup
In case you want to run anything (no need to!!!):
Clone this repo: git clone git@github.com:nestauk/asf_core_data.git
Check out the correct branch: git checkout 72_fix_property_id_and_epc_loading
Run make install
Run direnv allow
Activate the conda environment: conda activate asf_core_data
Run python asf_core_data/pipeline/preprocessing/preprocess_epc_data.py (this loads data from S3, so it might take some time). If you have raw data in a local folder, you can instead run python asf_core_data/pipeline/preprocessing/preprocess_epc_data.py --path_to_data "YOUR/LOCAL/PATH/TO/DATA". Note that, either way, the processed data is only stored locally.