nestauk / dap_prinz_green_jobs

Identifying green occupations/skills/industries in job adverts
MIT License

Fix GJE double quotes issue #130

Closed lizgzil closed 4 months ago

lizgzil commented 7 months ago

In gje_formatting.py we have a step that converts all single quotes to double quotes, which the GJE requires.

The code is

for col_name in [
    "top_5_socs",
    "top_5_green_skills",
    "top_5_not_green_skills",
    "top_5_sics",
    "top_5_itl2_quotient",
    "top_5_similar_occs",
]:
    occ_agg_extra_loaded[col_name] = occ_agg_extra_loaded[col_name].str.replace(
        "'", '"'
    )

but this causes issues when the text itself contains a single quote (an apostrophe): the apostrophe is converted too, so we get e.g. "Manufacture of other builders" carpentry and joinery".

Note: this isn't as simple as only replacing the quotes at the start and end, because top_5_socs is a single string holding the whole list, so it contains many quoted items. It looks something like the following:

"['Gardening', 'Manufacture of other builders' carpentry and joinery', 'something else', 'something else', 'something else']"