nestauk / industrial_taxonomy

A refactor of nestauk/industrial-taxonomy that will replace it upon completion.
MIT License

17 Process Glass descriptions into tokenised n-grams #25

Closed bishax closed 2 years ago

bishax commented 2 years ago

Closes #17


Checklist:

bishax commented 2 years ago

@Juan-Mateos Addressed comments and rebased onto the latest `15_glass_house`.

Changes: https://github.com/nestauk/industrial_taxonomy/pull/25/files/6bc140ee9248f1b37173d50806568175875a22a9..59356e7735d54ba4859bf72601d974a91df84f80

N.B. I set a full run (2343) going after rebasing, just to be sure. Check its status with `python industrial_taxonomy/pipeline/glass_description_ngrams/nlp_flow.py --environment=conda batch list`.

bishax commented 2 years ago
> * [ ] I tested `get_description_tokens()` and it works. I was, however, surprised to find that it returns 17818 4-grams, whereas the instructions in the `README` say that we are only extracting bigrams and trigrams. Is this a typo in the `README`?

Yes, it does return 481 unique "4"-grams. This is because spaCy merges entities into a single token and we then also perform statistical n-gramming (for n=3), so an n-gram that contains a merged multi-word entity can end up with more than three underscore-separated parts.

You can see this from the set of 4-grams:

{'Air_Source_Heat_Pumps',
 'Area_Outstanding_Natural_Beauty',
 'BS_EN_ISO_9001',
 'BS_EN_ISO_9001:2008',
 'Business_Improvement_District_BID',
 'CARDINAL_g_CARDINAL_g',
 'CARDINAL_seater_CARDINAL_seater',
 'CARDINAL_t_CARDINAL_t',
 'Car_Showroom_Cleaning_pre',
 'Certificate_booking_ATOL_protect',
 'Cognitive_Behavioural_Therapy_CBT',
 'Conduct_package_flight_plus',
 'Construction_Skills_Certification_Scheme',
 'Crazy_Paving_slabbing_Hot',
 'DATE_Foundation_Stage_Curriculum',
 'Director_PERSON_accept_invitation',
 'Early_Years_Foundation_Stage',
 'Energy_Performance_Certificates_epc',
 'FRN_CARDINAL_finance_subject',
 'GPE_GPE_Home_Counties',
 'Gas_Safe_OFTEC_register',
 'Gas_Safe_register_plumber',
 'Grade_1_list_building',
 'Grade_2_list_building',
 'Grade_CARDINAL_list_building',
 'Grade_II_list_building',
 'Headquartered_GPE_office_GPE',
 'High_Net_Worth_Individuals',
 'James_Place_Wealth_Management',
 'LOC_GPE_Home_Counties',
 'MOT_testing_car_servicing',
 'MOT_testing_servicing_repair',
 'Managing_Director_DATE_experience',
 'Managing_Director_Mr_PERSON',
 'Managing_Director_PERSON_PERSON',
 'Managing_Director_PERSON_say',
 'Method_statement_Risk_Assessments',
 'Method_statement_risk_assessment',
 'Mr._PERSON_DATE_experience',
 'Mr._PERSON_Mr._PERSON',
 'Mr_PERSON_Managing_Director',
 'Mr_PERSON_Mr_PERSON',
 'NORP_Sign_Language_BSL',
 'ORDINAL_form_academy_status',
 'ORDINAL_hand_car_van',
 'ORDINAL_tier_NORP_football',
 'Owner_PERSON_DATE_experience',
 'PVCu_window_door_conservatory',
 'Pizza_constantly_strive_improve',
 'Place_Wealth_Management_Group',
 'Public_Liability_Insurance_MONEY',
 'Queen_Elizabeth_Olympic_Park',
 'Regulatory_Reform_Fire_Safety',
 'Risk_Assessments_Method_Statements',
 'Risk_Assessments_Method_statement',
 'Senior_Partner_Practice_St.',
 'Sir_PERSON_PERSON_PERSON',
 'Sir_PERSON_Sir_PERSON',
 'South_Downs_National_Park',
 'TIME_day_DATE_week',
 'Window_Cleaning_Kitchen_Cleaning',
 'accept_credit_debit_card',
 'accept_major_credit_card',
 'accept_major_credit_debit',
 'accept_major_debit_credit',
 'accountancy_service_small_medium',
 'accurate_valuation_take_account',
 'act_credit_broker_lender',
 'addition_traditional_auditing_accounting',
 'adhere_strict_code_conduct',
 'air_conditioning_heat_pump',
 'air_conditioning_heating_ventilation',
 'air_conditioning_refrigeration_ventilation',
 'air_freight_sea_freight',
 'air_source_heat_pump',
 'aluminium_window_door_curtain',
 'ample_free_car_parking',
 'annual_turnover_excess_MONEY',
 'answer_question_look_forward',
 'answer_question_smile_question',
 'anti_wrinkle_injection_dermal',
 'area_Outstanding_Natural_Beauty',
 'assign_dedicated_account_manager',
 'attention_detail_customer_satisfaction',
 'attention_detail_quality_workmanship',
 'attention_detail_set_apart',
 'attention_detail_work_closely',
 'authorise_regulate_ORG_FCA',
 'base_GPE_Home_Counties',
 'base_GPE_super_mare',
 'befit_reputation_commit_provide',
 'blue_chip_multi_national',
 'boiler_installation_central_heating',
 'boiler_repair_boiler_servicing',
 'boiler_repair_central_heating',
 'boiler_replacement_central_heating',
 'boost_confidence_self_esteem',
 'breakfast_lunch_TIME_tea',
 'broad_discussion_concern_aspiration',
 'btec_programme_study_pupil',
 'build_confidence_self_esteem',
 'build_long_term_mutually',
 'build_long_term_relationship',
 'build_strong_long_last',
 'buyer_seller_landlord_tenant',
 'buying_sell_let_rent',
 'buying_sell_rent_let',
 'capital_gain_tax_inheritance',
 'car_suit_budget_lifestyle',
 'car_suit_sure_update',
 'carbon_steel_stainless_steel',
 'carefully_select_credit_provider',
 'carpet_carpet_tile_vinyl',
 'carpet_cleaning_upholstery_cleaning',
 'central_heating_installation_boiler',
 'ceramic_porcelain_natural_stone',
 'charge_£_MONEY_hour',
 'chauffeur_drive_car_hire',
 'civil_engineering_groundwork_contractor',
 'civil_structural_engineering_consultancy',
 'co_educational_day_boarding',
 'coffee_lunch_TIME_tea',
 'collect_process_personal_datum',
 'come_recommendation_word_mouth',
 'comment_question_feel_free',
 'commercial_refrigeration_air_conditioning',
 'commit_ensure_privacy_protect',
 'commit_safeguard_promote_welfare',
 'competitively_price_high_quality',
 'completely_free_charge_obligation',
 'completeness_accuracy_reliability_suitability',
 'comply_current_Building_Regulations',
 'comply_current_building_regulation',
 'comply_late_building_regulation',
 'confidence_peace_mind_showroom',
 'consistently_deliver_rock_solid',
 'constantly_look_way_improve',
 'contain_website_purpose_reliance',
 'contract_hire_car_leasing',
 'conveniently_locate_TIME_walk',
 'cover_area_QUANTITY_radius',
 'credit_broker_lender_authorise',
 'credit_debit_card_payment',
 'crime_anti_social_behaviour',
 'crème_de_la_crème',
 'curtain_blind_soft_furnishing',
 'datum_transmission_internet_inherently',
 'decision_simplify_compliance_proactively',
 'delicious_option_deliver_straight',
 'designate_Area_Outstanding_Natural',
 'designate_area_outstanding_natural',
 'disability_age_sexual_orientation',
 'dolor_sit_amet_PERSON',
 'domestic_commercial_industrial_agricultural',
 'domestic_commercial_industrial_electrical',
 'domestic_commercial_plumbing_heating',
 'double_bedroom_en_suite',
 'double_glaze_window_door',
 'double_glazed_window_door',
 'double_glazing_triple_glazing',
 'double_glazing_window_door',
 'drastically_reduce_energy_bill',
 'early_stage_start_up',
 'earn_enviable_reputation_dependable',
 'easily_accessible_road_rail',
 'easy_access_major_motorway',
 'easy_use_quote_generator',
 'efficient_cost_effective_manner',
 'electric_vehicle_charge_point',
 'emergency_lighting_fire_alarm',
 'en_suite_shower_room',
 'encounter_problem_unable_resolve',
 'engagement_ring_wedding_ring',
 'engineer_fully_qualified_Gas',
 'environmentally_friendly_cost_effective',
 'error_occur_check_salesperson',
 'establish_DATE_PERSON_Managing',
 'establish_DATE_build_enviable',
 'establish_DATE_current_Managing',
 'establish_DATE_grow_steadily',
 'establish_DATE_husband_wife',
 'estate_agent_let_agent',
 'explain_thing_plain_English',
 'extension_loft_conversion_renovation',
 'extra_mean_buy_car',
 'extra_virgin_olive_oil',
 'extremely_proud_AA_Dealer',
 'favourite_dish_delicious_option',
 'ferrous_non_ferrous_material',
 'ferrous_non_ferrous_metal',
 'ferrous_non_ferrous_scrap',
 'finance_package_suit_pocket',
 'finance_subject_status_income',
 'fire_alarm_emergency_lighting',
 'fire_alarm_intruder_alarm',
 'fire_extinguisher_fire_alarm',
 'flat_screen_tv_tea',
 'flight_flight_inclusive_holiday',
 'forge_long_last_relationship',
 'forge_long_term_relationship',
 'forge_strong_working_relationship',
 'form_DATE_current_Managing',
 'form_DATE_follow_merger',
 'form_DATE_grow_steadily',
 'form_DATE_husband_wife',
 'found_DATE_Managing_Director',
 'found_DATE_PERSON_Managing',
 'found_DATE_PERSON_son',
 'found_DATE_current_Managing',
 'found_DATE_grow_steadily',
 'found_DATE_husband_wife',
 'friendly_staff_extra_mile',
 'fully_aware_datum_privacy',
 'fully_insure_Gas_Safe',
 'fully_insure_public_liability',
 'fully_qualified_Gas_Safe',
 'fully_qualified_fully_insure',
 'fully_qualified_tree_surgeon',
 'garage_door_roller_shutter',
 'gas_central_heating_plumbing',
 'gas_oil_central_heating',
 'general_accountancy_audit_tax',
 'gluten_free_dairy_free',
 'government_department_local_authority',
 'grammar_school_academy_status',
 'graphic_designer_web_developer',
 'grass_cutting_hedge_cutting',
 'hard_landscaping_soft_landscaping',
 'hard_work_work_ethic',
 'hazardous_non_hazardous_waste',
 'heating_engineer_Gas_Safe',
 'heating_ventilation_air_conditioning',
 'hedge_fund_private_equity',
 'hesitate_contact_look_forward',
 'high_level_customer_satisfaction',
 'high_pressure_sale_tactic',
 'high_pressure_water_jet',
 'high_pressure_water_jetting',
 'high_quality_cost_effective',
 'high_quality_hot_tub',
 'high_quality_long_lasting',
 'high_quality_pvc_u',
 'high_quality_self_adhesive',
 'high_standard_advantage_entrust',
 'high_standard_minimum_fuss',
 'home_repossess_repayment_mortgage',
 'hope_find_website_informative',
 'hot_tub_swim_spa',
 'hot_water_central_heating',
 'housing_association_local_authority',
 'huge_range_premium_mid',
 'inclusion_link_necessarily_imply',
 'increase_confidence_self_esteem',
 'independent_financial_adviser_IFAs',
 'indirect_consequential_loss_damage',
 'indoor_heated_swimming_pool',
 'indoor_outdoor_swimming_pool',
 'inform_immediately_problem_arise',
 'initial_consultation_free_charge',
 'instant_online_quote_booking',
 'insurance_company_loss_adjuster',
 'intend_diagnose_treat_cure',
 'interior_exterior_painting_decorate',
 'intruder_alarm_access_control',
 'intruder_alarm_fire_alarm',
 'landlord_gas_safety_certificate',
 'law_government_legislation_attention',
 'lay_Tarmac_red_black',
 'leisure_centre_swimming_pool',
 'limited_company_sole_trader',
 'local_authority_housing_association',
 'locate_GPE_order_online',
 'loft_conversion_extension_refurbishment',
 'loft_conversion_garage_conversion',
 'loft_conversion_home_extension',
 'loft_conversion_house_extension',
 'loft_conversion_kitchen_bathroom',
 'loft_conversion_new_build',
 'long_term_mutually_beneficial',
 'lose_sight_premise_prosperity',
 'loss_damage_include_limitation',
 'loss_damage_whatsoever_arise',
 'loss_datum_profit_arise',
 'major_credit_debit_card',
 'majority_work_come_recommendation',
 'majority_work_come_repeat',
 'majority_work_come_word',
 'majority_work_word_mouth',
 'male_female_driving_instructor',
 'marital_status_sexual_orientation',
 'means_site_earn_advertising',
 'mental_health_learn_disability',
 'mental_health_service_user',
 'method_statement_risk_assessment',
 'mild_steel_aluminium_stainless',
 'mild_steel_stainless_steel',
 'modern_slavery_human_trafficking',
 'mortgage_advice_time_buyer',
 'mortgage_insurance_consumer_credit',
 'mortgage_protection_commit_put',
 'multi_fuel_wood_burn',
 'multi_million_pound_turnover',
 'multi_national_blue_chip',
 'multi_storey_car_park',
 'mutually_beneficial_long_term',
 'mutually_beneficial_relationship_client',
 'mutually_beneficial_working_relationship',
 'new_build_extension_loft',
 'non_ferrous_scrap_metal',
 'non_surgical_facial_aesthetic',
 'notice_privacy_notice_fair',
 'notice_supplement_notice_intend',
 'oblige_maintain_high_standard',
 'operational_efficiency_whilst_simultaneously',
 'order_online_favourite_dish',
 'original_equipment_manufacturer_oem',
 'own_run_husband_wife',
 'patent_trade_mark_attorney',
 'peace_mind_Gas_Safe',
 'peace_mind_fully_insure',
 'peace_mind_job_big',
 'peace_mind_landlord_tenant',
 'peace_mind_surprise_bill',
 'peace_mind_value_money',
 'peace_mind_work_carry',
 'pension_investment_fall_rise',
 'pension_protection_commit_put',
 'perfect_place_relax_unwind',
 'personal_injury_clinical_negligence',
 'personal_service_job_attend',
 'personalized_service_competitive_rate',
 'physical_electronic_managerial_procedure',
 'physical_mental_emotional_spiritual',
 'planning_stage_finished_article',
 'positive_word_mouth_repeat',
 'possible_road_perfect_vehicle',
 'post_traumatic_stress_disorder',
 'precise_depend_circumstance_estimate',
 'precision_sheet_metal_fabrication',
 'prevent_loss_misuse_alteration',
 'privately_own_company_base',
 'privately_own_company_specialise',
 'privately_own_family_run',
 'proactive_manner_help_succeed',
 'professional_service_high_caliber',
 'professional_service_high_calibre',
 'promote_equality_challenge_discrimination',
 'promotional_product_look_forward',
 'prove_track_record_deliver',
 'prove_track_record_successfully',
 'public_employer_liability_insurance',
 'public_liability_employer_liability',
 'purpose_build_QUANTITY_ft',
 'pvc_u_window_door',
 'quality_service_customer_satisfaction',
 'real_estate_private_equity',
 'realise_buy_car_daunting',
 'recommendation_endorse_view_express',
 'recycle_PERCENT_waste_collect',
 'reduce_waste_go_landfill',
 'refrigeration_air_conditioning_equipment',
 'refurbishment_extension_loft_conversion',
 'register_charity_company_limit',
 'regulate_ORG_Commissioner_OISC',
 'regulate_ORG_respect_regulated',
 'regulatory_regime_restrict_consumer',
 'renewable_energy_energy_efficiency',
 'rent_review_lease_renewal',
 'repayment_MORTGAGE_debt_secure',
 'repayment_mortgage_debt_secure',
 'repeat_business_long_term',
 'repeat_business_word_mouth',
 'replacement_window_door_conservatory',
 'request_act_credit_broker',
 'residential_commercial_sale_letting',
 'respect_collaborative_fresh_determined',
 'rest_assure_safe_hand',
 'retirement_planning_inheritance_tax',
 'risk_assessment_method_statement',
 'satisfied_client_attest_superior',
 'save_money_energy_bill',
 'save_money_reduce_carbon',
 'seamless_communication_budgeting_staffing',
 'search_engine_optimisation_seo',
 'secondary_school_ORDINAL_form',
 'secondary_school_academy_status',
 'self_assessment_tax_return',
 'self_cater_holiday_cottage',
 'self_catering_holiday_accommodation',
 'self_catering_holiday_cottage',
 'self_confidence_self_esteem',
 'self_employ_sole_trader',
 'sell_car_possible_road',
 'seller_buyer_landlord_tenant',
 'seller_landlord_buyer_tenant',
 'seo_search_engine_optimisation',
 'shopping_centre_retail_park',
 'short_long_term_assignment',
 'short_long_term_basis',
 'short_long_term_rental',
 'short_medium_long_term',
 'short_term_long_term',
 'showroom_open_DATE_week',
 'significantly_reduce_carbon_footprint',
 'sized_business_sole_trader',
 'small_business_sole_trader',
 'small_business_start_up',
 'small_group_like_minded',
 'small_medium_sized_enterprise',
 'sole_trader_multi_national',
 'sole_trader_partnership_limited',
 'sole_trader_self_employ',
 'sole_trader_small_medium',
 'son_PERSON_Managing_Director',
 'stainless_steel_aluminium_mild',
 'stainless_steel_exhaust_system',
 'stainless_steel_mild_steel',
 'standard_BS_EN_ISO',
 'state_art_QUANTITY_ft',
 'status_income_write_quotation',
 'steel_stainless_steel_aluminium',
 'stick_deadline_quality_workmanship',
 'stocklist_regularly_worth_give',
 'strict_health_safety_policy',
 'structured_cabling_fibre_optic',
 'sub_contractor_main_contractor',
 'suit_sure_update_stocklist',
 'surround_area_QUANTITY_radius',
 'swimming_pool_hot_tub',
 't_shirt_polo_shirt',
 'tax_advice_help_achieve',
 'tea_coffee_making_facility',
 'thank_visit_look_forward',
 'thing_little_bit_differently',
 'tier_NORP_football_league',
 'timely_cost_effective_manner',
 'touch_arrange_accurate_valuation',
 'traditional_value_strive_forefront',
 'traffic_light_turn_leave',
 'trust_mutual_respect_collaborative',
 'turnover_excess_£_MONEY',
 'tyre_brand_suit_pocket',
 'tyre_fit_payment_normal',
 'ultra_high_net_worth',
 'unfortunately_datum_accurate_valuation',
 'unlikely_event_go_wrong',
 'update_page_check_page',
 'upmost_professionalism_high_standard',
 'upstream_oil_gas_industry',
 'upvc_window_door_conservatory',
 'utmost_care_attention_detail',
 'utmost_professionalism_high_standard',
 'valuable_strategic_insight_direction',
 'value_diversity_promote_equality',
 'vehicle_vehicle_maintenance_friendly',
 'venture_capital_private_equity',
 'voluntary_community_social_enterprise',
 'voluntary_organisation_register_charity',
 'warranty_kind_express_imply',
 'washing_machine_tumble_dryer',
 'wedding_dress_bridesmaid_dress',
 'wide_range_blue_chip',
 'wide_range_extra_curricular',
 'window_door_conservatory_orangery',
 'window_door_conservatory_porch',
 'window_door_conservatory_roofline',
 'window_door_curtain_walling',
 'window_door_porch_conservatory',
 'window_door_suddenly_focal',
 'woefully_outdated_inefficient_use',
 'wood_burning_multi_fuel',
 'working_environment_foster_continuous',
 'worth_give_look_website',
 'write_quotation_request_act',
 'wrought_iron_gate_railing',
 'young_people_age_16',
 '£_500_£_MONEY',
 '£_MONEY_+_vat',
 '£_MONEY_plus_vat',
 '£_MONEY_public_liability',
 '£_MONEY_£_MONEY'}
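
To make the mechanism concrete, here is a minimal sketch of how merging entities first and n-gramming second can produce tokens with four underscore-separated parts. It is not the pipeline's code: `merge_entities`, `join_phrases`, `n_parts`, the toy documents, and the hand-picked `phrases` set are all hypothetical stand-ins for the spaCy entity merging and the statistical n-gramming step.

```python
from collections import Counter


def merge_entities(tokens, entities):
    """Replace each multi-word entity span with one underscore-joined token.

    Hypothetical stand-in for entity merging; `entities` is a list of token lists.
    """
    merged, i = [], 0
    while i < len(tokens):
        for ent in entities:
            if tokens[i:i + len(ent)] == ent:
                merged.append("_".join(ent))
                i += len(ent)
                break
        else:  # no entity starts at position i
            merged.append(tokens[i])
            i += 1
    return merged


def join_phrases(tokens, phrases):
    """Greedily join known phrases of up to 3 tokens into a single token.

    Hypothetical stand-in for statistical n-gramming: here the "learned"
    phrases are simply supplied as a set of token tuples.
    """
    out, i = [], 0
    while i < len(tokens):
        for n in (3, 2):  # prefer trigrams over bigrams
            window = tuple(tokens[i:i + n])
            if len(window) == n and window in phrases:
                out.append("_".join(window))
                i += n
                break
        else:  # no phrase starts at position i
            out.append(tokens[i])
            i += 1
    return out


def n_parts(token):
    """Number of underscore-separated parts in a token."""
    return token.count("_") + 1


# Toy data (assumed for illustration only).
docs = [
    ["fully", "qualified", "Gas", "Safe", "engineer"],
    ["fully", "qualified", "Gas", "Safe", "plumber"],
    ["extra", "virgin", "olive", "oil"],
]
entities = [["Gas", "Safe"]]
phrases = {("fully", "qualified", "Gas_Safe"), ("extra", "virgin")}

processed = [join_phrases(merge_entities(doc, entities), phrases) for doc in docs]

# Count tokens that look like "4"-grams: total occurrences vs unique tokens.
four_part = Counter(tok for doc in processed for tok in doc if n_parts(tok) == 4)
total = sum(four_part.values())
unique = len(four_part)
```

In this toy case the trigram `("fully", "qualified", "Gas_Safe")` contains an already-merged two-word entity, so the joined token `fully_qualified_Gas_Safe` has four parts even though the n-gramming step never joins more than three tokens. The final `Counter` also shows why a total count of such tokens can be much larger than the number of unique ones.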