bishax closed 2 years ago
@Juan-Mateos Addressed comments and rebased onto latest 15_glass_house
N.B. Set a full run 2343 going after rebasing, just to be sure (check status with `python industrial_taxonomy/pipeline/glass_description_ngrams/nlp_flow.py --environment=conda batch list`).
* [ ] I tested `get_description_tokens()` and it works. I was, however, surprised to find that it returns 17818 4-grams, whereas the instructions in the `README` say that we are only extracting bigrams and trigrams. Is this a typo in the `README`?
Yes, it appears to return 481 unique "4"-grams. This is because spaCy merges entities into a single token and we then also perform statistical n-gramming (for n=3), so a single n-gram can end up containing more than three underscore-separated words.
You can see this from the set of 4-grams:
{'Air_Source_Heat_Pumps',
'Area_Outstanding_Natural_Beauty',
'BS_EN_ISO_9001',
'BS_EN_ISO_9001:2008',
'Business_Improvement_District_BID',
'CARDINAL_g_CARDINAL_g',
'CARDINAL_seater_CARDINAL_seater',
'CARDINAL_t_CARDINAL_t',
'Car_Showroom_Cleaning_pre',
'Certificate_booking_ATOL_protect',
'Cognitive_Behavioural_Therapy_CBT',
'Conduct_package_flight_plus',
'Construction_Skills_Certification_Scheme',
'Crazy_Paving_slabbing_Hot',
'DATE_Foundation_Stage_Curriculum',
'Director_PERSON_accept_invitation',
'Early_Years_Foundation_Stage',
'Energy_Performance_Certificates_epc',
'FRN_CARDINAL_finance_subject',
'GPE_GPE_Home_Counties',
'Gas_Safe_OFTEC_register',
'Gas_Safe_register_plumber',
'Grade_1_list_building',
'Grade_2_list_building',
'Grade_CARDINAL_list_building',
'Grade_II_list_building',
'Headquartered_GPE_office_GPE',
'High_Net_Worth_Individuals',
'James_Place_Wealth_Management',
'LOC_GPE_Home_Counties',
'MOT_testing_car_servicing',
'MOT_testing_servicing_repair',
'Managing_Director_DATE_experience',
'Managing_Director_Mr_PERSON',
'Managing_Director_PERSON_PERSON',
'Managing_Director_PERSON_say',
'Method_statement_Risk_Assessments',
'Method_statement_risk_assessment',
'Mr._PERSON_DATE_experience',
'Mr._PERSON_Mr._PERSON',
'Mr_PERSON_Managing_Director',
'Mr_PERSON_Mr_PERSON',
'NORP_Sign_Language_BSL',
'ORDINAL_form_academy_status',
'ORDINAL_hand_car_van',
'ORDINAL_tier_NORP_football',
'Owner_PERSON_DATE_experience',
'PVCu_window_door_conservatory',
'Pizza_constantly_strive_improve',
'Place_Wealth_Management_Group',
'Public_Liability_Insurance_MONEY',
'Queen_Elizabeth_Olympic_Park',
'Regulatory_Reform_Fire_Safety',
'Risk_Assessments_Method_Statements',
'Risk_Assessments_Method_statement',
'Senior_Partner_Practice_St.',
'Sir_PERSON_PERSON_PERSON',
'Sir_PERSON_Sir_PERSON',
'South_Downs_National_Park',
'TIME_day_DATE_week',
'Window_Cleaning_Kitchen_Cleaning',
'accept_credit_debit_card',
'accept_major_credit_card',
'accept_major_credit_debit',
'accept_major_debit_credit',
'accountancy_service_small_medium',
'accurate_valuation_take_account',
'act_credit_broker_lender',
'addition_traditional_auditing_accounting',
'adhere_strict_code_conduct',
'air_conditioning_heat_pump',
'air_conditioning_heating_ventilation',
'air_conditioning_refrigeration_ventilation',
'air_freight_sea_freight',
'air_source_heat_pump',
'aluminium_window_door_curtain',
'ample_free_car_parking',
'annual_turnover_excess_MONEY',
'answer_question_look_forward',
'answer_question_smile_question',
'anti_wrinkle_injection_dermal',
'area_Outstanding_Natural_Beauty',
'assign_dedicated_account_manager',
'attention_detail_customer_satisfaction',
'attention_detail_quality_workmanship',
'attention_detail_set_apart',
'attention_detail_work_closely',
'authorise_regulate_ORG_FCA',
'base_GPE_Home_Counties',
'base_GPE_super_mare',
'befit_reputation_commit_provide',
'blue_chip_multi_national',
'boiler_installation_central_heating',
'boiler_repair_boiler_servicing',
'boiler_repair_central_heating',
'boiler_replacement_central_heating',
'boost_confidence_self_esteem',
'breakfast_lunch_TIME_tea',
'broad_discussion_concern_aspiration',
'btec_programme_study_pupil',
'build_confidence_self_esteem',
'build_long_term_mutually',
'build_long_term_relationship',
'build_strong_long_last',
'buyer_seller_landlord_tenant',
'buying_sell_let_rent',
'buying_sell_rent_let',
'capital_gain_tax_inheritance',
'car_suit_budget_lifestyle',
'car_suit_sure_update',
'carbon_steel_stainless_steel',
'carefully_select_credit_provider',
'carpet_carpet_tile_vinyl',
'carpet_cleaning_upholstery_cleaning',
'central_heating_installation_boiler',
'ceramic_porcelain_natural_stone',
'charge_£_MONEY_hour',
'chauffeur_drive_car_hire',
'civil_engineering_groundwork_contractor',
'civil_structural_engineering_consultancy',
'co_educational_day_boarding',
'coffee_lunch_TIME_tea',
'collect_process_personal_datum',
'come_recommendation_word_mouth',
'comment_question_feel_free',
'commercial_refrigeration_air_conditioning',
'commit_ensure_privacy_protect',
'commit_safeguard_promote_welfare',
'competitively_price_high_quality',
'completely_free_charge_obligation',
'completeness_accuracy_reliability_suitability',
'comply_current_Building_Regulations',
'comply_current_building_regulation',
'comply_late_building_regulation',
'confidence_peace_mind_showroom',
'consistently_deliver_rock_solid',
'constantly_look_way_improve',
'contain_website_purpose_reliance',
'contract_hire_car_leasing',
'conveniently_locate_TIME_walk',
'cover_area_QUANTITY_radius',
'credit_broker_lender_authorise',
'credit_debit_card_payment',
'crime_anti_social_behaviour',
'crème_de_la_crème',
'curtain_blind_soft_furnishing',
'datum_transmission_internet_inherently',
'decision_simplify_compliance_proactively',
'delicious_option_deliver_straight',
'designate_Area_Outstanding_Natural',
'designate_area_outstanding_natural',
'disability_age_sexual_orientation',
'dolor_sit_amet_PERSON',
'domestic_commercial_industrial_agricultural',
'domestic_commercial_industrial_electrical',
'domestic_commercial_plumbing_heating',
'double_bedroom_en_suite',
'double_glaze_window_door',
'double_glazed_window_door',
'double_glazing_triple_glazing',
'double_glazing_window_door',
'drastically_reduce_energy_bill',
'early_stage_start_up',
'earn_enviable_reputation_dependable',
'easily_accessible_road_rail',
'easy_access_major_motorway',
'easy_use_quote_generator',
'efficient_cost_effective_manner',
'electric_vehicle_charge_point',
'emergency_lighting_fire_alarm',
'en_suite_shower_room',
'encounter_problem_unable_resolve',
'engagement_ring_wedding_ring',
'engineer_fully_qualified_Gas',
'environmentally_friendly_cost_effective',
'error_occur_check_salesperson',
'establish_DATE_PERSON_Managing',
'establish_DATE_build_enviable',
'establish_DATE_current_Managing',
'establish_DATE_grow_steadily',
'establish_DATE_husband_wife',
'estate_agent_let_agent',
'explain_thing_plain_English',
'extension_loft_conversion_renovation',
'extra_mean_buy_car',
'extra_virgin_olive_oil',
'extremely_proud_AA_Dealer',
'favourite_dish_delicious_option',
'ferrous_non_ferrous_material',
'ferrous_non_ferrous_metal',
'ferrous_non_ferrous_scrap',
'finance_package_suit_pocket',
'finance_subject_status_income',
'fire_alarm_emergency_lighting',
'fire_alarm_intruder_alarm',
'fire_extinguisher_fire_alarm',
'flat_screen_tv_tea',
'flight_flight_inclusive_holiday',
'forge_long_last_relationship',
'forge_long_term_relationship',
'forge_strong_working_relationship',
'form_DATE_current_Managing',
'form_DATE_follow_merger',
'form_DATE_grow_steadily',
'form_DATE_husband_wife',
'found_DATE_Managing_Director',
'found_DATE_PERSON_Managing',
'found_DATE_PERSON_son',
'found_DATE_current_Managing',
'found_DATE_grow_steadily',
'found_DATE_husband_wife',
'friendly_staff_extra_mile',
'fully_aware_datum_privacy',
'fully_insure_Gas_Safe',
'fully_insure_public_liability',
'fully_qualified_Gas_Safe',
'fully_qualified_fully_insure',
'fully_qualified_tree_surgeon',
'garage_door_roller_shutter',
'gas_central_heating_plumbing',
'gas_oil_central_heating',
'general_accountancy_audit_tax',
'gluten_free_dairy_free',
'government_department_local_authority',
'grammar_school_academy_status',
'graphic_designer_web_developer',
'grass_cutting_hedge_cutting',
'hard_landscaping_soft_landscaping',
'hard_work_work_ethic',
'hazardous_non_hazardous_waste',
'heating_engineer_Gas_Safe',
'heating_ventilation_air_conditioning',
'hedge_fund_private_equity',
'hesitate_contact_look_forward',
'high_level_customer_satisfaction',
'high_pressure_sale_tactic',
'high_pressure_water_jet',
'high_pressure_water_jetting',
'high_quality_cost_effective',
'high_quality_hot_tub',
'high_quality_long_lasting',
'high_quality_pvc_u',
'high_quality_self_adhesive',
'high_standard_advantage_entrust',
'high_standard_minimum_fuss',
'home_repossess_repayment_mortgage',
'hope_find_website_informative',
'hot_tub_swim_spa',
'hot_water_central_heating',
'housing_association_local_authority',
'huge_range_premium_mid',
'inclusion_link_necessarily_imply',
'increase_confidence_self_esteem',
'independent_financial_adviser_IFAs',
'indirect_consequential_loss_damage',
'indoor_heated_swimming_pool',
'indoor_outdoor_swimming_pool',
'inform_immediately_problem_arise',
'initial_consultation_free_charge',
'instant_online_quote_booking',
'insurance_company_loss_adjuster',
'intend_diagnose_treat_cure',
'interior_exterior_painting_decorate',
'intruder_alarm_access_control',
'intruder_alarm_fire_alarm',
'landlord_gas_safety_certificate',
'law_government_legislation_attention',
'lay_Tarmac_red_black',
'leisure_centre_swimming_pool',
'limited_company_sole_trader',
'local_authority_housing_association',
'locate_GPE_order_online',
'loft_conversion_extension_refurbishment',
'loft_conversion_garage_conversion',
'loft_conversion_home_extension',
'loft_conversion_house_extension',
'loft_conversion_kitchen_bathroom',
'loft_conversion_new_build',
'long_term_mutually_beneficial',
'lose_sight_premise_prosperity',
'loss_damage_include_limitation',
'loss_damage_whatsoever_arise',
'loss_datum_profit_arise',
'major_credit_debit_card',
'majority_work_come_recommendation',
'majority_work_come_repeat',
'majority_work_come_word',
'majority_work_word_mouth',
'male_female_driving_instructor',
'marital_status_sexual_orientation',
'means_site_earn_advertising',
'mental_health_learn_disability',
'mental_health_service_user',
'method_statement_risk_assessment',
'mild_steel_aluminium_stainless',
'mild_steel_stainless_steel',
'modern_slavery_human_trafficking',
'mortgage_advice_time_buyer',
'mortgage_insurance_consumer_credit',
'mortgage_protection_commit_put',
'multi_fuel_wood_burn',
'multi_million_pound_turnover',
'multi_national_blue_chip',
'multi_storey_car_park',
'mutually_beneficial_long_term',
'mutually_beneficial_relationship_client',
'mutually_beneficial_working_relationship',
'new_build_extension_loft',
'non_ferrous_scrap_metal',
'non_surgical_facial_aesthetic',
'notice_privacy_notice_fair',
'notice_supplement_notice_intend',
'oblige_maintain_high_standard',
'operational_efficiency_whilst_simultaneously',
'order_online_favourite_dish',
'original_equipment_manufacturer_oem',
'own_run_husband_wife',
'patent_trade_mark_attorney',
'peace_mind_Gas_Safe',
'peace_mind_fully_insure',
'peace_mind_job_big',
'peace_mind_landlord_tenant',
'peace_mind_surprise_bill',
'peace_mind_value_money',
'peace_mind_work_carry',
'pension_investment_fall_rise',
'pension_protection_commit_put',
'perfect_place_relax_unwind',
'personal_injury_clinical_negligence',
'personal_service_job_attend',
'personalized_service_competitive_rate',
'physical_electronic_managerial_procedure',
'physical_mental_emotional_spiritual',
'planning_stage_finished_article',
'positive_word_mouth_repeat',
'possible_road_perfect_vehicle',
'post_traumatic_stress_disorder',
'precise_depend_circumstance_estimate',
'precision_sheet_metal_fabrication',
'prevent_loss_misuse_alteration',
'privately_own_company_base',
'privately_own_company_specialise',
'privately_own_family_run',
'proactive_manner_help_succeed',
'professional_service_high_caliber',
'professional_service_high_calibre',
'promote_equality_challenge_discrimination',
'promotional_product_look_forward',
'prove_track_record_deliver',
'prove_track_record_successfully',
'public_employer_liability_insurance',
'public_liability_employer_liability',
'purpose_build_QUANTITY_ft',
'pvc_u_window_door',
'quality_service_customer_satisfaction',
'real_estate_private_equity',
'realise_buy_car_daunting',
'recommendation_endorse_view_express',
'recycle_PERCENT_waste_collect',
'reduce_waste_go_landfill',
'refrigeration_air_conditioning_equipment',
'refurbishment_extension_loft_conversion',
'register_charity_company_limit',
'regulate_ORG_Commissioner_OISC',
'regulate_ORG_respect_regulated',
'regulatory_regime_restrict_consumer',
'renewable_energy_energy_efficiency',
'rent_review_lease_renewal',
'repayment_MORTGAGE_debt_secure',
'repayment_mortgage_debt_secure',
'repeat_business_long_term',
'repeat_business_word_mouth',
'replacement_window_door_conservatory',
'request_act_credit_broker',
'residential_commercial_sale_letting',
'respect_collaborative_fresh_determined',
'rest_assure_safe_hand',
'retirement_planning_inheritance_tax',
'risk_assessment_method_statement',
'satisfied_client_attest_superior',
'save_money_energy_bill',
'save_money_reduce_carbon',
'seamless_communication_budgeting_staffing',
'search_engine_optimisation_seo',
'secondary_school_ORDINAL_form',
'secondary_school_academy_status',
'self_assessment_tax_return',
'self_cater_holiday_cottage',
'self_catering_holiday_accommodation',
'self_catering_holiday_cottage',
'self_confidence_self_esteem',
'self_employ_sole_trader',
'sell_car_possible_road',
'seller_buyer_landlord_tenant',
'seller_landlord_buyer_tenant',
'seo_search_engine_optimisation',
'shopping_centre_retail_park',
'short_long_term_assignment',
'short_long_term_basis',
'short_long_term_rental',
'short_medium_long_term',
'short_term_long_term',
'showroom_open_DATE_week',
'significantly_reduce_carbon_footprint',
'sized_business_sole_trader',
'small_business_sole_trader',
'small_business_start_up',
'small_group_like_minded',
'small_medium_sized_enterprise',
'sole_trader_multi_national',
'sole_trader_partnership_limited',
'sole_trader_self_employ',
'sole_trader_small_medium',
'son_PERSON_Managing_Director',
'stainless_steel_aluminium_mild',
'stainless_steel_exhaust_system',
'stainless_steel_mild_steel',
'standard_BS_EN_ISO',
'state_art_QUANTITY_ft',
'status_income_write_quotation',
'steel_stainless_steel_aluminium',
'stick_deadline_quality_workmanship',
'stocklist_regularly_worth_give',
'strict_health_safety_policy',
'structured_cabling_fibre_optic',
'sub_contractor_main_contractor',
'suit_sure_update_stocklist',
'surround_area_QUANTITY_radius',
'swimming_pool_hot_tub',
't_shirt_polo_shirt',
'tax_advice_help_achieve',
'tea_coffee_making_facility',
'thank_visit_look_forward',
'thing_little_bit_differently',
'tier_NORP_football_league',
'timely_cost_effective_manner',
'touch_arrange_accurate_valuation',
'traditional_value_strive_forefront',
'traffic_light_turn_leave',
'trust_mutual_respect_collaborative',
'turnover_excess_£_MONEY',
'tyre_brand_suit_pocket',
'tyre_fit_payment_normal',
'ultra_high_net_worth',
'unfortunately_datum_accurate_valuation',
'unlikely_event_go_wrong',
'update_page_check_page',
'upmost_professionalism_high_standard',
'upstream_oil_gas_industry',
'upvc_window_door_conservatory',
'utmost_care_attention_detail',
'utmost_professionalism_high_standard',
'valuable_strategic_insight_direction',
'value_diversity_promote_equality',
'vehicle_vehicle_maintenance_friendly',
'venture_capital_private_equity',
'voluntary_community_social_enterprise',
'voluntary_organisation_register_charity',
'warranty_kind_express_imply',
'washing_machine_tumble_dryer',
'wedding_dress_bridesmaid_dress',
'wide_range_blue_chip',
'wide_range_extra_curricular',
'window_door_conservatory_orangery',
'window_door_conservatory_porch',
'window_door_conservatory_roofline',
'window_door_curtain_walling',
'window_door_porch_conservatory',
'window_door_suddenly_focal',
'woefully_outdated_inefficient_use',
'wood_burning_multi_fuel',
'working_environment_foster_continuous',
'worth_give_look_website',
'write_quotation_request_act',
'wrought_iron_gate_railing',
'young_people_age_16',
'£_500_£_MONEY',
'£_MONEY_+_vat',
'£_MONEY_plus_vat',
'£_MONEY_public_liability',
'£_MONEY_£_MONEY'}
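To make the mechanism above concrete, here is a minimal, pure-Python sketch of how entity merging followed by n-gramming can yield a token that looks like a 4-gram. It is a toy illustration, not the pipeline's code: `merge_spans`, `join_phrases`, and `apparent_order` are hypothetical helpers, the input sentence is made up, and the phrase set is hard-coded where the real flow would learn phrases statistically. It also shows how a count of occurrences can differ from a count of unique tokens, which may account for the gap between 17818 and 481.

```python
from collections import Counter


def merge_spans(tokens, spans):
    """Join each (start, end) span of tokens into one underscore-joined token.

    Stands in for entity merging; spans are end-exclusive and non-overlapping.
    """
    starts = {start: end for start, end in spans}
    merged, i = [], 0
    while i < len(tokens):
        end = starts.get(i, i + 1)
        merged.append("_".join(tokens[i:end]))
        i = end
    return merged


def join_phrases(tokens, phrases, max_n=3):
    """Greedily join runs of up to max_n tokens that appear in `phrases`.

    Stands in for statistical n-gramming; `phrases` is a set of token tuples.
    """
    joined, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in phrases:
                joined.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so keep the token unchanged.
            joined.append(tokens[i])
            i += 1
    return joined


def apparent_order(token):
    """Number of underscore-separated parts, i.e. the n a reader would infer."""
    return len(token.split("_"))


# Toy input, not taken from the real data.
raw = ["walk", "South", "Downs", "National", "Park",
       "walk", "South", "Downs", "National", "Park"]

# Step 1: the two-word entity becomes a single token, "South_Downs".
after_entities = merge_spans(raw, spans=[(1, 3), (6, 8)])

# Step 2: a trigram of *tokens* is joined, but it spans four words.
after_ngrams = join_phrases(
    after_entities, phrases={("South_Downs", "National", "Park")}
)

four_part = [t for t in after_ngrams if apparent_order(t) == 4]
counts = Counter(four_part)
print(after_ngrams)
print(len(four_part), "occurrences,", len(counts), "unique")
```

In this sketch the trigram is built from three tokens, yet splitting on underscores reports four parts, which is the same effect as the "4"-grams listed above.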
Closes #17
Checklist:
* `notebooks/`
* `flake8` and addressed any linter errors
* `pre-commit` and addressed any issues not automatically fixed
* `dev` (or merged any new changes from `dev`)
* `README`s
* `output/reports/`