meghdadFar / wordview

A Python package for Exploratory Data Analysis (EDA) for text-based data.
MIT License

Bug Report: Sometimes, MWEs are wrong or misspelled phrases #138

Open meghdadFar opened 6 months ago

meghdadFar commented 6 months ago

Description

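When the corpus contains misspelled or foreign words and phrases, the top-ranked MWEs often end up being those very rare, misspelled expressions. This is a known weakness of PMI-based ranking: PMI tends to overestimate the association between low-frequency words, so a pair that occurs only once can outscore a genuinely common expression.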
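To illustrate the effect, here is a minimal, self-contained sketch that computes bigram PMI on a toy corpus. It does not use wordview's actual implementation; the function name `pmi_scores` and the toy data are assumptions made only for this illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return PMI (log base 2) for every adjacent token pair in `tokens`.

    Hypothetical helper for illustration; not part of wordview.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus: a common expression repeated 50 times, plus one misspelled phrase.
tokens = ("take a walk . " * 50 + "lock the doooor . ").split()
scores = pmi_scores(tokens)

common = scores[("take", "a")]      # pair seen 50 times
rare = scores[("the", "doooor")]    # pair seen once, contains a misspelling

# The one-off misspelled pair receives the higher PMI score.
print(f"common: {common:.2f}, rare: {rare:.2f}")
```

Because both words of the rare pair occur only once, their joint probability is almost as large as the product of their individual probabilities allows, which inflates the score.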

To Reproduce

Steps to reproduce the behavior: Run MWE extraction on a corpus that contains misspelled or foreign words and check the top-ranked results.

Expected behavior

Top MWE results should be common expressions consisting of correct words.

Examples

Light Verb Constructions: LOCK THE DOOOOR

Possible Solutions

The proposed solution is to check each component of a candidate MWE against a lexicon of the selected language, so that only candidates made up of actual words (rather than misspelled or made-up ones) are kept.
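A rough sketch of what such a check could look like is below. The function `filter_by_lexicon` and the tiny hand-written lexicon are hypothetical and exist only for this example; a real fix would need a proper word list for the selected language and would have to fit wordview's existing extraction code.

```python
def filter_by_lexicon(candidates, lexicon):
    """Keep only candidates whose whitespace-separated tokens are all in `lexicon`.

    Hypothetical helper for illustration; not part of wordview.
    Matching is case-insensitive.
    """
    known = {word.lower() for word in lexicon}
    return [
        mwe
        for mwe in candidates
        if all(token.lower() in known for token in mwe.split())
    ]


# Assumed toy lexicon; in practice this would be a full word list.
lexicon = {"lock", "the", "door", "take", "a", "walk"}
candidates = ["LOCK THE DOOOOR", "take a walk", "lock the door"]

kept = filter_by_lexicon(candidates, lexicon)
print(kept)  # ['take a walk', 'lock the door']
```

One trade-off of this approach is that valid MWEs containing proper nouns or domain-specific terms missing from the lexicon would also be dropped, so the choice of lexicon matters.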