meghdadFar / wordview

A Python package for Exploratory Data Analysis (EDA) for text-based data.
MIT License

Bug Report: Sometimes, MWEs are uncommon phrases #139

Closed: meghdadFar closed this issue 6 months ago

meghdadFar commented 6 months ago

Description

In many cases, the top-ranked extracted MWEs are very uncommon phrases. This happens when both the MWE and some or all of its component words have a very low frequency, which inflates the PMI score.
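To illustrate the effect, here is a minimal sketch that computes PMI from raw counts for two bigrams. It is not wordview's implementation: the `pmi` helper, the corpus size, and all counts are hypothetical and chosen only to show how low frequencies push the score up.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information of a bigram, computed from raw counts."""
    p_pair = pair_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return math.log2(p_pair / (p_w1 * p_w2))


N = 1_000_000  # hypothetical corpus size in tokens

# A rare bigram: each word occurs once, and only in this pair.
rare = pmi(pair_count=1, w1_count=1, w2_count=1, total=N)

# A common bigram: the pair is frequent, but so are its words on their own.
common = pmi(pair_count=500, w1_count=2000, w2_count=5000, total=N)

print(f"rare bigram PMI:   {rare:.2f}")    # log2(1,000,000), about 19.93
print(f"common bigram PMI: {common:.2f}")  # log2(50), about 5.64
```

A pair seen exactly once, made of words seen exactly once, reaches the maximum possible PMI for the corpus (log2 of the corpus size), so it outranks a well-established expression.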

To Reproduce

Steps to reproduce the behavior: Simply run MWE extraction and check the results.

Expected behavior

The top MWE results should be common expressions, not rare or unknown phrases.

Examples

whip these ninjas

Possible Solutions

Add a frequency threshold parameter that defaults to 1. MWE candidates observed fewer times than this threshold are discarded.
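The proposed filter could look roughly like the sketch below. The function name `filter_by_frequency`, the parameter name `min_freq`, and the example counts are assumptions made for illustration; they are not taken from wordview's code.

```python
from collections import Counter


def filter_by_frequency(candidate_counts, min_freq=1):
    """Keep only MWE candidates observed at least `min_freq` times."""
    return {
        candidate: count
        for candidate, count in candidate_counts.items()
        if count >= min_freq
    }


# Hypothetical candidate counts, for illustration only.
counts = Counter(
    {"whip these ninjas": 1, "swimming pool": 42, "look forward to": 17}
)

kept = filter_by_frequency(counts, min_freq=5)
print(kept)  # the rare candidate "whip these ninjas" is dropped
```

With the default of 1, every observed candidate passes, so existing behavior is unchanged until a caller raises the threshold. Filtering before PMI scoring would also avoid computing scores for candidates that are going to be discarded.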