sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License
2.01k stars 204 forks

clean_country with fuzzy_dist >= 2 converts None to Niue #785

Closed villekr closed 2 years ago

villekr commented 2 years ago

Describe the bug

clean_country converts a None value to "Niue" when fuzzy_dist is set to 2 or higher.

To Reproduce

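import numpy as np
import pandas as pd
from dataprep.clean import clean_country

# Rows 4 and 5 hold missing values: np.nan and None.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "country": ["United States", "Kanada", "Fimland", np.nan, None],
})
df = clean_country(
    df=df,
    column="country",
    input_format="auto",
    output_format="name",
    fuzzy_dist=2,  # 0 or 1 does not trigger the bug
    strict=False,
    inplace=False,
    errors="coerce",
    report=True,
    progress=True,
)
df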
id  country        country_clean
1   United States  United States
2   Kanada         Canada
3   Fimland        Finland
4   NaN            NaN
5   None           Niue

Expected behavior

clean_country should return NaN for None values regardless of the fuzzy_dist value. With fuzzy_dist set to 0 or 1, None is correctly converted to NaN.
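
One possible explanation, stated as an assumption and not confirmed against the dataprep source: if None is turned into the string "None" before fuzzy matching, it is two character substitutions away from "Niue" (o -> i, n -> u). That would fit the threshold seen above, where the problem only appears at fuzzy_dist >= 2. The sketch below checks those distances with a hand-written Levenshtein helper. The helper is hypothetical, is not part of dataprep, and may not match how clean_country actually measures distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


# Assumption: the missing value is stringified and lowercased before matching.
print(levenshtein(str(None).lower(), "niue"))  # 2
print(levenshtein("kanada", "canada"))         # 1
print(levenshtein("fimland", "finland"))       # 1
```

If that is what happens, skipping null values before the fuzzy match would keep None out of the comparison altogether.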

Additional context None