sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

error using clean_lat_lon #907

Open arfriedman opened 2 years ago

arfriedman commented 2 years ago

Unfortunately, clean_lat_long raises an error in dataprep 0.4.3a1 with Python 3.10.4.

The problem occurs for me using the documentation example:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    "lat_long":
    [(41.5, -81.0), "41.5;-81.0", "41.5,-81.0", "41.5 -81.0",
     "41.5° N, 81.0° W", "41.5 S;81.0 E", "-41.5 S;81.0 E",
     "23 26m 22s N 23 27m 30s E", "23 26' 22\" N 23 27' 30\" E",
     "UT: N 39°20' 0'' / W 74°35' 0''", "hello", np.nan, "NULL"]
})
from dataprep.clean import clean_lat_long
clean_lat_long(df, "lat_long")

It returns this error:

File ~/miniconda3/envs/AQUATIC/lib/python3.10/site-packages/dataprep/clean/clean_lat_long.py:172, in clean_lat_long(df, lat_long, lat_col, long_col, output_format, split, inplace, errors, report, progress)
    167     raise ValueError(
    168         f'output_format {output_format} is invalid, it must be "dd", "ddh", "dm", or "dms"'
    169     )
    171 # convert to dask
--> 172 df = to_dask(df)
    174 # To clean, create a new column "clean_code_tup" which contains
    175 # the cleaned values and code indicating how the initial value was
    176 # changed in a tuple. Then split the column of tuples and count the
    177 # amount of different codes to produce the report
    178 if lat_long:
    179     # clean a latitude and longitude column

File ~/miniconda3/envs/AQUATIC/lib/python3.10/site-packages/dataprep/clean/utils.py:73, in to_dask(df)
     71 df_size = df.memory_usage(deep=True).sum()
     72 npartitions = np.ceil(df_size / 128 / 1024 / 1024)  # 128 MB partition size
---> 73 return dd.from_pandas(df, npartitions=npartitions)

File ~/miniconda3/envs/AQUATIC/lib/python3.10/site-packages/dask/dataframe/io/io.py:236, in from_pandas(data, npartitions, chunksize, sort, name)
    234 if none_chunksize:
    235     if not isinstance(npartitions, int):
--> 236         raise TypeError(
    237             "Please provide npartitions as an int, or possibly as None if you specify chunksize."
    238         )
    239     chunksize = int(ceil(nrows / npartitions))
    240 elif not isinstance(chunksize, int):

TypeError: Please provide npartitions as an int, or possibly as None if you specify chunksize.

I encounter the problem with both the conda-forge version on Linux and the pip version on Windows.

Thanks much, Andrew

qidanrui commented 2 years ago

Hi @arfriedman. Thank you for using our library and reporting the issue. Others encountered a similar issue in #903 and posted a solution on Stack Overflow (https://stackoverflow.com/questions/72453608/dataprep-eda-typeerror-please-provide-npartitions-as-an-int-or-possibly-as-non). We have also fixed it in the current develop branch, which you can install with:

pip install -U git+https://github.com/sfu-db/dataprep.git@develop

Either approach should solve the issue you encountered.
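For context, the traceback in the report suggests why the call fails: npartitions is computed with np.ceil, which returns a NumPy float, while dask's from_pandas checks isinstance(npartitions, int). The following is only a minimal sketch of an integer partition count, not the actual change made in the develop branch. The helper name partition_count and the constant PARTITION_BYTES are hypothetical, and it uses math.ceil from the standard library as an assumed substitute.

```python
import math

# 128 MB partition size, taken from the comment in the traceback.
PARTITION_BYTES = 128 * 1024 * 1024


def partition_count(df_size_bytes: int) -> int:
    """Hypothetical helper: number of partitions as a built-in int (at least 1)."""
    # math.ceil returns a built-in int, whereas np.ceil returns a NumPy float,
    # which does not pass an isinstance(npartitions, int) check.
    return max(1, math.ceil(df_size_bytes / PARTITION_BYTES))


print(partition_count(1_000))              # a tiny frame still gets one partition
print(partition_count(300 * 1024 * 1024))  # 300 MB spans three 128 MB partitions
```

Wrapping the existing expression in int(...) would presumably have the same effect; the point is only that the value handed to from_pandas has to be a built-in int.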

arfriedman commented 2 years ago

Thank you @qidanrui

arfriedman commented 2 years ago

I installed dataprep 0.4.5, which solves the problem above -- thank you! However, it now returns the following warning:

import pandas as pd
from dataprep.clean import clean_lat_long
df = pd.DataFrame({'coord': ['51° 29′ 36.24″ N, 0° 0′ 35.28″ E', '51.4934° N, 0.0098° E']})
clean_lat_long(df, 'coord', split=True)

/home/andrew/miniconda3/envs/AQUATIC/lib/python3.10/site-packages/dask/dataframe/core.py:6604: FutureWarning: Meta is not valid, `map_partitions` and `map_overlap` expects output to be a pandas object. Try passing a pandas object as meta or a dict or tuple representing the (name, dtype) of the columns. In the future the meta you passed will not work.
  warnings.warn(
Latitude and Longitude Cleaning Report:
        2 values cleaned (100.0%)
Result contains 2 (100.0%) values in the correct format and 0 null values (0.0%)
Out[5]:
                              coord  latitude  longitude
0  51° 29′ 36.24″ N, 0° 0′ 35.28″ E   51.4934     0.0098
1             51.4934° N, 0.0098° E   51.4934     0.0098

Do you know how to address this warning?
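One possible stopgap, sketched here with the standard library only, is to filter this specific FutureWarning around the call. This hides the message and does not change how dataprep passes meta to dask, so it is not a fix for the underlying cause. The names call_without_meta_warning and noisy_clean are hypothetical; noisy_clean merely stands in for a call such as clean_lat_long(df, 'coord', split=True).

```python
import warnings


def call_without_meta_warning(func, *args, **kwargs):
    """Hypothetical wrapper: run func while ignoring the 'Meta is not valid' FutureWarning."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore", message="Meta is not valid", category=FutureWarning
        )
        return func(*args, **kwargs)


def noisy_clean():
    """Stand-in for the real cleaning call; emits a warning with the same prefix."""
    warnings.warn("Meta is not valid, `map_partitions` expects ...", FutureWarning)
    return "cleaned"


print(call_without_meta_warning(noisy_clean))
```

The filter is restored when the with block exits, so other FutureWarnings elsewhere in the session remain visible.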