sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

clean_date() throws ValueError #749

Closed FabianPalmaPando closed 2 years ago

FabianPalmaPando commented 2 years ago

Running the following short snippet with clean_date() throws a ValueError:

import pandas as pd
from dataprep.clean import clean_date

df = pd.DataFrame({'date': ['1996.07.10 AD at 15:08:56 PDT Thu aug 30 10:36:28 2003', 'Thu Sep 25 10:36:28 BRST 2003', 'a', 1]})
clean = clean_date(df, 'date')

ValueError: invalid literal for int() with base 10: 'a'

qidanrui commented 2 years ago

Hi! Thank you for reporting this issue! I think it happens because the initial check doesn't consider single characters, which cannot be split. We will fix this bug. Thank you~

moreaupascal56 commented 2 years ago

Hello @qidanrui, I took some time to look into this. The error comes from the fact that "a" is in the AM flags: AM = ["AM", "am", "a"] (clean_date_utils).

So the error is not caused by 'a' being a single-character string: the function works if you try another one-character string ('b', for example). In that case the function def clean_date (link) returns unknown and clean_date() returns NaN.

Because "a" is in the AM flag list, the date is considered cleaned in def check_date (link). This results in an error in the ensure_pm function, because 'a' is a string and not an integer: the hms_tokens parameter is hms_tokens=['a'] in this case, which causes the error at line 784, parsed_data.set_hour(int(hms_tokens[0]) + offset).

By the way, you can reproduce the error by passing any AM or PM flag as the input date (for example, just "AM").
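To make the failure mode concrete, here is a small self-contained sketch of the mechanism described above. It is not dataprep's actual code: looks_like_date, set_hour_from_tokens and try_parse are hypothetical stand-ins, and only the AM list and the int(hms_tokens[0]) + offset expression are taken from this thread.

```python
# Simplified model of the behaviour described above; NOT dataprep's real code.
AM_FLAGS = ["AM", "am", "a"]  # the AM list quoted from clean_date_utils


def looks_like_date(text):
    """Hypothetical check: accept the string if any token is an AM flag."""
    return any(token in AM_FLAGS for token in text.split())


def set_hour_from_tokens(hms_tokens, offset=0):
    """Mimics the failing expression: int(hms_tokens[0]) + offset."""
    return int(hms_tokens[0]) + offset


def try_parse(text):
    """Return 'unknown', an hour, or the error message as a string."""
    if not looks_like_date(text):
        return "unknown"  # e.g. 'b' is rejected early
    try:
        return set_hour_from_tokens(text.split())
    except ValueError as exc:
        return f"ValueError: {exc}"


for value in ["a", "AM", "b"]:
    print(repr(value), "->", try_parse(value))
```

In this model 'a' and 'AM' pass the check and then fail in int(), while 'b' is rejected early, which matches the behaviour reported above.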

I have thought about some ideas for a fix, but I am not convinced by them: