pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.62k stars 17.58k forks source link

ENH: The parameter `date_format` in `read_csv` #58463

Open BdR76 opened 2 months ago

BdR76 commented 2 months ago

Feature Type

Problem Description

The read_csv has a parameter date_format which can can be "str or dict of columns", see documentation.

So for parsing date columns you can either use:

However, in practise a csv file with different datetime formats, usually it's just either a date or a datetime (or time) format. In other words the date format can differ a lot for different csv files, but usually it doesn't differ that much within one file. Theoretically there could be US and European date formats mixed in one csv file, but I work with a lot of csv data and I've never seen this. From my expericence this is a very uncommon use-case.

Feature Description

So for example, a csv file can have 10 date columns formatted like 01-05-2024 and 5 columns formatted like 05-05-2024 12:30. Reading such a csv file with read_csv with many datetime columns, just the str parameter is not sufficient but the dict parameter is a bit overkill because you have to explicitly set the format for each column when basically there are just two groups, so it's not very practical.

So my feature request is:

Can the read_csv be updated so that the date_format parameter also accepts just a list of dateformat strings for the date columns? So for example date_format=['%d-%m-%Y', '%d-%m-%Y %H:%M:%S']

Alternative Solutions

Alternatively, I think it could be practical for most typical use-cases to give groups of dateformats. So instead of having to supply a parameter for each individual column, like this:

date_formats = {'incdate': '%d-%m-%Y %H:%M:%S',
             'dob': '%d-%m-%Y',
             'visitdate': '%d-%m-%Y %H:%M:%S',
             'ther_startdat1': '%d-%m-%Y',
             'ther_startdat2': '%d-%m-%Y',
             'ther_startdat3': '%d-%m-%Y',
             'ther_startdat4': '%d-%m-%Y',
             'ther_startdat5': '%d-%m-%Y',
             'ther_uptake1': '%H:%M:%S',
             'ther_uptake2': '%H:%M:%S',
             'ther_uptake3': '%H:%M:%S',
             'ther_uptake4': '%H:%M:%S',
             'ther_uptake5': '%H:%M:%S'}

It could be changed so you have to supply groups like this, which is less code and more reflecting the actual situation:

date_formats = {'%d-%m-%Y %H:%M:%S': ['incdate', 'visitdate'],
             '%d-%m-%Y': ['dob', 'ther_startdat1', 'ther_startdat2', 'ther_startdat3', 'ther_startdat4', 'ther_startdat5'],
             '%H:%M:%S': ['dob', 'ther_uptake1', 'ther_uptake2', 'ther_uptake3', 'ther_uptake4', 'ther_uptake5']
}

Additional Context

See code examples below for typical csv files with date values (it is all randomly generated test data)

# Library
import io
import pandas as pd

# load data hard coded, for eaxmple
str_cardio = """TestDate;Protocol;CustomId;SubjectId;BirthDate;Gender;TestStage;Duration;AvgHeartRate;AvgLoad
22-4-2024 17:31:13;Long HIIT 55W 45 sec.;;75110500;26-1-1970;Male;Warmup;151;85,1;26,5
22-4-2024 17:31:13;Long HIIT 55W 45 sec.;;75110500;26-1-1970;Male;Training;1352;113,8;52,6
22-4-2024 17:31:13;Long HIIT 55W 45 sec.;;75110500;26-1-1970;Male;Recovery;256;86,2;5,4
19-4-2024 16:17:32;Long HIIT 130W 45 sec.;;85775845;19-4-2024;Male;Warmup;110;64,9;64,3
19-4-2024 16:17:32;Long HIIT 130W 45 sec.;;85775845;19-4-2024;Male;Training;2485;144,3;128,4
19-4-2024 16:17:32;Long HIIT 130W 45 sec.;;85775845;19-4-2024;Male;Recovery;265;101,0;12,8
17-4-2024 11:58:03;ONCO Interval 70- 90 (3.00-4.00);;23251624;12-11-1969;Female;Warmup;108;53,2;38,8
17-4-2024 11:58:03;ONCO Interval 70- 90 (3.00-4.00);;23251624;12-11-1969;Female;Training;948;115,0;77,6
17-4-2024 11:58:03;ONCO Interval 70- 90 (3.00-4.00);;23251624;12-11-1969;Female;Recovery;235;74,5;7,8
"""

# no date time hints, TestDate and BirthDate imported as string `object`
#df_cardio = pd.read_csv(io.StringIO(str_cardio), sep = ";", decimal=",")
#print(df_cardio.dtypes)

# datetime columns
col_dates = ['TestDate', 'BirthDate']
date_formats = {'TestDate': '%d-%m-%Y %H:%M:%S',
             'BirthDate': '%d-%m-%Y'}

# read csv file
#df_cardio = pd.read_csv(io.StringIO(str_cardio), sep=';', decimal=',', header=0, parse_dates=col_dates, date_format=date_formats)
#f_cardio = pd.read_csv(io.StringIO(str_cardio), sep=';', decimal=',', header=0, parse_dates=col_dates, date_format=['%d-%m-%Y %H:%M:%S', '%H:%M:%S']) # alternative?
df_cardio = pd.read_csv(io.StringIO(str_cardio), sep=';', decimal=',', header=0, parse_dates=col_dates, dayfirst=True)
print(df_cardio.dtypes)
print(df_cardio.TestDate)
print(df_cardio.BirthDate)

# load data hard coded, for eaxmple
str_therapy = """patid;incdate;dob;sex;visitdate;length;weight;thercode;ther_startdat1;ther_startdat2;ther_startdat3;ther_startdat4;ther_startdat5;ther_uptake1;ther_uptake2;ther_uptake3;ther_uptake4;ther_uptake5
6682;5-5-2023;16-09-1973;Male;5-5-2023 13:30:00;172;51.1;C680;12-6-2023;;;;;0:02:28;;;;
3785;30-3-2024;15-03-1994;Male;30-3-2024 11:15:00;176;;C370;25-4-2024;20-5-2024;;;;0:08:32;0:02:48;;;
9392;28-2-2024;26-01-1983;Male;28-2-2024 15:00:00;187;81.1;A930;1-4-2024;28-5-2024;20-6-2024;;;0:23:53;0:48:59;0:47:47;;
1430;29-7-2023;23-11-1976;Female;29-7-2023 17:15:00;170;61.2;C140;28-8-2023;;;;;0:11:23;;;;
2001;2-12-2022;24-11-1998;Male;2-12-2022 14:00:00;148;62;C200;14-1-2023;13-2-2023;;;;0:33:26;0:27:27;;;
1081;27-3-2023;23-09-1990;Male;27-3-2023 14:30:00;149;73.9;C180;1-4-2023;;;;;0:34:16;;;;
9886;20-2-2023;05-12-1992;Male;20-2-2023 11:30:00;160;83.7;A980;2-3-2023;10-4-2023;10-6-2023;;;0:12:43;0:05:12;0:23:19;;
7291;10-5-2023;01-07-1999;Male;10-5-2023 12:30:00;157;63.7;C720;4-6-2023;;;;;0:58:17;;;;
7851;19-8-2023;09-09-1987;Male;19-8-2023 09:15:00;178;92.9;C780;30-9-2023;13-10-2023;;;;0:02:01;0:38:51;;;
5878;11-4-2023;15-11-1961;Female;11-4-2023 15:00:00;154;81.9;C580;8-5-2023;;;;;0:46:43;;;;
"""

# no date time hints, TestDate and BirthDate imported as string `object`
#df_therapy = pd.read_csv(io.StringIO(str_therapy), sep = ";", decimal=",")
#print(df_therapy.dtypes)

# datetime columns
col_dates = ['incdate', 'dob', 'visitdate', 'ther_startdat1', 'ther_startdat2', 'ther_startdat3', 'ther_startdat4', 'ther_startdat5', 'ther_uptake1', 'ther_uptake2', 'ther_uptake3', 'ther_uptake4', 'ther_uptake5']
date_formats = {'incdate': '%d-%m-%Y %H:%M:%S',
             'dob': '%d-%m-%Y',
             'visitdate': '%d-%m-%Y %H:%M:%S',
             'ther_startdat1': '%d-%m-%Y',
             'ther_startdat2': '%d-%m-%Y',
             'ther_startdat3': '%d-%m-%Y',
             'ther_startdat4': '%d-%m-%Y',
             'ther_startdat5': '%d-%m-%Y',
             'ther_uptake1': '%H:%M:%S',
             'ther_uptake2': '%H:%M:%S',
             'ther_uptake3': '%H:%M:%S',
             'ther_uptake4': '%H:%M:%S',
             'ther_uptake5': '%H:%M:%S'}

# read csv file
df_therapy = pd.read_csv(io.StringIO(str_therapy), sep=';', decimal=',', header=0, parse_dates=col_dates, date_format=date_formats)
#df_therapy = pd.read_csv(io.StringIO(str_therapy), sep=';', decimal=',', header=0, parse_dates=col_dates, date_format=['%d-%m-%Y', '%d-%m-%Y %H:%M:%S', '%H:%M:%S']) # alternative?
#df_therapy = pd.read_csv(io.StringIO(str_therapy), sep=';', decimal=',', header=0, parse_dates=col_dates, dayfirst=True)
print(df_therapy.dtypes)

print(df_therapy.ther_startdat2)
print(df_therapy.ther_uptake2)
Aloqeely commented 2 months ago

I am ok with this idea, but I think this can be accomplished in multiple ways without affecting practicality, so I'm not sure if this is necessary.

As for the alternative idea, I'm not really a fan of it, that dict format is not very readable in my opinion, and with just a few lines of code you can convert it to the appropriate dict format.