Open csala opened 2 years ago
I can confirm that the mentioned arguments work by simply adding them here: https://github.com/pandas-dev/pandas/blob/6122c7de128fce3a84d91ef91b9dc3a914531745/pandas/io/html.py#L1028
and then passing them down here: https://github.com/pandas-dev/pandas/blob/6122c7de128fce3a84d91ef91b9dc3a914531745/pandas/io/html.py#L1204
since they are then automatically included in the generic `**kwargs` dict here: https://github.com/pandas-dev/pandas/blob/6122c7de128fce3a84d91ef91b9dc3a914531745/pandas/io/html.py#L975
which is later passed down until the corresponding parser reads them.
I'd be happy to make a PR if this is an acceptable change.
Thanks for the request. `date_format` will be added as part of https://github.com/pandas-dev/pandas/pull/51019, but for the others the advice will likely remain to parse as object and convert to datetime after that.
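A minimal sketch of the "parse as object, then convert" advice. The DataFrame below is only a stand-in for one of the tables that `read_html` would return, and the column name and values are made up for illustration:

```python
import pandas as pd

# Stand-in for one table returned by pd.read_html(...); the column name
# and the values are hypothetical.
df = pd.DataFrame({"date": ["02/01/2023", "15/01/2023"], "value": [1, 2]})

# Convert the string column after loading. With dayfirst=True,
# "02/01/2023" is read as 2 January rather than 1 February.
df["date"] = pd.to_datetime(df["date"], dayfirst=True)
```

This is the extra postprocessing step that the issue would like to avoid by passing `dayfirst` at read time.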
@MarcoGorelli So after #51019, will the `read_csv` and `read_html` signatures be aligned, or will they continue to behave differently?
To be honest, I am not in favor of forcing multiple steps (read data first and then parse datetimes), but my concern was not so much about the date parsing step as about the fact that ingesting data via `read_csv` and via `read_html` requires different steps.
In any case, if this is not going to be addressed, please feel free to close this issue.
Feature Type
[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
The current `read_html` function exposes the same `parse_dates` argument that `read_csv` has, but it does not expose the rest of the arguments that let the user control how the dates are parsed (`infer_datetime_format`, `keep_date_col`, `date_parser`, `dayfirst`, `cache_dates`).
Other arguments unrelated to date parsing may be in the same situation, so maybe this issue could be extended to cover them all.
Feature Description
These arguments, or at least some of them, could easily be exposed directly in `read_html` without much hassle, which would be very convenient for the user.
Alternative Solutions
Right now the only viable solution is to skip date parsing altogether during the data loading step and then manually implement the date parsing over the returned data frame.
The problem with this is that it breaks API uniformity with `read_csv`: integrations with different input data sources have to be implemented differently depending on the data format (function call with arguments vs. function call with arguments + postprocessing), and any optimizations implemented in the `read_csv` workflow may be skipped.
Additional Context
From what I could tell skimming over the code, the `read_html` function only adds a few layers of code on top of the underlying parser, which already supports all the mentioned arguments, and `parse_dates` is simply pushed down to it untouched, letting the parser use the default values for all the other arguments in the list above.
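The layering described above can be illustrated with a small, self-contained sketch. None of these functions are pandas code: `_parse_table`, `read_table_current` and `read_table_proposed` are hypothetical names that only mimic the pattern of a thin wrapper forwarding keyword arguments to a parser that already supports them.

```python
from datetime import datetime


def _parse_table(rows, parse_dates=None, dayfirst=False):
    """Hypothetical low-level parser that already understands dayfirst."""
    fmt = "%d/%m/%Y" if dayfirst else "%m/%d/%Y"
    parse_dates = parse_dates or []
    return [
        {
            key: datetime.strptime(value, fmt) if key in parse_dates else value
            for key, value in row.items()
        }
        for row in rows
    ]


def read_table_current(rows, parse_dates=None):
    # Only parse_dates is forwarded, so dayfirst always keeps its default.
    return _parse_table(rows, parse_dates=parse_dates)


def read_table_proposed(rows, parse_dates=None, **kwargs):
    # Extra keyword arguments are forwarded untouched to the parser.
    return _parse_table(rows, parse_dates=parse_dates, **kwargs)


rows = [{"date": "02/01/2023", "value": "1"}]
current = read_table_current(rows, parse_dates=["date"])
proposed = read_table_proposed(rows, parse_dates=["date"], dayfirst=True)
```

In this sketch the wrapper needs no date-parsing logic of its own; accepting the extra arguments and passing them down is enough for the parser to honor them.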