Closed Frost-Lee closed 5 years ago
Maybe, though what is the definition of an empty table in HTML? Should it also have a td element? Why is tr provided here?
If you have a simple way of making it work universally, we would certainly take a PR.
I don't think this is necessarily an issue...
Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'0.25.0.dev0+455.gd74901b1a'
>>> pd.read_html('<table><tbody><tr></tr></tbody></table>', match='.*')
[]
>>>
The default for match is to return tables with text containing a non-empty string...
match : str or compiled regular expression, optional
The set of tables containing text matching this regex or string will be
returned. Unless the HTML is extremely simple you will probably need to
pass a non-empty string here. Defaults to '.+' (match any non-empty
string). The default value will return all tables contained on a page.
This value is converted to a regular expression so that there is
consistent behavior between Beautiful Soup and lxml.
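To illustrate the regex side of this only, here is a minimal sketch using the standard re module. It assumes the text content of the sample table is the empty string, and it does not show how pandas applies the pattern internally; the variable names are made up for the example.

```python
import re

# Assumption: <table><tbody><tr></tr></tbody></table> has no text content,
# so the string the pattern is tested against is empty.
empty_table_text = ""

default_match = re.compile(".+")     # documented default: any non-empty string
permissive_match = re.compile(".*")  # also matches the empty string

default_hit = default_match.search(empty_table_text)
permissive_hit = permissive_match.search(empty_table_text)

print(default_hit is None)         # '.+' needs at least one character
print(permissive_hit is not None)  # '.*' accepts zero characters
```

This is consistent with the session above: with match='.*' the table is accepted by the pattern, and the result is an empty list because there is nothing to parse.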
read_html is not crashing, it is raising. This is more Pythonic than returning an empty list by default.
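For callers who would rather get an empty list than an exception, a small wrapper is one option. This is only a sketch: read_html_or_empty and its reader parameter are hypothetical names, not part of pandas, and it assumes the "no tables" failure surfaces as a ValueError.

```python
from io import StringIO


def read_html_or_empty(html, reader=None, **kwargs):
    """Return the parsed tables, or [] if the reader raises ValueError.

    Hypothetical helper; `reader` defaults to pandas.read_html.
    """
    if reader is None:
        # Imported lazily so the helper can also be exercised with a stub reader.
        import pandas as pd

        reader = pd.read_html
    try:
        # Wrap the markup in a file-like object before handing it to the reader.
        return reader(StringIO(html), **kwargs)
    except ValueError:
        # Caveat: this swallows every ValueError from the reader,
        # not only the "no tables found" case.
        return []
```

Keeping this in user code leaves the library default (raising) untouched, which matches the position taken in this thread.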
Thanks for the input @simonjayhawkins - I think this makes sense to close
Code Sample, a copy-pastable example if possible
Problem description
pd.read_html would crash if the input table is an empty table, as the sample code indicates. The crash output is as follows:
Intuitively, even if the table is empty, it is not supposed to crash pd.read_html; returning an empty data frame might be a better way.
Expected Output
Output of pd.show_versions()