pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

`read_html` crash when table is empty #26205

Closed Frost-Lee closed 5 years ago

Frost-Lee commented 5 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
foo = pd.read_html('<table><tbody><tr></tr></tbody></table>')    # Raises ValueError
print(foo)

Problem description

pd.read_html crashes when the input is an empty table, as the sample code shows. The output is as follows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-441-efc751c3c2c9> in <module>
----> 1 pd.read_html('<table><tbody><tr></tr></tbody></table>')

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1092                   decimal=decimal, converters=converters, na_values=na_values,
   1093                   keep_default_na=keep_default_na,
-> 1094                   displayed_only=displayed_only)

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    914             break
    915     else:
--> 916         raise_with_traceback(retained)
    917 
    918     ret = []

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
    418         if traceback == Ellipsis:
    419             _, _, traceback = sys.exc_info()
--> 420         raise exc.with_traceback(traceback)
    421 else:
    422     # this version of raise is a syntax error in Python 3

ValueError: No tables found matching pattern '.+'

Intuitively, an empty table should not crash pd.read_html; returning an empty DataFrame might be a better approach.

Expected Output

foo = pd.read_html('<table><tbody><tr></tr></tbody></table>')
print(foo)    # Empty dataframe

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
pandas: 0.24.2
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

WillAyd commented 5 years ago

Maybe, though what is the definition of an empty table in HTML? Should it also have a td element? Why is tr provided here?

If you have a simple way of making it work universally, we would certainly take a PR.

simonjayhawkins commented 5 years ago

I don't think this is necessarily an issue...

Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'0.25.0.dev0+455.gd74901b1a'
>>> pd.read_html('<table><tbody><tr></tr></tbody></table>', match='.*')
[]
>>>

The default for match returns only tables whose text contains a non-empty string:

    match : str or compiled regular expression, optional
        The set of tables containing text matching this regex or string will be
        returned. Unless the HTML is extremely simple you will probably need to
        pass a non-empty string here. Defaults to '.+' (match any non-empty
        string). The default value will return all tables contained on a page.
        This value is converted to a regular expression so that there is
        consistent behavior between Beautiful Soup and lxml.

read_html is not crashing; it is raising an exception. This is more Pythonic than returning an empty list by default.
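
The difference between the default '.+' and the '.*' used in the session above comes down to ordinary regex semantics, which can be sketched with the standard re module alone. This is only an illustration: the names below (DEFAULT_MATCH, PERMISSIVE_MATCH, has_match) are hypothetical and are not pandas internals, and pandas applies the pattern through its parser backends rather than calling re directly like this.

```python
import re

# Hypothetical names for illustration; not part of pandas.
DEFAULT_MATCH = re.compile(".+")     # requires at least one character
PERMISSIVE_MATCH = re.compile(".*")  # also matches the empty string


def has_match(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


empty_table_text = ""       # an empty <tr></tr> contributes no text
table_with_text = "a cell"  # assumed text from a non-empty table

print(has_match(DEFAULT_MATCH, empty_table_text))     # False
print(has_match(PERMISSIVE_MATCH, empty_table_text))  # True
print(has_match(DEFAULT_MATCH, table_with_text))      # True
```

Because '.+' needs at least one character, a table with no text has nothing for it to match, which is consistent with the "No tables found matching pattern '.+'" error in the report.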

WillAyd commented 5 years ago

Thanks for the input @simonjayhawkins - I think this makes sense to close