pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.62k stars 17.57k forks

BUG: read_html returns empty list #59147

Open Fredrik-M opened 5 days ago

Fredrik-M commented 5 days ago

Reproducible Example

import pandas
from io import StringIO

table = '<table><tr><td> </td></tr></table>'
res = pandas.read_html(StringIO(table), flavor='lxml')
print(len(res))

Issue Description

From the read_html docstring:

This function will always return a list of :class:DataFrame or it will fail, i.e., it will not return an empty list.

The example above returns an empty list, which contradicts this. The cause appears to be the space inside the <td> element in the example: removing the space makes the function fail instead.

Expected Behavior

The function should either fail, or return a list containing a DataFrame representing a 1x1 table (either empty or containing the space character in its only cell). I don't know which is more appropriate.
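Until the behavior is settled, calling code can enforce the documented "list or failure" contract itself. The following is only a minimal sketch: `require_tables` is a hypothetical helper name, not part of pandas, and the error message is made up for illustration.

```python
def require_tables(tables):
    """Return `tables` unchanged, or raise if the list is empty.

    Hypothetical helper (not a pandas API): it turns the empty-list
    result described in this report into an explicit failure.
    """
    if len(tables) == 0:
        raise ValueError("read_html returned no tables")
    return tables
```

It would wrap the call from the example, e.g. `require_tables(pandas.read_html(StringIO(table), flavor='lxml'))`, so that downstream code never sees an empty list.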

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.9.19.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-30-amd64
Version : #1 SMP Debian 5.10.218-1 (2024-06-01)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.24.1
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.5.0
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : 2.0.30
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

Siddharth-Latthe-07 commented 5 days ago

@Fredrik-M The issue you've encountered seems to relate to how pandas.read_html handles HTML tables whose cells contain only whitespace. According to the read_html documentation, the function should always return a list of DataFrame objects or fail. You can use this snippet to test the behavior:

import pandas as pd
from io import StringIO

table = '<table><tr><td> </td></tr></table>'
res = pd.read_html(StringIO(table), flavor='lxml')
print(len(res))  # Expected: 1 (the report above gets an empty list, i.e. 0)
print(res[0])    # Expected: the DataFrame containing the space character; raises IndexError while res is empty

When processing an HTML table with a space character in a <td> tag, pandas.read_html should either:

  1. Return a list containing a DataFrame that represents a 1x1 table with the space character, or
  2. Fail gracefully with an appropriate error message.

Expected Behavior

According to the read_html docstring:

  1. The function should always return a list of DataFrame objects or fail.
  2. In this specific case, the function should return a list containing a DataFrame that represents a 1x1 table with the space character in its only cell.

Hope this helps. Please comment if the issue persists. Thanks.

ritwizsinha commented 4 days ago

@Fredrik-M It seems that read_html has no skip_blank_lines flag, and the parser defaults set it to True. So when a cell contains only spaces, the row is treated as a blank line and skipped, which is why an empty list comes back.

Moreover, the HTMLParser code that parses HTML data elements has a specific condition that strips whitespace from a line, so a string of spaces is reduced to an empty string and passed downstream.
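To make the suspected mechanism concrete, here is a small pure-Python sketch of the two steps described above: whitespace is stripped from each cell, then rows that end up blank are skipped. This is not the actual pandas code; `normalize_cell` and `drop_blank_rows` are hypothetical names used only for illustration.

```python
import re


def normalize_cell(text):
    # Collapse runs of whitespace and trim the ends, so " " becomes "".
    return re.sub(r"\s+", " ", text).strip()


def drop_blank_rows(rows):
    # Keep only rows that have at least one non-empty cell,
    # mimicking a "skip blank lines" default.
    return [row for row in rows if any(cell for cell in row)]


# The single row of the table from the report: one cell holding a space.
raw_rows = [[" "]]
cleaned = [[normalize_cell(cell) for cell in row] for row in raw_rows]
kept = drop_blank_rows(cleaned)

print(cleaned)  # [['']]
print(kept)     # []
```

With no rows left, there is nothing to build a DataFrame from, which would be consistent with the empty list in the report.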

ritwizsinha commented 4 days ago

take