Open Fredrik-M opened 5 days ago
@Fredrik-M The issue you've encountered seems to relate to how pandas.read_html
handles HTML tables whose cells contain only whitespace. According to the read_html
documentation, the function should always return a list of DataFrame objects or fail.
You can use this snippet to test the behavior:
import pandas as pd
from io import StringIO
table = '<table><tr><td> </td></tr></table>'
res = pd.read_html(StringIO(table), flavor='lxml')
print(len(res))  # Should print 1 if the table is parsed
print(res[0])  # Should print the DataFrame containing the space character
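If you want a silently empty result to fail loudly while testing, a small guard like the one below can wrap the call. This is only a sketch: `require_tables` is a hypothetical helper name, not part of pandas, and the error message is an arbitrary choice for illustration.

```python
def require_tables(tables):
    """Return the list unchanged, or raise ValueError if it is empty."""
    if len(tables) == 0:
        raise ValueError("No tables found")
    return tables


# Intended use with pandas (not executed here):
#   res = require_tables(pd.read_html(StringIO(table), flavor="lxml"))

# Demonstrate the guard on an empty list, which is the case reported here.
try:
    require_tables([])
    outcome = "returned"
except ValueError as exc:
    outcome = f"raised: {exc}"

print(outcome)
```

With the guard in place, an empty list surfaces as an exception instead of passing through unnoticed.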
When processing an HTML table with a space character in a <td> tag, pandas.read_html should either:
Expected Behavior
According to the read_html docstring:
Hope this helps. Please comment if the issue persists. Thanks.
@Fredrik-M It seems that read_html has no skip_blank_lines flag, and the parser defaults set it to True. So when a cell contains only spaces, the row is skipped as a blank line, and the result is an empty list.
Moreover, the HTMLParser code that parses HTML data elements has a condition that strips whitespace from each line, so a string made up only of spaces is reduced to an empty string before being passed downstream.
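To make the mechanism described above concrete, here is a minimal standard-library sketch. It is not the pandas implementation: `CellCollector`, `parse_rows`, and the `strip_cells` / `skip_blank_rows` switches are hypothetical names that only mimic the "strip, then skip blank rows" behavior this comment describes.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate a cell without an enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(html, strip_cells=True, skip_blank_rows=True):
    """Return table rows, optionally stripping cells and dropping blank rows."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    rows = collector.rows
    if strip_cells:
        rows = [[cell.strip() for cell in row] for row in rows]
    if skip_blank_rows:
        # A row is blank when every cell is an empty string.
        rows = [row for row in rows if any(cell for cell in row)]
    return rows


table = "<table><tr><td> </td></tr></table>"
raw_rows = parse_rows(table, strip_cells=False, skip_blank_rows=False)
kept_rows = parse_rows(table)
print(raw_rows)   # the cell still holds the single space
print(kept_rows)  # the only row was stripped to "" and then dropped
```

Under these assumptions the single-space cell survives raw parsing but disappears once stripping and blank-row skipping are both applied, which matches the empty result reported in this issue.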
take
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
From the read_html docstring:
It has something to do with the space in the <td> tag in the example. Removing the space causes the function to fail instead.
Expected Behavior
The function should either fail, or return a list containing a DataFrame representing a 1x1 table (either empty or containing the space character in its only cell). I don't know which is more appropriate.
Installed Versions