jowens closed this issue 6 years ago
Thanks for the detailed issue! I think handling rowspan and colspan correctly would be a welcome enhancement; agreed that it could basically work like Excel, either tuple-izing or creating a MultiIndex. I'd appreciate a PR if you're interested.
duplicate of #14267, but I'll close that one.
@chris-b1 my last effort to provide a PR was pretty much a debacle, so I'm probably not your guy. That being said, since this does seem to be a topic of interest, a little guidance as to how it could be done would help either me or anyone else provide a PR (e.g.: "should probably start with this function"). I don't actually know if this is something that should be "fixed" in pandas or through pandas's setup of the underlying parser(s).
I haven't done anything with the read_html code, but my understanding is that it works like Excel, with 3 overall steps:
TextParser, which is generic logic that actually converts the data into a DataFrame
In this case, what most likely needs to be done is modifying step 2 in the presence of rowspan/colspan, adjusting the data. You can look to read_excel for inspiration, or at the simple example below; the key things are the padding of the data and the header keyword. (index_col works the same way for the index.)
In [8]: from pandas.io.parsers import TextParser

In [14]: df = TextParser([
    ...:     ['a', 'a', 'b'],
    ...:     ['sub1', 'sub2', 'sub2'],
    ...:     [1, 2, 3],
    ...:     [4, 5, 6],
    ...: ],
    ...: header=[0, 1]).read()

In [16]: df
Out[16]:
     a         b
  sub1 sub2 sub2
0    1    2    3
1    4    5    6

In [17]: df.columns
Out[17]:
MultiIndex(levels=[['a', 'b'], ['sub1', 'sub2']],
           labels=[[0, 0, 1], [0, 1, 1]])
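The "padding of data" step can be sketched without pandas at all. The helper below is a minimal illustration, not pandas code: expand_spans and the (text, rowspan, colspan) tuple layout are hypothetical names chosen for this example, and it assumes well-formed spans that never overlap.

```python
def expand_spans(rows):
    """Pad header rows so each row has one entry per table column.

    Each input cell is a (text, rowspan, colspan) tuple (a hypothetical
    layout for this sketch). The text is repeated across its span.
    """
    grid = []
    pending = {}  # column index -> (text, further rows this cell still covers)
    for row in rows:
        out = []
        next_pending = {}
        i = 0    # position in the cells actually present in this row
        col = 0  # position in the padded output row
        while i < len(row) or col in pending:
            if col in pending:
                # A cell from an earlier row reaches down into this column.
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    next_pending[col] = (text, left - 1)
                col += 1
                continue
            text, rowspan, colspan = row[i]
            i += 1
            for _ in range(colspan):
                out.append(text)
                if rowspan > 1:
                    next_pending[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
        pending = next_pending
    return grid


header = [
    [("a", 1, 2), ("b", 2, 1)],        # 'a' spans two columns, 'b' spans two rows
    [("sub1", 1, 1), ("sub2", 1, 1)],
]
padded = expand_spans(header)
print(padded)  # [['a', 'a', 'b'], ['sub1', 'sub2', 'b']]
```

The padded rows have the same rectangular shape as the list passed to TextParser above, so they could be placed in front of the data rows and read with header=[0, 1].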
FYI: All relevant logic appears to be in io/html.py in the function _HtmlFrameParser:_parse_raw_thead; it does not rely on the parser chosen.
... although there is no current capability for the parser to get attributes (e.g., rowspan, colspan) from elements, so that must be added. (There's currently a text_getter that returns a string; we need an analogous attrs_getter that returns a dict with keys=attributes, values=attribute_values.)
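A rough sketch of what such a pair of getters might look like is below. It uses xml.etree.ElementTree from the standard library only so that it is self-contained; the real pandas parsers wrap lxml or BeautifulSoup elements, and the function names and the span helper here are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def text_getter(elem):
    """Return the text content of an element as a string."""
    return "".join(elem.itertext()).strip()


def attrs_getter(elem):
    """Hypothetical counterpart: return the element's attributes as a dict."""
    return dict(elem.attrib)


def span(elem, name):
    """Read 'rowspan' or 'colspan' as an int, defaulting to 1.

    Missing, non-numeric, or non-positive values fall back to 1
    (a choice made for this sketch).
    """
    try:
        return max(int(attrs_getter(elem).get(name, 1)), 1)
    except ValueError:
        return 1


th = ET.fromstring('<th rowspan="2" colspan="3">Year</th>')
print(text_getter(th))      # Year
print(attrs_getter(th))     # {'rowspan': '2', 'colspan': '3'}
print(span(th, "colspan"))  # 3
```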
@chris-b1 would you mind eyeballing the following output for the 4 tables on this web page: https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html? These seem to me to be the right pieces to pass to TextParser (as long as I'm returning this from _parse_raw_thead, everything else ought to just work fine):
++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Year', u'Retired-worker families', u'Retired-worker families', u'Retired-worker families', u'Retired-worker families', u'Survivor families', u'Survivor families', u'Survivor families', u'Survivor families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families'], [u'Year', u'Worker only', u'Worker only', u'Worker only', u'Worker and wife\xa0a', u'Non-disabled widow only', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Worker only', u'Worker only', u'Worker only', u'Worker, wife,\xa0b and\u2014', u'Worker, wife,\xa0b and\u2014', u'Worker and spouse'], [u'Year', u'All', u'Men', u'Women', u'Worker and wife\xa0a', u'Non-disabled widow only', u'1\xa0child', u'2\xa0children', u'3 or more children', u'All', u'Men', u'Women', u'1\xa0child', u'2 or more children', u'Worker and spouse']]
++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Family group', u'Number (thousands)', u'Number (thousands)', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)'], [u'Family group', u'Families', u'Beneficiaries', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)']]
++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit\xa0a (dollars)', u'Retired worker only', u'Retired worker only', u'Retired worker and wife', u'Retired worker, wife, and\u2014', u'Retired worker, wife, and\u2014', u'Disabled worker only', u'Disabled worker only', u'Disabled worker, wife, and\u2014', u'Disabled worker, wife, and\u2014'], [u'Monthly family benefit\xa0a (dollars)', u'Men', u'Women', u'Retired worker and wife', u'1\xa0child', u'2 or more children', u'Men', u'Women', u'1\xa0child', u'2 or more children']]
++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit (dollars)', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Children only', u'Children only', u'Children only', u'Widow only', u'Widow only'], [u'Monthly family benefit (dollars)', u'1\xa0child', u'2\xa0children', u'3 or more children', u'1\xa0child', u'2\xa0children', u'3 or more children', u'Nondisabled', u'Disabled']]
Here's the current output (from trunk).
++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Year', u'Retired-worker families', u'Survivor families', u'Disabled-worker families'], [u'Worker only', u'Worker and wife\xa0a', u'Non-disabled widow only', u'Widowed mother or father and\u2014', u'Worker only', u'Worker, wife,\xa0b and\u2014', u'Worker and spouse'], [u'All', u'Men', u'Women', u'1\xa0child', u'2\xa0children', u'3 or more children', u'All', u'Men', u'Women', u'1\xa0child', u'2 or more children']]
++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Family group', u'Number (thousands)', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)'], [u'Families', u'Beneficiaries']]
++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit\xa0a (dollars)', u'Retired worker only', u'Retired worker and wife', u'Retired worker, wife, and\u2014', u'Disabled worker only', u'Disabled worker, wife, and\u2014'], [u'Men', u'Women', u'1\xa0child', u'2 or more children', u'Men', u'Women', u'1\xa0child', u'2 or more children']]
++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit (dollars)', u'Widowed mother or father and\u2014', u'Children only', u'Widow only'], [u'1\xa0child', u'2\xa0children', u'3 or more children', u'1\xa0child', u'2\xa0children', u'3 or more children', u'Nondisabled', u'Disabled']]
Yeah, at a quick glance that's looking good!
There is a larger structural problem with the code in that currently, the parsing is divided into three pieces: parse_thead, parse_tbody, and parse_tfoot, each of which has its own custom logic. My current code is focused in parse_thead, where I thought it would be most relevant. However, (a) rowspan and colspan certainly can appear in the body and foot and (b) Wikipedia tables don't have a <thead> at all and so everything gets dumped into the body. So I think (lacking the global knowledge needed for a big refactoring like this) that it might be better to have one chunk of code that does all parsing (hopefully in the generic parser code, not in the parser-specific code) and that special cases for header and footer might be in that one chunk of code. But this sort of refactoring is likely beyond what I could do well.
cc: some of the folks who have recently edited this file for comment/advice: @jreback @brianhuey @gte620v @jorisvandenbossche @hnykda @mjsu @cpcloud
(For posterity: A lot of the reason that I see there are different pieces for head, body, and foot is basically for flexibility on HTML tables: there might or might not be a head or foot, the body might or might not be declared with <tbody>, etc. (For Wikipedia tables, no <thead> but rows with <th> and not <td> means we should probably interpret those rows as header rows.) But there's no documentation, as far as I can tell, to say, basically, these are the different styles of tables that pandas supports. The conditionals in the parse routines aren't commented, so I'm just guessing at which different table behaviors they're handling. Hopefully the current test cases are comprehensive enough to cover 'em.)
xref discussion in #17073: it will be addressed when this issue gets resolved.
From #17074:
@chris-b1 or anyone else, help a brother out? Can you tell me what this test does? It's just expecting the parser to throw an error? The output from the test code (where it's failing) is at the bottom. It's a pretty weird HTML file.
Now, if I call it with my current in-progress code as dfs = pd.read_html('computer_sales_page.html', header=[0, 1]), I see:
Index([ (u'Unnamed: 0_level_0', u'Unnamed: 0_level_1'),
(u'Unnamed: 1_level_0', u'Unnamed: 1_level_1'),
(u'Three months ended April?30', u'2013'),
u'(u'Three months ended April\xa030', '2013').1',
(u'Three months ended April?30', u'Unnamed: 4_level_1'),
(u'Three months ended April?30', u'2012'),
u'(u'Three months ended April\xa030', '2012').1',
(u'Unnamed: 7_level_0', u'Unnamed: 7_level_1'),
(u'Six months ended April?30', u'2013'),
u'(u'Six months ended April\xa030', '2013').1',
(u'Six months ended April?30', u'Unnamed: 10_level_1'),
(u'Six months ended April?30', u'2012'),
u'(u'Six months ended April\xa030', '2012').1',
(u'Unnamed: 13_level_0', u'Unnamed: 13_level_1')],
dtype='object')
and if I call it without a header argument (dfs = pd.read_html('computer_sales_page.html')), I see:
Index([ (u'Unnamed: 0_level_0', u'Unnamed: 0_level_1', u'Unnamed: 0_level_2'),
(u'Unnamed: 1_level_0', u'Unnamed: 1_level_1', u'Unnamed: 1_level_2'),
(u'Three months ended April?30', u'2013', u'In millions'),
u'(u'Three months ended April\xa030', '2013', 'In millions').1',
(u'Three months ended April?30', u'Unnamed: 4_level_1', u'In millions'),
(u'Three months ended April?30', u'2012', u'In millions'),
u'(u'Three months ended April\xa030', '2012', 'In millions').1',
(u'Unnamed: 7_level_0', u'Unnamed: 7_level_1', u'In millions'),
(u'Six months ended April?30', u'2013', u'In millions'),
u'(u'Six months ended April\xa030', '2013', 'In millions').1',
(u'Six months ended April?30', u'Unnamed: 10_level_1', u'In millions'),
(u'Six months ended April?30', u'2012', u'In millions'),
u'(u'Six months ended April\xa030', '2012', 'In millions').1',
(u'Unnamed: 13_level_0', u'Unnamed: 13_level_1', u'Unnamed: 13_level_2')],
dtype='object')
These seem like OK outputs to me. I'm not sure what the original test is supposed to show. I think I'd like to just delete the test if it's supposed to fail (and no longer fails).
____________________ TestReadHtml.test_computer_sales_page _____________________

self = <pandas.tests.io.test_html.TestReadHtml object at 0x1120aa390>

    def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        with tm.assert_raises_regex(ParserError,
                                    r"Passed header=\[0,1\] are "
                                    r"too many rows for this "
                                    r"multi_index of columns"):
>           self.read_html(data, header=[0, 1])

pandas/tests/io/test_html.py:778:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.util.testing._AssertRaisesContextmanager object at 0x1120aab50>
exc_type = None, exc_value = None, trace_back = None

    def __exit__(self, exc_type, exc_value, trace_back):
        expected = self.exception

        if not exc_type:
            exp_name = getattr(expected, "__name__", str(expected))
>           raise AssertionError("{0} not raised.".format(exp_name))
E           AssertionError: ParserError not raised.

pandas/util/testing.py:2491: AssertionError
@jowens - can you open a PR with your WIP code? It's easier to answer these types of questions that way.
Code Sample, a copy-pastable example if possible
This has complex table headings:
read_html output begins with: (row 0 of the output is probably something one would have to manually eliminate)
Problem description
For HTML headings with rowspan and colspan attributes, read_html has undesirable behavior. Basically, read_html packs all heading <th> elements in any particular row to the left, so any particular column no longer has any association with the <th> elements that are actually above it in the HTML table.
Ample discussion here about the analogous pandas+Excel test case: https://github.com/pandas-dev/pandas/issues/4679
Relevant web discussions:
This may be an issue with the underlying parsers and cannot be solved well in pandas. This appears to be the behavior with both lxml and bs4/html5lib.
Expected Output
Each column should be associated with the <th> elements above it in the table. This might be a multi-row column name (as it is now) (a MultiIndex?) or a tuple (presumably if the argument tupleize_cols is set to True). Instead, currently, column n is associated with the nth <th> entry in the table row regardless of the settings of rowspan/colspan.
It may be that this is possible to do properly in current pandas, in which case I apologize for filing the issue (but I'd be happy to know how to do it).
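To make the difference concrete, here is a small pure-Python illustration using made-up header rows (not taken from a real page). It contrasts the column labels obtained when header cells are packed to the left with the labels obtained after the rows have been padded to account for colspan. The variable names are hypothetical.

```python
from itertools import zip_longest

# A header where 'a' has colspan=2 and 'b' covers the third column.
packed = [             # cells packed to the left, spans ignored
    ["a", "b"],
    ["sub1", "sub2", "sub2"],
]
padded = [             # the same header after repeating 'a' across its span
    ["a", "a", "b"],
    ["sub1", "sub2", "sub2"],
]

# One tuple per column: entry n pairs the n-th cell of every header row.
packed_labels = list(zip_longest(*packed))
padded_labels = list(zip(*padded))

print(packed_labels)  # [('a', 'sub1'), ('b', 'sub2'), (None, 'sub2')]
print(padded_labels)  # [('a', 'sub1'), ('a', 'sub2'), ('b', 'sub2')]
```

In the packed version, 'b' ends up over the second column and the third column has no top-level label. The padded tuples are in the form that pd.MultiIndex.from_tuples accepts, or they could be kept as plain tuple column names.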
Output of pd.show_versions()