ENH: read_html to handle rowspan, colspan

pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

https://pandas.pydata.org

BSD 3-Clause "New" or "Revised" License

43.82k stars 17.99k forks source link

ENH: read_html to handle rowspan, colspan #17054

Closed jowens closed 6 years ago

jowens commented 7 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.read_html('https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html')[0]

This has complex table headings:

annual_statistical_supplement__2015_-_beneficiary_families_with_oasdi_benefits_in_current-payment_status__5_h_

read_html output begins with:

                                                 Year                            Retired-worker families        Survivor families             Disabled-worker families                                Unnamed: 4_level_0                              Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 Unnamed: 10_level_0 Unnamed: 11_level_0  \
                                          Worker only                                  Worker and wife?a  Non-disabled widow only        Widowed mother or father and?                                       Worker only                            Worker, wife,?b and?  Worker and spouse Unnamed: 7_level_1 Unnamed: 8_level_1 Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1
                                                  All                                                Men                    Women                              1?child                                        2?children                              3 or more children                All                Men              Women            1?child  2 or more children Unnamed: 11_level_2
0                                                 NaN                                 Number?(thousands)                      NaN                                  NaN                                               NaN                                             NaN                NaN                NaN                NaN                NaN                 NaN                 NaN
1                                                1945                                                416                      338                                   78                                               181                                              95              86.00              48.00              24.00              .?.?.               .?.?.               .?.?.

(row 0 of the output is probably something one would have to manually eliminate)

Problem description

For HTML headings with rowspan and colspan elements, read_html has undesirable behavior. Basically read_html packs all heading <th> elements in any particular row to the left, so any particular column no longer has any association with the <th> elements that are actually above it in the HTML table.

Ample discussion here about the analogous pandas+Excel test case: https://github.com/pandas-dev/pandas/issues/4679

Relevant web discussions:

This may be an issue with the underlying parsers and cannot be solved well in pandas. This appears to be the behavior with both lxml and bs4/html5lib.

Expected Output

Each column should be associated with the <th> elements above it in the table. This might be a multi-row column name (as it is now) (a MultiIndex?) or a tuple (presumably if the argument tupleize_cols is set to True). Instead, currently, column n is associated with the n th <th> entry in the table row regardless of the settings of rowspan/colspan.

It may be this is possible to do properly in current pandas in which case I apologize for filing the issue (but I'd be happy to know how to do it).

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.US-ASCII LOCALE: None.None pandas: 0.20.3 pytest: None pip: 9.0.1 setuptools: 36.2.0 Cython: 0.26 numpy: 1.13.1 scipy: 0.19.1 xarray: None IPython: 5.3.0 sphinx: 1.6.3 patsy: None dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.4.2 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: None xlsxwriter: None lxml: 3.7.3 bs4: 4.5.3 html5lib: 1.0b10 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None

chris-b1 commented 7 years ago

Thanks for the detailed issue! I think handling rowspan and colspan correctly would be a welcome enhancement, agree it could basically work like Excel, either tuple-izing or creating a MultiIndex. Appreciate a PR if you're interested.

duplicate of #14267, but I'll close that one.

jowens commented 7 years ago

@chris-b1 my last effort to provide a PR was pretty much a debacle, so I'm probably not your guy. That being said, since this does seem to be a topic of interest, a little guidance as to how it could be done would help either me or anyone else provide a PR (e.g.: "should probably start with this function"). I don't actually know if this is something that should be "fixed" in pandas or through pandas's setup of the underlying parser(s).

chris-b1 commented 7 years ago

I haven't done anything with the read_html code, but my understanding is it works like excel, with 3 overall steps:

Read from external data source w/ third party package
(Bulk of logic) Convert that data into a list-of-lists data structure representing the rows
That list-of-lists is passed to TextParser which is generic logic that actually converts the data into a DataFrame

In this case, what most likely needs done is modifying step 2 in the presence of rowspan/colspan, adjusting the data. Can look to read_excel for inspiration, or a simple example below - key things are the padding of data and header keyword. (index_col works the same for index)

In [8]: from pandas.io.parsers import TextParser

In [14]: df = TextParser([
    ...:     ['a', 'a', 'b'],
    ...:     ['sub1', 'sub2', 'sub2'],
    ...:     [1, 2, 3],
    ...:     [4, 5, 6],
    ...:     ],
    ...:     header=[0, 1]).read()

In [16]: df
Out[16]: 
     a         b
  sub1 sub2 sub2
0    1    2    3
1    4    5    6

In [17]: df.columns
Out[17]: 
MultiIndex(levels=[['a', 'b'], ['sub1', 'sub2']],
           labels=[[0, 0, 1], [0, 1, 1]])

jowens commented 7 years ago

FYI: All relevant logic appears to be in io/html.py in the function _HtmlFrameParser:_parse_raw_thead; it does not rely on the parser chosen.

jowens commented 7 years ago

... although there is no current capability for the parser to get attributes (e.g., rowspan, colspan) from elements, so that must be added. (There's currently a text_getter that returns a string; we need an analogous attrs_getter that returns a dict with keys=attributes, values=attribute_values.)

jowens commented 7 years ago

@chris-b1 would you mind eyeballing the following output for the 4 tables on this web page: https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html? This seems to me to be the right pieces to pass to TextParser (as long as I'm returning this from _parse_raw_thead, everything else ought to just work fine):

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Year', u'Retired-worker families', u'Retired-worker families', u'Retired-worker families', u'Retired-worker families', u'Survivor families', u'Survivor families', u'Survivor families', u'Survivor families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families', u'Disabled-worker families'], [u'Year', u'Worker only', u'Worker only', u'Worker only', u'Worker and wife\xa0a', u'Non-disabled widow only', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Worker only', u'Worker only', u'Worker only', u'Worker, wife,\xa0b and\u2014', u'Worker, wife,\xa0b and\u2014', u'Worker and spouse'], [u'Year', u'All', u'Men', u'Women', u'Worker and wife\xa0a', u'Non-disabled widow only', u'1\xa0child', u'2\xa0children', u'3 or more children', u'All', u'Men', u'Women', u'1\xa0child', u'2 or more children', u'Worker and spouse']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Family group', u'Number (thousands)', u'Number (thousands)', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)'], [u'Family group', u'Families', u'Beneficiaries', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit\xa0a (dollars)', u'Retired worker only', u'Retired worker only', u'Retired worker and wife', u'Retired worker, wife, and\u2014', u'Retired worker, wife, and\u2014', u'Disabled worker only', u'Disabled worker only', u'Disabled worker, wife, and\u2014', u'Disabled worker, wife, and\u2014'], [u'Monthly family benefit\xa0a (dollars)', u'Men', u'Women', u'Retired worker and wife', u'1\xa0child', u'2 or more children', u'Men', u'Women', u'1\xa0child', u'2 or more children']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit (dollars)', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Widowed mother or father and\u2014', u'Children only', u'Children only', u'Children only', u'Widow only', u'Widow only'], [u'Monthly family benefit (dollars)', u'1\xa0child', u'2\xa0children', u'3 or more children', u'1\xa0child', u'2\xa0children', u'3 or more children', u'Nondisabled', u'Disabled']]

jowens commented 7 years ago

Here's the current output (from trunk).

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Year', u'Retired-worker families', u'Survivor families', u'Disabled-worker families'], [u'Worker only', u'Worker and wife\xa0a', u'Non-disabled widow only', u'Widowed mother or father and\u2014', u'Worker only', u'Worker, wife,\xa0b and\u2014', u'Worker and spouse'], [u'All', u'Men', u'Women', u'1\xa0child', u'2\xa0children', u'3 or more children', u'All', u'Men', u'Women', u'1\xa0child', u'2 or more children']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Family group', u'Number (thousands)', u'Average primary insurance amount (dollars)', u'Average monthly family benefit (dollars)'], [u'Families', u'Beneficiaries']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit\xa0a (dollars)', u'Retired worker only', u'Retired worker and wife', u'Retired worker, wife, and\u2014', u'Disabled worker only', u'Disabled worker, wife, and\u2014'], [u'Men', u'Women', u'1\xa0child', u'2 or more children', u'Men', u'Women', u'1\xa0child', u'2 or more children']]

++ Returning this from _HtmlFrameParser:_parse_raw_thead:
[[u'Monthly family benefit (dollars)', u'Widowed mother or father and\u2014', u'Children only', u'Widow only'], [u'1\xa0child', u'2\xa0children', u'3 or more children', u'1\xa0child', u'2\xa0children', u'3 or more children', u'Nondisabled', u'Disabled']]

chris-b1 commented 7 years ago

Yeah, at a quick glance that's looking good!

jowens commented 7 years ago

There is a larger structural problem with the code in that currently, the parsing is divided into three pieces—parse_thead, parse_tbody, and parse_tfoot, each of which has its own custom logic. My current code is focused in parse_thead, where I thought it would be most relevant. However, (a) rowspan and colspan certainly can appear in the body and foot and (b) Wikipedia tables don't have a <thead> at all and so everything gets dumped into the body. So I think—lacking global knowledge about doing a big refactoring like this—that it might be better to have one chunk of code that does all parsing (hopefully in the generic parser code, not in the parser-specific parser code) and that special cases for header and footer might be in that one chunk of code. But this sort of refactoring is likely beyond what I could do well.

cc: some of the folks who have recently edited this file for comment/advice: @jreback @brianhuey @gte620v @jorisvandenbossche @hnykda @mjsu @cpcloud

jowens commented 7 years ago

(For posterity: A lot of the reason that I see there's different pieces for head, body, and foot is basically for flexibility on HTML tables: there might or might not be a head or foot, the body might or might not be declared with <tbody>, etc. (For Wikipedia tables, no <thead> but rows with <th> and not <td> means we should probably interpret those rows as header rows.) But, there's no documentation as far as I can tell to say, basically, these are the different styles of tables that pandas supports. The conditionals in the parse routines aren't commented so I'm just guessing on which different table behaviors they're handling. Hopefully the current test cases are comprehensive enough to cover 'em.)

gfyoung commented 7 years ago

xref discussion in #17073 : it will be addressed when this issue gets resolved.

gfyoung commented 7 years ago

From #17074:

@chris-b1 or anyone else, help a brother out? Can you tell me what this test does? It's just expecting the parser to throw an error? The output from the test code (where it's failing) is at the bottom. It's a pretty weird HTML file.

Now, if I call it with my current in-progress code as dfs = pd.read_html('computer_sales_page.html', header=[0, 1]), I see:

Index([         (u'Unnamed: 0_level_0', u'Unnamed: 0_level_1'),
                (u'Unnamed: 1_level_0', u'Unnamed: 1_level_1'),
                     (u'Three months ended April?30', u'2013'),
              u'(u'Three months ended April\xa030', '2013').1',
       (u'Three months ended April?30', u'Unnamed: 4_level_1'),
                     (u'Three months ended April?30', u'2012'),
              u'(u'Three months ended April\xa030', '2012').1',
                (u'Unnamed: 7_level_0', u'Unnamed: 7_level_1'),
                       (u'Six months ended April?30', u'2013'),
                u'(u'Six months ended April\xa030', '2013').1',
        (u'Six months ended April?30', u'Unnamed: 10_level_1'),
                       (u'Six months ended April?30', u'2012'),
                u'(u'Six months ended April\xa030', '2012').1',
              (u'Unnamed: 13_level_0', u'Unnamed: 13_level_1')],
      dtype='object')

and if I call it without a header argument (dfs = pd.read_html('computer_sales_page.html')), I see:

Index([   (u'Unnamed: 0_level_0', u'Unnamed: 0_level_1', u'Unnamed: 0_level_2'),
          (u'Unnamed: 1_level_0', u'Unnamed: 1_level_1', u'Unnamed: 1_level_2'),
                      (u'Three months ended April?30', u'2013', u'In millions'),
                u'(u'Three months ended April\xa030', '2013', 'In millions').1',
        (u'Three months ended April?30', u'Unnamed: 4_level_1', u'In millions'),
                      (u'Three months ended April?30', u'2012', u'In millions'),
                u'(u'Three months ended April\xa030', '2012', 'In millions').1',
                 (u'Unnamed: 7_level_0', u'Unnamed: 7_level_1', u'In millions'),
                        (u'Six months ended April?30', u'2013', u'In millions'),
                  u'(u'Six months ended April\xa030', '2013', 'In millions').1',
         (u'Six months ended April?30', u'Unnamed: 10_level_1', u'In millions'),
                        (u'Six months ended April?30', u'2012', u'In millions'),
                  u'(u'Six months ended April\xa030', '2012', 'In millions').1',
       (u'Unnamed: 13_level_0', u'Unnamed: 13_level_1', u'Unnamed: 13_level_2')],
      dtype='object')

These seem like OK outputs to me. I'm not sure what the original test is supposed to show. I think I'd like to just delete the test if it's supposed to fail (and no longer fails).

____________________ TestReadHtml.test_computer_sales_page _____________________

self = <pandas.tests.io.test_html.TestReadHtml object at 0x1120aa390>

    def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        with tm.assert_raises_regex(ParserError,
                                    r"Passed header=\[0,1\] are "
                                    r"too many rows for this "
                                    r"multi_index of columns"):
>           self.read_html(data, header=[0, 1])

pandas/tests/io/test_html.py:778:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.util.testing._AssertRaisesContextmanager object at 0x1120aab50>
exc_type = None, exc_value = None, trace_back = None

    def __exit__(self, exc_type, exc_value, trace_back):
        expected = self.exception

        if not exc_type:
            exp_name = getattr(expected, "__name__", str(expected))
>           raise AssertionError("{0} not raised.".format(exp_name))
E           AssertionError: ParserError not raised.

pandas/util/testing.py:2491: AssertionError

chris-b1 commented 7 years ago

@jowens - can you open a PR with your WIP code? Easier to answer these type of questions that way.

pandas-dev / pandas

ENH: read_html to handle rowspan, colspan #17054

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

Output of `pd.show_versions()`