
Add context to return from pandas.io.html.read_html #4469

Closed: cancan101 closed this issue 10 years ago

cancan101 commented 11 years ago

Currently pandas.io.html.read_html returns a list of DataFrames. This offers no context as to where in the source HTML the table was found. For example, a user might be interested in the title or caption of the table.

For example, an SEC 10-Q filing (see: http://apps.shareholder.com/sec/viewerContent.aspx?companyid=GMCR&docid=9277772#A13-6685_110Q_HTM_UNAUDITEDCONSOLIDATEDSTATEMENTSOF_223103) contains many tables, and the user might be interested in one or more of them. The tables are not returned in a "standard" order, so without the titles provided in the HTML document, finding the desired table in the returned list is difficult.

My suggestion is to offer some way of linking the returned table to the context in which it was found.
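
For illustration, a minimal sketch of the current behavior (the URL is hypothetical; any page with several tables behaves the same way):

    import pandas as pd

    # read_html returns a bare list of DataFrames, one per <table> element,
    # with no record of each table's caption, title, or position in the page
    tables = pd.read_html("http://example.com/10q-filing.html")  # hypothetical URL
    print(len(tables))  # many tables, in no guaranteed "standard" order
    tables[7]           # which table is this? the source HTML context is gone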

cpcloud commented 11 years ago

I think this will probably have to wait for NDFrame metadata to be implemented, so that it (HTML metadata) could be done correctly.

jreback commented 11 years ago

you could return maybe a dict of frames where key is the 'name'?

cpcloud commented 11 years ago

I guess, but problems abound when there's no name... then it could fall back to an integer index.

cpcloud commented 11 years ago

or append to a list like {'table_title': df, None: [df1, df2, df3]}
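
Something like this hypothetical return shape (dummy frames just to make the sketch concrete):

    import pandas as pd

    df = pd.DataFrame({"Revenue": [1002.8]})
    df1 = df2 = df3 = df.copy()

    # hypothetical result: captioned tables keyed by their title,
    # uncaptioned tables collected in a list under None
    result = {"Statements of Operations": df, None: [df1, df2, df3]}
    named = result["Statements of Operations"]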

cancan101 commented 11 years ago

@jreback I was thinking about using a dict somehow. I am not sure what would be a meaningful "name" in this case or how I would provide it.

@cpcloud Is this what you mean by NDFrame: http://wesmckinney.com/blog/?p=77 ?

jreback commented 11 years ago

I guess you could use a 'name' regex? (and if it matches, use it, otherwise put them in order?)

cpcloud commented 11 years ago

@cancan101 No, sorry I should have been more clear. NDFrame is just the base class of (soon-to-be) Series, DataFrame and Panel

cpcloud commented 11 years ago

table elements can have titles and captions, I think... I don't remember what the preferred tag or attribute is, though; I have to check it out

cancan101 commented 11 years ago

I was asking: what would be the best way for me to provide a function whose job is to annotate the table that is found? i.e., a function that maps the <table> element to a name.

cpcloud commented 11 years ago

okie doke, turns out it's just <caption>, which holds the title
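
For example, pulling it out with lxml (a standalone sketch, not what read_html does today):

    from io import StringIO
    from lxml import etree

    html = """<table>
      <caption>Unaudited Consolidated Statements of Operations</caption>
      <tr><th>Net sales</th><td>1,002.8</td></tr>
    </table>"""

    tree = etree.parse(StringIO(html), etree.HTMLParser())
    for table in tree.getroot().findall(".//table"):
        caption = table.find("caption")
        print(caption.text if caption is not None else "<no caption>")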

cancan101 commented 11 years ago

As in, would that be a new parameter to pandas.io.html.read_html?

cpcloud commented 11 years ago

well, first you need to parse that information; currently that element is ignored...

@cancan101 yes that would be a new argument probably called caption='a regex to match strings in the caption'

cpcloud commented 11 years ago

@cancan101 btw you can save yourself a few keystrokes by just using pandas.read_html. no need to refer to the internal name ... this is a top level function :smile:

cancan101 commented 11 years ago

I think it would need to be a function rather than a regex.

cpcloud commented 11 years ago

why a function?

cancan101 commented 11 years ago

In the example provided, I believe that for some tables I would have to pull some element off the parent of the table element (that document does not make use of the <caption> element).

cpcloud commented 11 years ago

you could have the regex match the whole tree instead of just the caption element...what would a function do that re.search(regex, html_soup) wouldn't do?
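
i.e., something along these lines (the pattern and the document text are made up for the sketch):

    import re

    raw_html = "<p>Unaudited Consolidated Statements of Operations</p><table>...</table>"
    pattern = r"Statements of Operations"

    # search the whole document rather than just <caption> elements, since
    # here the "title" is ordinary text styled to look like one
    if re.search(pattern, raw_html):
        print("this page carries the title we're after")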

cancan101 commented 11 years ago

I am not quite sure I follow: would that regex be run over the HTML for the table, or over the entire document?

cpcloud commented 11 years ago

the entire document... the titles of the tables on your example page aren't semantically connected to the tables; they are just text styled to look like titles,

so there's no surefire way to tell that a table has a title without the caption tag.

this means that, in general, you'd have to look at the whole document if you want a title that doesn't follow the conventions of HTML that make it easier for everyone to use (I'm not ranting; this is in fact the state of most of the tables on the web)

cancan101 commented 11 years ago

I agree that in many cases, it will be a best effort to match up captions to tables (or more precisely, to find the caption for a given table).

The long term solution to issues like this for financial documents is the XBRL format.

That being said, in the specific case of the example I provided, they actually seem to use some custom tags that start with efx_ and wrap both the table and its title.
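
For instance, those wrappers could be found with a recovering lxml parse (the markup below imitates the filing's style; the exact tag name is illustrative):

    from io import StringIO
    from lxml import etree

    html = "<efx_tabletitle>Net sales</efx_tabletitle><table><tr><td>1,002.8</td></tr></table>"
    tree = etree.parse(StringIO(html), etree.HTMLParser(recover=True))

    # walk the recovered tree and report the custom efx_ elements
    for el in tree.iter():
        if isinstance(el.tag, str) and el.tag.startswith("efx_"):
            print(el.tag, el.text)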

cpcloud commented 11 years ago

custom tags as of now make HTML invalid, but see here

read_html is at the mercy of the underlying parser (bs4 in your case, because of the invalidity of the HTML)

so would you agree that re.search is the way to go?

cancan101 commented 11 years ago

lxml should require only valid XML, not valid HTML.

cpcloud commented 11 years ago

indeed it does; however, the function is called read_html, not read_xml...

cpcloud commented 11 years ago

in many cases it won't matter; in fact, if the HTML is syntactically invalid (it could still have custom tags, but, e.g., a tag that isn't closed is syntactically invalid), then it will be invalid XML as well... many tables are like this

cpcloud commented 11 years ago

most people (incorrectly, most of the time) use tables for styling and not for presenting data; there's a Google web crawler stats page somewhere with this information

cancan101 commented 11 years ago

That being said, the SEC documents tend to take a pretty regular form where tables mean tables, etc.

Following the code snippet on http://lxml.de/parsing.html#parsing-html and then calling tree.getroot().findall(".//table") does seem to succeed in finding the tables in the document.
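
Spelled out, that is roughly (the local file name is hypothetical):

    from io import StringIO
    from lxml import etree

    html_text = open("filing.html").read()  # hypothetical saved copy of the filing
    parser = etree.HTMLParser()             # note: recover=True is lxml's default
    tree = etree.parse(StringIO(html_text), parser)
    tables = tree.getroot().findall(".//table")
    print(len(tables))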

cpcloud commented 11 years ago

yes, a lot of times it will look like it does. A lot of my blood, sweat, and tears went into finding out why the heck it would seemingly drop elements... I found out later that it does weird things with invalid HTML; better to fail fast than to be incorrect, and better to be slow and correct.

Run that page through the w3 validator and see if it returns valid HTML. If so, then lxml is going to do the right thing; otherwise, it makes no guarantees, hence the strictness of the pandas implementation.

cancan101 commented 11 years ago

It is true that the validator shows the page as having errors: http://validator.w3.org/check?uri=http%3A%2F%2Fapps.shareholder.com%2Fsec%2FviewerContent.aspx%3Fcompanyid%3DGMCR%26docid%3D9277772&charset=%28detect+automatically%29&doctype=Inline&group=1&user-agent=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices

It also looks like, for lxml to parse the document, the HTMLParser must have recover=True (which is not what pandas sets).
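
Concretely, the difference is one constructor flag:

    from lxml import etree

    strict = etree.HTMLParser(recover=False)   # what pandas sets: fail on invalid markup
    lenient = etree.HTMLParser(recover=True)   # lxml's default: silently repair and carry on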

cancan101 commented 11 years ago

If I wanted to subclass _LxmlFrameParser and add it to _valid_parsers, it does not look like there is currently a way to do so.

cpcloud commented 11 years ago

Why can't you subclass it?

cancan101 commented 11 years ago

Sorry, I was a bit vague. I can subclass the parser, but then there is no exposed method to add the new parser as a "flavor" in the map of _valid_parsers.

cpcloud commented 11 years ago

You can place it there manually. What exactly are you trying to achieve? Do you just want it to be less strict? The pandas implementation is strict by design.

cancan101 commented 11 years ago

There are a couple of goals in subclassing:

1) To be less strict. I would like to use lxml if I can; I have had issues with bs4 doing bizarre things with some pages.

2) To add some logic to the parsing to deal with how table columns are used in the SEC documents. For whatever reason, when a number is negative and parentheses are used to represent this, the trailing parenthesis ends up in an adjacent cell. I would like to deal with this.

cpcloud commented 11 years ago

Pandas will only parse with lxml if lxml says the document is valid HTML. recover is set to False to make sure that a parse fails with invalid HTML, so that you get predictable behavior. lxml will not give you a valid HTML doc from an invalid one, unlike html5lib + bs4, which will do things like close unclosed tags for you.

cpcloud commented 11 years ago

I don't think using lxml if bs4 is giving strange results is going to help you, but maybe for some reason these data parse successfully with lxml. Just be careful with larger tables.

Re parens: it might be time to add a converters argument to read_html.
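
For the parens problem, a converter along these lines could be handed to such an argument, assuming it took a column-to-callable mapping like read_csv's converters does (the hook itself doesn't exist yet; the function is illustrative):

    import re

    def paren_negative(val):
        """Turn accounting-style '(1,234)' into -1234.0; pass everything else through."""
        s = str(val).strip().replace(",", "")
        m = re.fullmatch(r"\((\d+(?:\.\d+)?)\)", s)
        return -float(m.group(1)) if m else val

    # hypothetical usage once read_html grows the argument:
    # dfs = pd.read_html(url, converters={"Net sales": paren_negative})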

cancan101 commented 11 years ago

In this specific case I am using the HTML parser in a best-effort manner: if all else fails, I can either enter the data by hand or attempt to source it from elsewhere.

My issues with bs4 were actually on other pages.

cpcloud commented 11 years ago

@cancan101 If you want to subclass _LxmlFrameParser then you need to override _build_doc as that's where the parser will disallow invalid markup. Everything else should be okay after that. Let me know how it goes!

You should also do something like

pandas.io.html._valid_parsers['liberal_lxml'] = _LiberalLxmlFrameParser

Then you should be able to do

dfs = read_html(url, flavor='liberal_lxml')
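
Putting it together, the subclass might look roughly like this; the _build_doc body and the self.io attribute are guesses at the internals, so treat it as a sketch:

    from lxml import etree
    from pandas.io.html import _LxmlFrameParser

    class _LiberalLxmlFrameParser(_LxmlFrameParser):
        """Sketch of an lxml flavor that lets the parser recover from bad markup."""

        def _build_doc(self):
            # recover=True is the one behavioral change from the strict parent;
            # treating self.io as the document source is an assumption
            parser = etree.HTMLParser(recover=True)
            return etree.parse(self.io, parser).getroot()
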
ghost commented 10 years ago

@cpcloud, can we agree that @cancan101 is planning a dedicated library that will absolutely demolish read_html's feature-set in every way and call it a day?

cpcloud commented 10 years ago

@y-p I detect a hint of sarcasm there.

ghost commented 10 years ago

You try scanning through 600 issues in an afternoon and not becoming slightly snide. I actually was not being cynical: @cancan101 wants much more functionality than read_html has, and it makes sense for that to happen in his own project. I'm noting that there are/have been multiple suggestions in this vein from him, hence my remark.

Can we close?

cancan101 commented 10 years ago

The discussion in this issue got pretty broad and off-topic, but I am curious as to what @cpcloud meant earlier by "NDFrame metadata". Aside from the specific use case of annotating the results of HTML parsing, having metadata would be useful.

cpcloud commented 10 years ago

@y-p I was joking a bit, guess it was lost in translation :smile: In any event, I wholeheartedly agree with you, extra stuff for html should be in a separate project.