I think this will probably have to wait for NDFrame metadata to be implemented, so that it (HTML metadata) could be done correctly.
maybe you could return a dict of frames where the key is the 'name'?
I guess, but problems abound when there's no name... then it could fall back to an integer index.
or append to a list
like {'table_title': df, None: [df1, df2, df3]}
@jreback I was thinking about using a dict somehow. I am not sure what would be a meaningful "name" in this case or how I would provide it.
@cpcloud Is this what you mean by NDFrame: http://wesmckinney.com/blog/?p=77 ?
I guess you could use a 'name' regex? (and if it matches, use it, otherwise put them in order?)
@cancan101 No, sorry I should have been more clear. NDFrame is just the base class of (soon-to-be) Series, DataFrame and Panel.
table elements can have titles and captions i think...i don't remember what the preferred tag attribute is tho i have to check it out
I was asking what the best way would be for me to provide a function whose job is to annotate the table that is found, i.e. a function that maps the table element to a name.
okie doke, turns out it's just <caption>, which is the title
As in, would that be a new parameter to pandas.io.html.read_html?
well first you need to parse that information, currently that element is ignored...
@cancan101 yes that would be a new argument probably called caption='a regex to match strings in the caption'
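e.g. something like this, hypothetically speaking (the caption keyword doesn't exist yet, it's only an idea at this point):

```python
import pandas as pd

# hypothetical usage of the proposed keyword; `caption` is not a real
# read_html argument today, it is only being suggested in this thread
dfs = pd.read_html('http://example.com/10-Q.html',
                   caption='Consolidated Statements')
```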
@cancan101 btw you can save yourself a few keystrokes by just using pandas.read_html. No need to refer to the internal name ... this is a top level function :smile:
I think it would need to be a function rather than a regex.
why a function?
In the example provided, I believe that for some tables I would have to pull some element off the parent of the table element (that document does not make use of the <caption> tag).
you could have the regex match the whole tree instead of just the caption element... what would a function do that re.search(regex, html_soup) wouldn't do?
I am not quite sure I follow: would that regex be run over the HTML for the table, or over the entire document?
the entire document... the titles of the tables on your example page aren't semantically connected to the table; they are just text styled to look like a title
so there's no surefire way to tell that a table has a title without the caption tag.
this means that, in general, you'd have to look at the whole document if you want a title that doesn't follow the conventions of html that make it easier for everyone to use (i'm not ranting, this is in fact the state of most of the tables on the web)
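so the re.search version would be roughly this (just a sketch; the title string and the local file name are placeholders):

```python
import re

# minimal sketch of the whole-document search: the "title" is just styled text
# somewhere in the markup, not attached to any <table> element
html = open('filing.html').read()  # assumes the page was saved locally
if re.search(r'CONSOLIDATED STATEMENTS OF OPERATIONS', html):
    print('the title text appears somewhere in the document')
```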
I agree that in many cases, it will be a best effort to match up captions to tables (or more precisely, to find the caption for a given table).
The long term solution to issues like this for financial documents is the XBRL format.
That being said in the specific case of the example I provided, they actually seem to use some custom tags that start with efx_ and wrap the table and the title.
custom tags as of now make HTML invalid, but see here
read_html is at the mercy of the underlying parser (bs4 in your case, because of the invalidity of the html), so would you agree that re.search is the way to go?
lxml should require only valid XML not valid HTML.
indeed it does, however the function is called read_html, not read_xml....
in many cases it won't matter, in fact if the HTML is syntactically invalid (could still have custom tags, but e.g., a tag that isn't closed is syntactically invalid) then it will be invalid XML as well...many tables are like this
most people (incorrectly, most of the time) use tables for styling and not for presenting data; there's a google web crawler stats page somewhere with this information
That being said, the SEC documents tend to take a pretty regular form where tables mean tables, etc.
Following the code snippet on http://lxml.de/parsing.html#parsing-html and then calling:
tree.getroot().findall(".//table")
does seem to succeed in finding the tables in the document.
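For reference, the full snippet is roughly the following (adapted from the lxml docs linked above; the local file name is just a placeholder):

```python
from io import StringIO
from lxml import etree

# adapted from http://lxml.de/parsing.html#parsing-html
html = open('filing.html').read()  # placeholder: the saved 10-Q page
tree = etree.parse(StringIO(html), etree.HTMLParser())
tables = tree.getroot().findall(".//table")
print(len(tables))
```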
yes. a lot of times it will look like it does. a lot of my blood, sweat and tears went into finding out why the heck it would seemingly drop elements... i found out later that it does weird things with invalid html. better to fail fast than to be incorrect, and better to be slow and correct
run that page thru the w3 validator and see if it returns valid HTML; if so, then lxml is going to do the right thing, otherwise it doesn't make any guarantees, hence the strictness of the pandas implementation
It is true that the validator shows the page as having errors: http://validator.w3.org/check?uri=http%3A%2F%2Fapps.shareholder.com%2Fsec%2FviewerContent.aspx%3Fcompanyid%3DGMCR%26docid%3D9277772&charset=%28detect+automatically%29&doctype=Inline&group=1&user-agent=W3C_Validator%2F1.3+http%3A%2F%2Fvalidator.w3.org%2Fservices
It also looks like, for lxml to parse the document, the HTMLParser must have recover=True (which is not what Pandas sets).
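A minimal illustration of the difference, assuming the page is saved locally (lxml's HTMLParser defaults to recover=True, which is why the snippet above works, whereas pandas passes recover=False):

```python
from io import StringIO
from lxml import etree

html = open('filing.html').read()  # placeholder for the page source

# lenient parse: recover=True (lxml's default) patches up the invalid markup
etree.parse(StringIO(html), etree.HTMLParser(recover=True))

# strict parse: recover=False (what pandas uses) fails on the same document
try:
    etree.parse(StringIO(html), etree.HTMLParser(recover=False))
except etree.XMLSyntaxError as exc:
    print('strict parse failed:', exc)
```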
If I wanted to subclass _LxmlFrameParser and add it to _valid_parsers, it does not look like there is currently a way to do so.
Why can't you subclass it?
Sorry, I was a bit vague. I can subclass the parser, but then there is no exposed method to add the new parser as a "flavor" in the map of _valid_parsers.
You can place it there manually. What exactly are you trying to achieve? Do you just want it to be less strict? The reason the pandas implementation is so strict is by design.
There are a couple of goals in sub-classing:
1) To be less strict. I would like to use lxml if I can; I have had issues with bs4 doing bizarre things with some pages.
2) To add some logic to the parsing to deal with how table columns are used in the SEC documents. For whatever reason, when a number is negative and parentheses are used to represent it, the trailing parenthesis ends up in an adjacent cell. I would like to deal with this.
Pandas will only parse with lxml if lxml says the document is valid HTML. recover is set to False to make sure that a parse fails with invalid HTML, so that you get predictable behavior. lxml will not give you a valid HTML doc from an invalid one, unlike html5lib + bs4, which will do things like close unclosed tags for you.
I don't think using lxml if bs4 is giving strange results is going to help you, but maybe for some reason these data parse successfully with lxml. Just be careful with larger tables.
Re parens: it might be time to add a converters argument to read_html.
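something along these lines, applied after the fact for now (a rough sketch; the converters keyword is only an idea at this point, and this doesn't fix the split-cell case where the trailing ')' lands in the next column):

```python
import pandas as pd

def paren_to_negative(val):
    # rough sketch: turn accounting-style "(1,234)" into -1234.0
    s = str(val).strip().replace(',', '')
    if s.startswith('(') and s.endswith(')'):
        s = '-' + s[1:-1]
    try:
        return float(s)
    except ValueError:
        return val

# today this has to be applied to the parsed frames; with a hypothetical
# `converters` argument it could be passed straight to read_html
dfs = [df.applymap(paren_to_negative) for df in pd.read_html('filing.html')]
```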
In this specific case I am using the html parser in a best-effort manner: if all else fails, I can either enter the data by hand or attempt to source it from elsewhere.
My issues with bs4 were actually on other pages.
@cancan101 If you want to subclass _LxmlFrameParser then you need to override _build_doc, as that's where the parser will disallow invalid markup. Everything else should be okay after that. Let me know how it goes!
You should also do something like
pandas.io.html._valid_parsers['liberal_lxml'] = _LiberalLxmlFrameParser
Then you should be able to do
dfs = read_html(url, flavor='liberal_lxml')
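Putting it together, a sketch might look like the following (it leans on private pandas internals, so the exact hooks may differ between versions; the assumption is that _build_doc reads its input from self.io and returns the document root):

```python
from lxml import etree
import pandas.io.html as pdhtml

class _LiberalLxmlFrameParser(pdhtml._LxmlFrameParser):
    """Sketch of a less strict lxml flavor (relies on private internals)."""

    def _build_doc(self):
        # assumed: the base parser stores its input (a url or path) on self.io
        # and expects the document root back; recover=True lets lxml cope with
        # the invalid markup instead of raising
        parser = etree.HTMLParser(recover=True)
        return etree.parse(self.io, parser).getroot()

pdhtml._valid_parsers['liberal_lxml'] = _LiberalLxmlFrameParser
dfs = pdhtml.read_html('http://example.com/10-Q.html', flavor='liberal_lxml')
```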
@cpcloud , can we agree that @cancan101 is planning a dedicated library that will absolutely demolish read_html's feature-set in every way and call it a day?
@y-p I detect a hint of sarcasm there.
You try scanning through 600 issues in an afternoon and not becoming slightly snide. I actually was not being cynical: @cancan101 wants much more functionality than read_html has and it makes sense for that to happen in his own project. I'm noting that there are/have been multiple suggestions in this vein from him, hence my remark.
Can we close?
The discussion in this issue got pretty broad and off topic, but I am curious as to what @cpcloud meant earlier by "NDFrame metadata". Aside from the specific use case for the results from HTML parsing, having metadata would be useful.
@y-p I was joking a bit, guess it was lost in translation :smile: In any event, I wholeheartedly agree with you, extra stuff for html should be in a separate project.
Currently pandas.io.html.read_html returns a list of DataFrames. This offers no context as to where in the source HTML each table was found; for example, a user might be interested in the title or caption of a table. In an SEC 10-Q filing (see for example: http://apps.shareholder.com/sec/viewerContent.aspx?companyid=GMCR&docid=9277772#A13-6685_110Q_HTM_UNAUDITEDCONSOLIDATEDSTATEMENTSOF_223103), there are many tables, and the user might be interested in one or more of them. The tables are not returned in any "standard" order, so without the title provided in the HTML document it is difficult to find the desired table in the returned list.
My suggestion is to offer some way of linking the returned table to the context in which it was found.
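For example, a return shape along the lines of the dict floated earlier in this thread would be enough context (purely hypothetical, not the current read_html API; the frames and caption text below are placeholders):

```python
import pandas as pd

# placeholder frames standing in for two parsed tables
df_income = pd.DataFrame({'item': ['Net sales'], 'amount': [1.0]})
df_other = pd.DataFrame({'item': ['Other'], 'amount': [2.0]})

# hypothetical return shape: key each table by its <caption>/title text when
# one can be found, and collect untitled tables under None
tables = {
    'UNAUDITED CONSOLIDATED STATEMENTS OF OPERATIONS': df_income,
    None: [df_other],
}
```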