Open thedatadoc opened 5 years ago
Do you have examples of general sites where this would be useful? Off the top of my head seems like a very niche application that you would want to roll your own parser for, but if there are examples out there that illustrate the utility would be helpful for consideration
I'll find a few examples. The specific data I am working with right now is non-public, but I've encountered this elsewhere on public sites.
I could roll my own parser, however, I really like the standard pandas functionality, and I want to minimize the risk of interpreting html differently while retrieving cells. E.g., colspan and rowspan, specifically. I'd like the same code to make those types of decisions and simply populate the table with the desired value.
OK thanks - if you can provide again would be helpful.
FWIW you might be able to override _text_getter
in a subclass to get what you want without interfering with the rest of the logic:
Agreed - already started down that path. I thought it might be a useful feature for others, so I thought I'd write an enhancement request. Having it in pandas is marginally better than maintaining that code myself, but I agree that it is not required for instances where tables are formatted as one would expect. It is particularly useful when the table output is dynamic (so labels or row positions aren't always predictable) and rows can contain different types of information.
For example, an HTML table row may contain summary statistics that do not share the same meaning as the remainder of the table. They are generally styled differently to highlight the fact that they are different. In a perfect world, this data ends up in their own table, but that doesn't seem to be the case all of the time.
In any case, I'll find some examples and include them here and the issue can be prioritized accordingly.
This article includes screenshots of tabular data represented like this: https://www.qresearchsoftware.com/market-research-guide-basic-data-analysis
(I'll still keep working to get live link examples.)
Cool thanks - looking forward to what you can find. Some other options include the match keyword and/or using something like the SoupStrainer with bs4 to remove what you don't want from intermixed tables
When one needs an example of complexity in something that could otherwise be ordered, one need only look at a .gov website :) Here's a pretty good example, I think.
https://www.bls.gov/news.release/empsit.t09.htm
They aren't using the class attribute, but they do use a "headers" attribute on their td elements. If I were parsing this, I couldn't rely entirely on the text in the "Characteristic" column to identify the meaning of the row. I'd need to group by the first value in the headers attribute. In this case, if I had this feature, I think the easiest way generate a meaningful multilevel index would be to join the values dataframe with the headers dataframe on the row index so that I could access that value easily and re-index the table.
Clearly, a clever person can solve the problem ... but having the feature included in pandas helps minimize risk.
do I understand you correctly, that you would like the option to retrieve a DataFrame of attribute values of the corresponding td
element?
So, if you'd do df = pd.read_html('https://www.bls.gov/news.release/empsit.t09.htm', get_attribute = 'headers')
you would get a data frame with frame cell values set to the "headers" attributes and if that attribute is not found, then set it to nan
?
I can see the usefulness of this, but have the same concerns that @WillAyd has, i.e. is it not general enough to handle enough types of situations. E.g. (in your example table) what if they set the interesting attribute on the span
or p
inner elements, what if an desired attribute is set on the tr
element, etc.
So, my concern is whether a possible solution could be made generalized enough, or conversely whether a sufficiently generalized solution would become so complex, that it would become a burden on Pandas to maintain it and it would be better to do it outside pandas.
It is true that just as we don't control the page author's table structure when they choose to mix dissimilar data in a single table's rows or columns, we don't control their use of other HTML within table cells, either. In the case you describe where there is a span
or p
, it would probably be better if there was some table cell handler function that could override _text_getter
altogether. To me, that's a different use case that seems more in line with the thinking behind the converters
in read_csv
, rather than simply specifying a single attribute. A converter-like parameter for read_html could cover both, however, if it could access the table cell element and all of its contents.
It seems the question here is whether or not the pandas maintainers intend for read_html
to be as supremely configurable as read_csv
via the use of method parameters. As @WillAyd mentions above, you can simply override via a subclass - but that is not as approachable for all users of pandas. I'm fine with it, but many of my colleagues whom I've introduced to pandas would struggle with that. They would likely choose to write hundreds of lines of difficult to maintain code.
I do find this functionality useful , as I said, as I've dealt with tables, where understanding of data depends on html attributes (typically class
or id
). I only doubt if this isn't too complex a problem to push into pandas itself, and maybe it would be better to have in a separate package.
In the csv cases, there's never ambiguity what the input cell values are and converters is "just" operation on that value. In html, if you want to get out the attributes of a table, the interesting values may be in tr
, th
, td
, or an attribue on the cell value itself (span
etc.) or other locations. How would you make read_html navigate to the desired attribute in a generalized-but-not-too-complex way?
In the .gov table you showed, could you demonstrate various options? E.g. if the attribute is set on the tr
, how do you point pandas to it, and in what data frame cell(s) does the attribute go into? I get a feeling such things get difficult to model.
This hasn't dropped off my radar - just a bit swamped at work. I'll get back to this before Monday (have some work travel next week).
@topper-123 I think you've convinced me that the attribute name alone is not sufficient for all instances. However, the idea of a converter-style handler would be much more appropriate. A method that overrides the default _text_getter
and accepts the same obj
that _text_getter
accepts would be sufficient to access the element, its text, any attribute on the element, or any of its parent elements and their attributes.
In the case of the .gov table, I don't think I'd need to traverse the DOM to a parent to get what I need. However if I needed to, I could.
I can implement this if you're interested in seeing an example of how I'd solve it.
_text_getter
manipulates the DOM element, right? So are you thinking of a e.g. a dom_converters
attribute, that would just take a callable, and/or perhaps a dict of callables, where the callable would take a DOM element for the relevant parser and call that inside _text_getter
? So would work like this?
>>> url = 'https://www.bls.gov/news.release/empsit.t09.htm'
>>> df = pd.read_html(url, dom_converters=lambda obj: obj.attrib.get('class')) # get classs attribute for every element in frame
>>> df = pd.read_html(url, dom_converters={'td': lambda obj: obj.attrib.get('class')}) # only work on td elements
I like that. Very general functionality and very simple for pandas to maintain.
@WillAyd ?
Yep, exactly.
OK I think a good idea @thedatadoc if you'd like to try a PR. Just a couple comments:
dom_converters
- what does conversion signify? Can we just say attribute
instead?td
elements so to as keep scope limited on first PR. Can always add support for Mappings in subsequent PRsGreat.
td
. This functionality would often be used to get groupings of rows/columns, so for example if a table header's values is | fruits | apples | bananas | meats | chicken | pig |
, we'd ideally want to pull out a attribute of the th
to find out if that column is a first-level item (fruits, meats) and not a second level item (apples, chicken etc.)I agree with your second statement but just think focusing on td to start would be simpler and easier rather than trying to tackle all use cases at once. Depends on implementation in any case
Sent from my iPhone
On Jun 8, 2019, at 8:12 AM, topper-123 notifications@github.com wrote:
Great.
I'm onboard with finding a better name. I'd think this should be more generally useful than just for td. This functionality would often be used to get groupings of rows/columns, so for example if a table header's values is | fruits | apples | bananas | meats | chicken | pig |, we'd ideally want to pull out a attribute of the th to find out if that column is a first-level item (fruits, meats) and not a second level item (apples, chicken etc.) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
I'll run with this after I get back from next week's work trip. I've not contributed to pandas yet, so I'll need to do a little reading on the right way to set up a dev environment that isolates my development from the package installed on my machine.
Thanks for the excellent dialogue. I do think it got us closer to a better solution.
I'll find a few examples. The specific data I am working with right now is non-public, but I've encountered this elsewhere on public sites.
I could roll my own parser, however, I really like the standard pandas functionality, and I want to minimize the risk of interpreting html differently while retrieving cells. E.g., colspan and rowspan, specifically. I'd like the same code to make those types of decisions and simply populate the table with the desired value.
I'm having a problem when trying to read non integer colspan using python 3.7.3. I think it was working in a previous version though.
Would this solve that issue?
Did anything ever come of this??? I've searched the whole internet looking for something like this, to no avail... is there even an alternative way of doing it??
I got stuck over this half of yesterday, and since this is one of the first page I found while googling for solution I ll post my solution here. Maybe it help someone in the future.
The problem: some pages have table where the targeted information is not textual, but rather in attribute such as
The current read_html() method is useful when parsing simple tables, however in practice, not every html table has been created with simple parsing in mind. It is often required that we identify the type or purpose of data in a table cell. HTML tables generally express information this via the assignment of a CSS class name to a th or td element.
I propose an optional read_html() string parameter called "get_attribute" that indicates if the parser should get the text of the cell or if it should get the value of an attribute. If get_attribute is None, behavior of read_html() is the same as its current functionality: th/td cell values will be read into the dataframe. If it is a string value, read_html() will retrieve attribute values rather than cell text. If a th or td element has no matching attribute, the value returned is None.
I would use this functionality as follows:
The dataframe myTableClasses would contain information that I could use to map cell styling to the values in myTableData. I imagine it could be useful for other types of information as well, conveyed via other attributes.