pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

Enhancement request: read_html() parameter to control contents of DataFrame #26636

Open thedatadoc opened 5 years ago

thedatadoc commented 5 years ago

The current read_html() method is useful when parsing simple tables, however in practice, not every html table has been created with simple parsing in mind. It is often required that we identify the type or purpose of data in a table cell. HTML tables generally express information this via the assignment of a CSS class name to a th or td element.

I propose an optional read_html() string parameter called "get_attribute" that indicates if the parser should get the text of the cell or if it should get the value of an attribute. If get_attribute is None, behavior of read_html() is the same as its current functionality: th/td cell values will be read into the dataframe. If it is a string value, read_html() will retrieve attribute values rather than cell text. If a th or td element has no matching attribute, the value returned is None.

I would use this functionality as follows:

myTableData = pandas.read_html( table )[ 0 ]
myTableClasses = pandas.read_html( table, get_attribute = 'class' )[ 0 ]

The dataframe myTableClasses would contain information that I could use to map cell styling to the values in myTableData. I imagine it could be useful for other types of information as well, conveyed via other attributes.

WillAyd commented 5 years ago

Do you have examples of general sites where this would be useful? Off the top of my head seems like a very niche application that you would want to roll your own parser for, but if there are examples out there that illustrate the utility would be helpful for consideration

thedatadoc commented 5 years ago

I'll find a few examples. The specific data I am working with right now is non-public, but I've encountered this elsewhere on public sites.

I could roll my own parser, however, I really like the standard pandas functionality, and I want to minimize the risk of interpreting html differently while retrieving cells. E.g., colspan and rowspan, specifically. I'd like the same code to make those types of decisions and simply populate the table with the desired value.

WillAyd commented 5 years ago

OK thanks - if you can provide again would be helpful.

FWIW you might be able to override _text_getter in a subclass to get what you want without interfering with the rest of the logic:

https://github.com/pandas-dev/pandas/blob/8d124ea4c5200f218db7cea8e3ff504b0045a4e6/pandas/io/html.py#L143

thedatadoc commented 5 years ago

Agreed - already started down that path. I thought it might be a useful feature for others, so I thought I'd write an enhancement request. Having it in pandas is marginally better than maintaining that code myself, but I agree that it is not required for instances where tables are formatted as one would expect. It is particularly useful when the table output is dynamic (so labels or row positions aren't always predictable) and rows can contain different types of information.

For example, an HTML table row may contain summary statistics that do not share the same meaning as the remainder of the table. They are generally styled differently to highlight the fact that they are different. In a perfect world, this data ends up in their own table, but that doesn't seem to be the case all of the time.

In any case, I'll find some examples and include them here and the issue can be prioritized accordingly.

thedatadoc commented 5 years ago

This article includes screenshots of tabular data represented like this: https://www.qresearchsoftware.com/market-research-guide-basic-data-analysis

(I'll still keep working to get live link examples.)

WillAyd commented 5 years ago

Cool thanks - looking forward to what you can find. Some other options include the match keyword and/or using something like the SoupStrainer with bs4 to remove what you don't want from intermixed tables

thedatadoc commented 5 years ago

When one needs an example of complexity in something that could otherwise be ordered, one need only look at a .gov website :) Here's a pretty good example, I think.

https://www.bls.gov/news.release/empsit.t09.htm

They aren't using the class attribute, but they do use a "headers" attribute on their td elements. If I were parsing this, I couldn't rely entirely on the text in the "Characteristic" column to identify the meaning of the row. I'd need to group by the first value in the headers attribute. In this case, if I had this feature, I think the easiest way generate a meaningful multilevel index would be to join the values dataframe with the headers dataframe on the row index so that I could access that value easily and re-index the table.

Clearly, a clever person can solve the problem ... but having the feature included in pandas helps minimize risk.

topper-123 commented 5 years ago

do I understand you correctly, that you would like the option to retrieve a DataFrame of attribute values of the corresponding td element?

So, if you'd do df = pd.read_html('https://www.bls.gov/news.release/empsit.t09.htm', get_attribute = 'headers') you would get a data frame with frame cell values set to the "headers" attributes and if that attribute is not found, then set it to nan?

I can see the usefulness of this, but have the same concerns that @WillAyd has, i.e. is it not general enough to handle enough types of situations. E.g. (in your example table) what if they set the interesting attribute on the span or p inner elements, what if an desired attribute is set on the tr element, etc.

So, my concern is whether a possible solution could be made generalized enough, or conversely whether a sufficiently generalized solution would become so complex, that it would become a burden on Pandas to maintain it and it would be better to do it outside pandas.

thedatadoc commented 5 years ago

It is true that just as we don't control the page author's table structure when they choose to mix dissimilar data in a single table's rows or columns, we don't control their use of other HTML within table cells, either. In the case you describe where there is a span or p, it would probably be better if there was some table cell handler function that could override _text_getter altogether. To me, that's a different use case that seems more in line with the thinking behind the converters in read_csv, rather than simply specifying a single attribute. A converter-like parameter for read_html could cover both, however, if it could access the table cell element and all of its contents.

It seems the question here is whether or not the pandas maintainers intend for read_html to be as supremely configurable as read_csv via the use of method parameters. As @WillAyd mentions above, you can simply override via a subclass - but that is not as approachable for all users of pandas. I'm fine with it, but many of my colleagues whom I've introduced to pandas would struggle with that. They would likely choose to write hundreds of lines of difficult to maintain code.

topper-123 commented 5 years ago

I do find this functionality useful , as I said, as I've dealt with tables, where understanding of data depends on html attributes (typically class or id). I only doubt if this isn't too complex a problem to push into pandas itself, and maybe it would be better to have in a separate package.

In the csv cases, there's never ambiguity what the input cell values are and converters is "just" operation on that value. In html, if you want to get out the attributes of a table, the interesting values may be in tr, th, td, or an attribue on the cell value itself (span etc.) or other locations. How would you make read_html navigate to the desired attribute in a generalized-but-not-too-complex way?

In the .gov table you showed, could you demonstrate various options? E.g. if the attribute is set on the tr , how do you point pandas to it, and in what data frame cell(s) does the attribute go into? I get a feeling such things get difficult to model.

thedatadoc commented 5 years ago

This hasn't dropped off my radar - just a bit swamped at work. I'll get back to this before Monday (have some work travel next week).

thedatadoc commented 5 years ago

@topper-123 I think you've convinced me that the attribute name alone is not sufficient for all instances. However, the idea of a converter-style handler would be much more appropriate. A method that overrides the default _text_getter and accepts the same obj that _text_getter accepts would be sufficient to access the element, its text, any attribute on the element, or any of its parent elements and their attributes.

In the case of the .gov table, I don't think I'd need to traverse the DOM to a parent to get what I need. However if I needed to, I could.

I can implement this if you're interested in seeing an example of how I'd solve it.

topper-123 commented 5 years ago

_text_getter manipulates the DOM element, right? So are you thinking of a e.g. a dom_converters attribute, that would just take a callable, and/or perhaps a dict of callables, where the callable would take a DOM element for the relevant parser and call that inside _text_getter? So would work like this?

>>> url = 'https://www.bls.gov/news.release/empsit.t09.htm'
>>> df = pd.read_html(url, dom_converters=lambda obj:  obj.attrib.get('class'))  # get classs attribute for every element in frame
>>> df = pd.read_html(url, dom_converters={'td': lambda obj:  obj.attrib.get('class')})  # only work on td elements

I like that. Very general functionality and very simple for pandas to maintain.

@WillAyd ?

thedatadoc commented 5 years ago

Yep, exactly.

WillAyd commented 5 years ago

OK I think a good idea @thedatadoc if you'd like to try a PR. Just a couple comments:

topper-123 commented 5 years ago

Great.

WillAyd commented 5 years ago

I agree with your second statement but just think focusing on td to start would be simpler and easier rather than trying to tackle all use cases at once. Depends on implementation in any case

Sent from my iPhone

On Jun 8, 2019, at 8:12 AM, topper-123 notifications@github.com wrote:

Great.

I'm onboard with finding a better name. I'd think this should be more generally useful than just for td. This functionality would often be used to get groupings of rows/columns, so for example if a table header's values is | fruits | apples | bananas | meats | chicken | pig |, we'd ideally want to pull out a attribute of the th to find out if that column is a first-level item (fruits, meats) and not a second level item (apples, chicken etc.) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

thedatadoc commented 5 years ago

I'll run with this after I get back from next week's work trip. I've not contributed to pandas yet, so I'll need to do a little reading on the right way to set up a dev environment that isolates my development from the package installed on my machine.

Thanks for the excellent dialogue. I do think it got us closer to a better solution.

jonahbrennan commented 5 years ago

I'll find a few examples. The specific data I am working with right now is non-public, but I've encountered this elsewhere on public sites.

I could roll my own parser, however, I really like the standard pandas functionality, and I want to minimize the risk of interpreting html differently while retrieving cells. E.g., colspan and rowspan, specifically. I'd like the same code to make those types of decisions and simply populate the table with the desired value.

I'm having a problem when trying to read non integer colspan using python 3.7.3. I think it was working in a previous version though.

Would this solve that issue?

archiegoodman2 commented 3 years ago

Did anything ever come of this??? I've searched the whole internet looking for something like this, to no avail... is there even an alternative way of doing it??

Ematrion commented 4 weeks ago

I got stuck over this half of yesterday, and since this is one of the first page I found while googling for solution I ll post my solution here. Maybe it help someone in the future.

The problem: some pages have table where the targeted information is not textual, but rather in attribute such as

'class'='DATA OF INTEREST' ... **Example:** (A) link='https://lol.fandom.com/wiki/2024_Season_World_Championship' // match='Swiss Round Opponents' (B) link='https://liquipedia.net/counterstrike/PGL/2024/Copenhagen' // match='Matches' These pages contains game result for esport competition in League of Legend or counter-strike. The data to retrieve in each cell are: - the game outcome (indicated by a color background: green->win, red->lose) - the opponent (indicated by the team logo) The html code is not exactly the same in both cases, but the data where looking for are in 'class' and 'title' for case (A) **First Issue:** mainstage = pd.read_html(link, match=match)[0] (A) Find the targeted table, but loads it empty (B) Find the targeted table, loads it with contextual information such as 'map' or 'intermediate score' **None Solutions:** - dom_converters: does not seems to exist anymore (correct?) - converters: seems to be called after the cell was converted to text with _text_getter() (correct?). The targeted data is already 'lost'. **Working Solution:** ``` import pandas as pd import pandas.io.html as pdhtml import lxml source_cs = 'https://liquipedia.net/counterstrike/PGL/2024/Copenhagen' source_lol = 'https://lol.fandom.com/wiki/2024_Season_World_Championship' match_cs = 'Matches' match_lol = 'Swiss Round Opponents' # https://github.com/pandas-dev/pandas/blob/8d124ea4c5200f218db7cea8e3ff504b0045a4e6/pandas/io/html.py#L143 class LiquipediaParser(pdhtml._LxmlFrameParser): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) # https://github.com/pandas-dev/pandas/blob/8d124ea4c5200f218db7cea8e3ff504b0045a4e6/pandas/io/html.py#L143 def _text_getter(self, obj): try: team = obj.attrib['title'] result = obj.attrib['class'].split()[0].replace('swissresults-', '') if result == 'team': # Not a Game result cell (first column) return team elif result == 'teamvs': # Game is Scheduled but has not Been Played Yet return f'{team} - TBD' else: # here we get Opponent and the game result return f'{team} - {result}' except KeyError: # !!! Do not catch all possible Errors return super()._text_getter(obj) # https://github.com/pandas-dev/pandas/blob/8d124ea4c5200f218db7cea8e3ff504b0045a4e6/pandas/io/html.py#L143 pdhtml._valid_parsers['liqui'] = LiquipediaParser # Nice & easy way to get my data mainstage = pd.read_html(source_lol, flavor='liqui', match=match_lol)[0] print(mainstage) # compare to mainstage = pd.read_html(source_lol, match=match_lol)[0] print(mainstage) ``` **Enhancement request:** pd.read_html() seems a good working function has it does find properly the table and is capable to process it. Working on my own parsing tool to extract it would have worked, but would not have the general application desired. It is typical to find logo, meaningful collar background in cells of table. At least in sport data, but flag indicatating a language or an origin should be very frequent ? Giving access to the content before processing to text and giving the responsability to the user to convert it properly would be great. My case is not the most general one, as my cells are processed the same way. I do not need contextual information like the row/column.