read_html ignores paragraphs in table cells

sasan00 commented 5 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd

html = """
<html>
<body>
<table>
    <tr>
        <td>
            <p>Field 1</p>
            <p>Field 2</p>
        </td>
        <td>
            <p>Value 1</p>
            <p>Value 2</p>
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])

Problem description

In the current implementation, the p tags are ignored, and therefore it's not possible to infer that field 1 has value 1 and field 2 has value 2.

Expected Output

tables[0].iat[0, 0] == r'Field 1\nField 2'
tables[0].iat[0, 1] == r'Value 1\nValue 2'

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.3.0
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger commented 5 years ago

Can you write the exact expected output?

sasan00 commented 5 years ago

I have updated the issue with the requested information. Couldn't find a way to remove the "Needs Info" label.

TomAugspurger commented 5 years ago

Thanks. Can you check if the HTML parsing libraries (lxml, bs4) typically convert p tags to newlines? Do they provide options to do that?

sasan00 commented 5 years ago

That wouldn't help, as the example below shows:

import pandas as pd

html = """
<html>
<body>
<table>
    <tr>
        <td>
            Field 1

            Field 2
        </td>
        <td>
            Value 1

            Value 2
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])

Still returns "Field 1 Field 2"

TomAugspurger commented 5 years ago

I'm just wondering if our behavior matches the expected behavior of the underlying parsing libraries, and whether they have ways of dealing with it. Presumably they've had requests for similar features around whitespace normalization.

sasan00 commented 5 years ago

lxml preserves whitespace:

import pandas as pd
from lxml.etree import fromstring
from lxml.html import HTMLParser

html = """
<html>
<body>
<table>
    <tr>
        <td>
            Field 1

            Field 2
        </td>
        <td>
            Value 1

            Value 2
        </td>
    </tr>
</table>
</body>
</html>
"""

tables = pd.read_html(html)
print(tables[0].iat[0, 0])
parser = HTMLParser()
root = fromstring(html, parser)
for elem in root.iter('td'):
    print(repr(elem.text))

Result:

Field 1 Field 2
'\n Field 1\n \n Field 2\n '
'\n Value 1\n \n Value 2\n '

TomAugspurger commented 5 years ago

Thanks. Can you check if pandas explicitly strips / normalizes whitespace in read_html then? If so, this would be a good parameter to add to read_html.

sasan00 commented 5 years ago

Yes. _parse_raw_data calls _remove_whitespace on every cell of every row. It relies on the default value of the regex argument, _RE_WHITESPACE, which is re.compile(r'[\r\n]+|\s{2,}').

I think this whitespace "clean-up" (i.e., replacing each matched run with a single space character) should be optional.
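To make the effect of that pattern concrete, here is a minimal standalone sketch. It re-creates the regex quoted above and assumes the helper strips the cell text and substitutes a single space for each match; `remove_whitespace` is a local stand-in written for illustration, not an import from pandas.

```python
import re

# The pattern quoted in the comment above: runs of CR/LF, or two or more
# consecutive whitespace characters.
_RE_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def remove_whitespace(text, regex=_RE_WHITESPACE):
    """Local stand-in (assumption): strip the text, then replace each match with one space."""
    return regex.sub(" ", text.strip())


# Raw cell text similar to what lxml reports for the second example.
cell = "\n    Field 1\n\n    Field 2\n  "
cleaned = remove_whitespace(cell)

print(repr(cell))     # newlines still present in the raw text
print(repr(cleaned))  # newlines are gone after the substitution
```

Once the substitution has run, the line structure of the cell is no longer recoverable, which is why the two fields cannot be told apart in the resulting DataFrame.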

TomAugspurger commented 5 years ago

Thanks for investigating. I think an option to disable that behavior makes sense.

You've given two examples now, one with newlines in the text, and one with <p> tags. Do you expect to normalize the <p> tags to newlines, so that the two would give the same output? Do we have any prior art to copy here?

sasan00 commented 5 years ago

I think adding an extra argument as a function that takes the raw text of a cell, and returns the "cleaned up" version would work best. Its default value would be _remove_whitespace to ensure backwards compatibility.
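A rough sketch of what that design could look like, using only the standard library rather than pandas internals. Everything here is hypothetical: `read_cells`, its `cell_cleaner` parameter, and `_CellCollector` are names made up for illustration, and `read_html` has no such argument in the version discussed in this thread.

```python
import re
from html.parser import HTMLParser

_RE_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def collapse_whitespace(raw):
    """Default cleaner (assumption): mimics the whitespace collapsing described above."""
    return _RE_WHITESPACE.sub(" ", raw.strip())


class _CellCollector(HTMLParser):
    """Collects the raw text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._chunks = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._chunks is not None:
            self.rows[-1].append("".join(self._chunks))
            self._chunks = None

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)


def read_cells(html, cell_cleaner=collapse_whitespace):
    """Hypothetical reader: the caller decides how raw cell text is cleaned."""
    collector = _CellCollector()
    collector.feed(html)
    collector.close()
    return [[cell_cleaner(raw) for raw in row] for row in collector.rows]


def keep_lines(raw):
    """Alternative cleaner: trim each line but keep the line breaks."""
    return "\n".join(line.strip() for line in raw.splitlines() if line.strip())


html = (
    "<table><tr>"
    "<td>\n  Field 1\n\n  Field 2\n</td>"
    "<td>\n  Value 1\n\n  Value 2\n</td>"
    "</tr></table>"
)

print(read_cells(html))                           # default: newlines collapsed
print(read_cells(html, cell_cleaner=keep_lines))  # line breaks preserved
```

Keeping the collapsing function as the default means existing callers see no change, while anyone who needs the line structure can pass their own callable.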

markmbaum commented 3 years ago

Hi, wondering if this issue was ever resolved? In my case, I have a <ul> inside the HTML table and all the elements of each list are squished together after the table is parsed by read_html.
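For the `<p>` and `<ul>` cases, one possible approach is to turn the end of each block-level element inside a cell into a newline before the text is cleaned. The sketch below uses the standard-library `html.parser` and is not how `read_html` works; the class name, the `cell_texts` helper, and the choice of tags in `BLOCK_TAGS` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: closing any of these elements should end a line inside a cell.
BLOCK_TAGS = {"p", "li", "div"}


class CellTextWithBreaks(HTMLParser):
    """Collects cell text, inserting a newline after block-level children."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._chunks = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._chunks = []
        elif tag == "br" and self._chunks is not None:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if self._chunks is None:
            return
        if tag in ("td", "th"):
            # Trim every line and drop empty ones, keeping the line breaks.
            lines = (line.strip() for line in "".join(self._chunks).splitlines())
            self.cells.append("\n".join(line for line in lines if line))
            self._chunks = None
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)


def cell_texts(html):
    """Return the text of each table cell with block boundaries as newlines."""
    parser = CellTextWithBreaks()
    parser.feed(html)
    parser.close()
    return parser.cells


html = (
    "<table><tr>"
    "<td><p>Field 1</p><p>Field 2</p></td>"
    "<td><ul><li>a</li><li>b</li></ul></td>"
    "</tr></table>"
)
print(cell_texts(html))
```

With this treatment, the `<p>` example and the plain-newline example from earlier in the thread would produce the same cell text, which is one possible answer to the question about normalizing the two.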