pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: `read_html` does not properly structure some html table elements (possible `rowspan` or `colspan` issues) #58461

Open jowens opened 4 months ago

jowens commented 4 months ago

Pandas version checks

Reproducible Example

#!/usr/bin/env python3

import pandas as pd
import re
import requests
from io import StringIO

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 500)

html = requests.get(
    "https://en.wikipedia.org/wiki/Template:AMD_Radeon_Pro_V_series"
).text

df = pd.read_html(
    StringIO(html),
    match=re.compile("Radeon Pro V620"),
)
print(df)

Issue Description

In the last two rows of the ingested table, entries near the right side land in the wrong columns. I checked the HTML source, and although it uses some complex rowspan and colspan directives, it appears to be properly constructed.

[Two screenshots: the read_html output, and the table as rendered on the Wikipedia page Template:AMD Radeon Pro V series]

I acknowledge that I'm using a slightly older pandas than the latest release, but I looked through recent issues on, and commits to, read_html and I don't believe this has been fixed or reported.

Expected Behavior

I expect the column called "Memory / L3 Cache" to only be populated in the last row.

I expect the two power entries in the last two rows to be placed in the "TDP" column.

Most of the right side of the bottom two rows is misplaced.

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.12.3.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : 3.0.10
pytest : 7.4.3
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0.dev0+gb7d2f7ffc.d20240415
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

samukweku commented 4 months ago

@jowens If you use another tool for the extraction other than pandas, do you get a different result?

attack68 commented 4 months ago

I didn't implement any of this and haven't checked the implementation, but my guess is that:

a) reading a grid-based table is straightforward.
b) accounting for rowspan or colspan separately is an extension to a) which is not too difficult.
c) accounting for simultaneous crossover of rowspan and colspan is really difficult and needs initial passes or very specific structuring. It probably isn't tested.

Can anyone confirm?
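
To make cases a) to c) concrete, here is a minimal sketch of one way to expand spans into a rectangular grid. It is not the pandas implementation (which has not been checked here). The function name `expand_spans` and the `(text, rowspan, colspan)` input format are assumptions made for illustration. Each cell claims every slot it covers, and later cells skip slots that are already claimed.

```python
from typing import List, Optional, Tuple

# Hypothetical input format: one (text, rowspan, colspan) tuple per cell.
Cell = Tuple[str, int, int]


def expand_spans(rows: List[List[Cell]]) -> List[List[Optional[str]]]:
    """Expand rowspan/colspan cells into a rectangular grid of text."""
    grid = {}  # (row, col) -> text for every slot already claimed
    n_rows = len(rows)
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Clip the rowspan so it cannot run past the last row.
            for dr in range(min(rowspan, n_rows - r)):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_cols = 1 + max((col for _, col in grid), default=-1)
    # Slots that no cell claimed are reported as None.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


rows = [
    [("A", 2, 1), ("B", 1, 2)],  # A spans two rows, B spans two columns
    [("C", 1, 1), ("D", 1, 1)],  # C must skip the slot A already occupies
]
for line in expand_spans(rows):
    print(line)
# ['A', 'B', 'B']
# ['A', 'C', 'D']
```

The sketch only checks whether the first slot of a cell is free, so a wide cell that runs into an already claimed slot simply overwrites it. How to resolve that kind of overlap is exactly where parsers could plausibly disagree.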

jowens commented 4 months ago

> If you use another tool for the extraction other than pandas, do you get a different result?

Suggestion for that other tool? I'm happy to try.

samukweku commented 4 months ago

@jowens A quick search on Google turns up this html-extractor; I haven't used it, though (caveat). I asked the earlier question to see whether there is a tool that gets this right, so that we can compare against it. It seems @attack68 has looked into your question more and may have identified the possible bug?

jowens commented 4 months ago

Just for posterity, here's the specific Wikipedia revision we're discussing here, in case it gets edited:

https://en.wikipedia.org/w/index.php?title=Template:AMD_Radeon_Pro_V_series&oldid=1220301074

and here's a gist where I extracted everything between <table> and </table>:

https://gist.github.com/jowens/8e42fa17a5af4bc16284cfab56ef1473

jowens commented 4 months ago

html_table_extractor has similar behavior (same errors). Here's a quick test:

https://gist.github.com/jowens/bd15b42accaa20e9c403af89719a5256

(which just has the table manually in the source code).

Here's the last line of the output, which corresponds to what's in the issue description.

['Radeon Pro V620(Navi\xa021)[10][11]\n', 'Nov 4, 2021\n', 'RDNA 2TSMC\xa0N7\n', '26.8×109520 mm2\n', '4608:288:128:7272 CU\n', '18252200\n', '525.6633.6\n', '233.6281.6\n', '33,63840,550\n', '16,81920,275\n', '1,0511,267\n', '32\n', '512\n', 'GDDR6256-bit\n', '128 MB\n', '16000\n', 'PCIe\xa04.0×16\n', '—\n', '300\xa0W\n']
jowens commented 4 months ago

FWIW I just tested with read_html's flavor="lxml" and flavor="bs4" and they both returned identical results.

attack68 commented 4 months ago

So is the summary here that all the tools you have tested for parsing this table, including pandas, return the same results, and that those results are all incorrect?

jowens commented 4 months ago

Well, two tools (pandas and html_table_extractor), and those two tools return consistent but incorrect results, where incorrect is compared to how a web browser renders it.

Since these two tools both (appear to) have different code that parses the table's cells / rowspans / colspans, it seems like a possibility that web browsers (I looked at Chrome/Firefox/Safari, each of which [I think] uses a different back end [Chromium/Gecko/WebKit]) might interpret the table differently than these two tools. Web browsers are surely more forgiving of HTML errors.
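
One way to probe this without pandas or a browser is to dump every cell's rowspan/colspan and check whether each row ends at the same column. The following is a rough sketch using only the standard library. The class `SpanCollector` and the function `row_widths` are hypothetical names, and the parser assumes explicit closing `</td>`/`</th>` tags, integer span attributes, and no nested tables. It has not been run against the Wikipedia table from this issue.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (text, rowspan, colspan) for every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            # A missing or empty span attribute is treated as 1.
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def row_widths(rows):
    """Return, for each row, the number of columns occupied after span expansion."""
    occupied = set()
    widths = []
    for r, row in enumerate(rows):
        c = 0
        for _text, rowspan, colspan in row:
            while (r, c) in occupied:  # slot taken by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
        widths.append(1 + max((col for rr, col in occupied if rr == r), default=-1))
    return widths


SAMPLE = """
<table>
<tr><th rowspan="2">A</th><th colspan="2">B</th></tr>
<tr><td>C</td><td>D</td></tr>
</table>
"""

collector = SpanCollector()
collector.feed(SAMPLE)
print(collector.rows)
print(row_widths(collector.rows))
```

If the widths differ between rows, the markup is ragged and tools may reasonably fill the gaps differently. If they all match and pandas still misplaces cells, that points more toward the expansion logic than toward the HTML.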