Open jowens opened 4 months ago
@jowens If you use another tool for the extraction other than pandas, do you get a different result?
I didn't implement any of this and haven't checked the implementation, but my guess is:
a) reading a grid-based table is straightforward.
b) accounting for rowspan or colspan separately is an extension of a) and not too difficult.
c) accounting for simultaneous cross-over of rowspan and colspan is really difficult and needs initial passes or very specific structuring. It probably isn't tested.
Can anyone confirm?
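To make case c) concrete, here is a minimal sketch of one common way to expand rowspan and colspan into a rectangular grid: keep a map of occupied slots and skip any slot already claimed by a span from an earlier row. This is not pandas' actual implementation; the names `_CellCollector` and `expand_table` are made up for illustration, and the sketch assumes a single non-nested table whose span attributes are positive integers.

```python
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Collect [text, rowspan, colspan] for every cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:           # cell without an enclosing <tr>
                self.rows.append([])
            a = dict(attrs)
            # Assumption: span values are integers; anything below 1 is treated as 1.
            rowspan = max(1, int(a.get("rowspan") or 1))
            colspan = max(1, int(a.get("colspan") or 1))
            self._cell = ["", rowspan, colspan]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data


def expand_table(html):
    """Return a rectangular list of rows with spanned cells repeated."""
    parser = _CellCollector()
    parser.feed(html)
    parser.close()
    n_rows = len(parser.rows)
    occupied = {}  # (row, col) -> cell text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            # Clip the rowspan so it never runs past the last row.
            for dr in range(min(rowspan, n_rows - r)):
                for dc in range(colspan):
                    # On overlap, the earlier cell keeps the slot.
                    occupied.setdefault((r + dr, c + dc), text.strip())
            c += colspan
    if not occupied:
        return []
    n_cols = max(col for _, col in occupied) + 1
    return [[occupied.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]
```

Feeding it a small table in which a rowspan from one row crosses a colspan in the next, then comparing the grid against what a browser renders, is a quick way to see whether a given tool disagrees only on the crossing case.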
If you use another tool for the extraction other than pandas, do you get a different result?
Suggestion for that other tool? I'm happy to try.
@jowens A quick search on Google turns up this html-extractor. I haven't used it, though (caveat). I asked the earlier question to see whether there is a tool that gets this right, so we can compare against it. It seems @attack68 has looked into your question more and may have figured out the possible bug?
Just for posterity, here's the specific Wikipedia revision we're discussing here, in case it gets edited:
https://en.wikipedia.org/w/index.php?title=Template:AMD_Radeon_Pro_V_series&oldid=1220301074
and here's a gist where I extracted everything between <table> and </table>:
https://gist.github.com/jowens/8e42fa17a5af4bc16284cfab56ef1473
html_table_extractor has similar behavior (same errors). Here's a quick test:
https://gist.github.com/jowens/bd15b42accaa20e9c403af89719a5256
(which just has the table manually in the source code).
Here's the last line of the output, which corresponds to what's in the issue description.
['Radeon Pro V620(Navi\xa021)[10][11]\n', 'Nov 4, 2021\n', 'RDNA 2TSMC\xa0N7\n', '26.8×109520 mm2\n', '4608:288:128:7272 CU\n', '18252200\n', '525.6633.6\n', '233.6281.6\n', '33,63840,550\n', '16,81920,275\n', '1,0511,267\n', '32\n', '512\n', 'GDDR6256-bit\n', '128 MB\n', '16000\n', 'PCIe\xa04.0×16\n', '—\n', '300\xa0W\n']
FWIW I just tested with read_html's flavor="lxml" and flavor="bs4" and they both returned identical results.
So is the summary here that all tools you have tested for parsing this table, including pandas, return the same results, and that those results are all incorrect?
Well, two tools (pandas and html_table_extractor), and those two tools return consistent but incorrect results, where incorrect is compared to how a web browser renders it.
Since these two tools both (appear to) have different code that parses the table's cells / rowspans / colspans, it seems like a possibility that web browsers (I looked at Chrome/Firefox/Safari, each of which [I think] uses a different back end [Chromium/Gecko/WebKit]) might interpret the table differently than these two tools. Web browsers are surely more forgiving of HTML errors.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The bottom right of the ingested table puts entries in the wrong columns near the right side for the last two rows. I did some checking of the HTML source and even though it's got some complex rowspan and colspan directives, it appears to be properly constructed.
I acknowledge that I'm using a slightly older pandas than is installed, but I looked through recent issues on and checkins to read_html and I don't believe this is fixed/reported.
Expected Behavior
I expect the column called "Memory / L3 Cache" to only be populated in the last row.
I expect the two power entries in the last two rows to be placed in the "TDP" column.
Most of the right side of the bottom two rows is misplaced.
Installed Versions