Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
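import pandas as pd

# WARNING: the colspan of 1,000,000,000 makes read_html() allocate memory until the process runs out.
pd.read_html('<table><tr><td>a</td></tr><tr><td colspan="1000000000">b</td></tr></table>')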
Issue Description
I am parsing dataframes from varied untrusted HTML sources. Occasionally I'll encounter an HTML table that has a large colspan such as in the example, which causes pandas to allocate memory until the process OOMs.
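To make the memory growth concrete, here is a simplified model of span expansion. It assumes the parser materializes one text entry per spanned column, which is consistent with the behavior described in this report; it is a sketch, not pandas's actual code, and `expand_row` is a hypothetical name.

```python
def expand_row(cells):
    """Repeat each cell's text once per spanned column (simplified model)."""
    texts = []
    for text, colspan in cells:
        texts.extend([text] * colspan)
    return texts


# A small colspan is harmless: one cell becomes three entries.
print(expand_row([("b", 3)]))  # ['b', 'b', 'b']

# Rough lower bound for the reproducer: a Python list stores one 8-byte
# reference per element on a 64-bit CPython build, before any DataFrame
# is constructed from the expanded rows.
POINTER_BYTES = 8
print(1_000_000_000 * POINTER_BYTES / 1e9, "GB")  # 8.0 GB
```

Under that assumption, a single cell with `colspan="1000000000"` needs on the order of 8 GB for the row list alone.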
Related: https://github.com/pandas-dev/pandas/issues/17054
Expected Behavior
At the very least, it would be nice to have a column limit to prevent pandas from crashing the program on nonsensical HTML such as this.
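Until such a limit exists, a caller can approximate it by scanning the markup for oversized span attributes before calling read_html(). The following is a minimal sketch using only the standard library's `html.parser`; `SpanChecker`, `has_oversized_span`, and `MAX_SPAN` are hypothetical names, not part of pandas, and the sketch assumes the HTML is already available as a string.

```python
from html.parser import HTMLParser

MAX_SPAN = 1000  # hypothetical limit; choose whatever suits your data


class SpanChecker(HTMLParser):
    """Record the largest colspan/rowspan seen on <td>/<th> elements."""

    def __init__(self):
        super().__init__()
        self.max_span = 1

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag not in ("td", "th"):
            return
        for name, value in attrs:
            if name in ("colspan", "rowspan") and value:
                try:
                    self.max_span = max(self.max_span, int(value))
                except ValueError:
                    pass  # non-numeric span; ignored in this sketch


def has_oversized_span(html, limit=MAX_SPAN):
    """Return True if any cell declares a span larger than `limit`."""
    checker = SpanChecker()
    checker.feed(html)
    checker.close()
    return checker.max_span > limit


bad = '<table><tr><td>a</td></tr><tr><td colspan="1000000000">b</td></tr></table>'
ok = '<table><tr><td colspan="2">a</td></tr></table>'
print(has_oversized_span(bad))  # True
print(has_oversized_span(ok))   # False
```

Documents for which `has_oversized_span()` returns True can be skipped before they ever reach read_html().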
It's of course possible to parse the HTML first and look for colspan attributes that are too big, but this is expensive: the HTML gets parsed twice, because I can't pass a parsed tree into read_html(). It also requires deep knowledge of how pandas parses the HTML.
Installed Versions