pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: html import (pd.read_html) #51879

Closed Jancs-E closed 1 year ago

Jancs-E commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
df = pd.read_html(filename)

Issue Description

At some point, while importing HTML table data acquired from the same source as before, pandas suddenly rejected the file and failed with this error:

/usr/lib64/python3.9/site-packages/bs4/__init__.py:435: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  warnings.warn(
Traceback (most recent call last):
  File "/home/janis/Data/Elektreiba/NOMX-04.py", line 146, in <module>
    df=pd.read_html (jauni_dati)
  File "/usr/lib64/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib64/python3.9/site-packages/pandas/io/html.py", line 1205, in read_html
    return _parse(
  File "/usr/lib64/python3.9/site-packages/pandas/io/html.py", line 1006, in _parse
    raise retained
  File "/usr/lib64/python3.9/site-packages/pandas/io/html.py", line 986, in _parse
    tables = p.parse_tables()
  File "/usr/lib64/python3.9/site-packages/pandas/io/html.py", line 262, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "/usr/lib64/python3.9/site-packages/pandas/io/html.py", line 618, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found
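The MarkupResemblesLocatorWarning suggests that the string handed to Beautiful Soup was the file name itself rather than the file's contents, which can happen when the path does not point to an existing file. A minimal diagnostic sketch, using only the standard library, checks that the path exists and counts the `<table>` tags in it. The names `TableCounter` and `count_tables` are hypothetical helpers, not part of pandas, and the UTF-8 encoding is an assumption about the attached files.

```python
from html.parser import HTMLParser
from pathlib import Path


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase
        if tag == "table":
            self.tables += 1


def count_tables(path):
    """Return the number of <table> tags in the file at `path`.

    Raises FileNotFoundError when the path is not an existing file,
    the situation in which a path string may end up parsed as markup.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"not a file: {p}")
    counter = TableCounter()
    # Assumed encoding; undecodable bytes are replaced rather than fatal
    counter.feed(p.read_text(encoding="utf-8", errors="replace"))
    counter.close()
    return counter.tables
```

If the count is positive, opening the file and passing the handle to `pd.read_html`, as the warning itself recommends, rules out any ambiguity between a path and raw markup.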

Two consecutive files (originally misnamed as .xls), representing the data before and after the problem appeared, are attached. The first file, and all data before it, imported fine with no additional library needed. For the second, html5lib was requested with this message:

Traceback (most recent call last):
  File "/home/janis/Data/Elektreiba/NOMX-04.py", line 143, in <module>
    df=pd.read_html (jauni_dati)
  File "/usr/lib64/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib64/python3.9/site-packages/pandas/io/html.py", line 1205, in read_html
    return _parse(
  File "/usr/lib64/python3.9/site-packages/pandas/io/html.py", line 982, in _parse
    parser = _parser_dispatch(flav)
  File "/usr/lib64/python3.9/site-packages/pandas/io/html.py", line 931, in _parser_dispatch
    raise ImportError("html5lib not found, please install it")
ImportError: html5lib not found, please install it
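With the default `flavor=None`, `pd.read_html` tries lxml first and falls back to bs4 + html5lib if that fails, so an html5lib ImportError may only mean that the lxml attempt failed and the fallback parser was not importable at that moment. A small sketch for checking which of these optional packages can be imported in the current environment; `available_html_parsers` is a hypothetical helper name, not a pandas function.

```python
from importlib.util import find_spec


def available_html_parsers():
    """Report which optional HTML parser packages can be imported."""
    names = ("lxml", "bs4", "html5lib")
    # find_spec returns None for a top-level package that is not installed
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in available_html_parsers().items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

Passing `flavor="lxml"` to `pd.read_html` explicitly would surface the lxml parser's own error instead of the fallback's ImportError.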

Both files look pretty similar and both open the same way in Firefox and Excel.

Expected Behavior

The HTML table should be imported in both cases.

Attachment: example.zip

Installed Versions

INSTALLED VERSIONS

commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.9.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.80
Version : #1 SMP PREEMPT Sun Nov 27 13:28:05 CST 2022
machine : x86_64
processor : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
byteorder : little
LC_ALL : None
LANG : lv_LV.UTF-8
LOCALE : lv_LV.UTF-8

pandas : 1.5.3
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.0.1
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Jancs-E commented 1 year ago

I withdraw the bug report: everything suddenly started working again, without html5lib or any other intervention.