pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: 1.5.0rc0 extract_links modifies type first item in tuple #48355

Open 1Mark opened 2 years ago

1Mark commented 2 years ago

Reproducible Example

# Shell setup (run before the script):
#   pip install requests
#   pip install --upgrade --pre pandas==1.5.0rc0
import requests
import pandas as pd

r = requests.get("https://elcinema.com/en/index/work/country/eg?page=1")

# Default parsing: cell text goes through the usual type inference.
df_one = pd.read_html(r.text)[0]
row_a = df_one["Release Year"].iloc[1]
print(row_a)  # `2023`

# With extract_links, each body cell becomes a (text, link) tuple.
df_two = pd.read_html(r.text, extract_links="body")[0]
row_b = df_two["Release Year"].iloc[1]
print(row_b)  # `('2023', None)`

Issue Description

The first print statement returns the numeric value 2023. The second print statement returns ('2023', None): the first item in the tuple is a string, so it has the wrong type.

Passing extract_links changes how the type is determined.

Expected Behavior

The first item in the tuple should have the same type it would have if extract_links were not passed.

In this case it should return (2023, None),

i.e.

assert row_a == row_b[0]

should hold, but it doesn't.
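In the meantime, one possible user-side workaround is to split each tuple cell into a text column and a link column, then re-run numeric inference on the text. This is only a minimal sketch: the helper name `split_links`, the `_link` column suffix, and the sample rows are hypothetical and not part of pandas, and it assumes string column labels as produced by extract_links="body".

```python
import pandas as pd


def split_links(df):
    """Split (text, link) tuple cells into '<col>' and '<col>_link' columns.

    Hypothetical helper, not a pandas API.
    """
    out = {}
    for col in df.columns:
        # Cells that are not tuples are treated as plain text with no link.
        text = df[col].map(lambda c: c[0] if isinstance(c, tuple) else c)
        link = df[col].map(lambda c: c[1] if isinstance(c, tuple) else None)
        try:
            # Re-infer a numeric dtype where every value parses as a number.
            text = pd.to_numeric(text)
        except (ValueError, TypeError):
            pass  # leave non-numeric text columns untouched
        out[col] = text
        out[f"{col}_link"] = link
    return pd.DataFrame(out, index=df.index)


# Made-up rows shaped like the extract_links="body" output in this report.
df_two = pd.DataFrame(
    {
        "Title": [("Already Happened", "/en/work/2075121/"), ("Another Title", None)],
        "Release Year": [("2023", None), ("2022", None)],
    }
)
clean = split_links(df_two)
```

With this, `clean["Release Year"]` is numeric again and the links live in the separate `Title_link` and `Release Year_link` columns.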

Installed Versions

INSTALLED VERSIONS
------------------
commit : 224458ee25d92ccdf289d1ae2741d178df4f323e
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0rc0
numpy : 1.23.2
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 44.0.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

mroeschke commented 2 years ago

Thanks for the report. This may be tricky to address.

With extract_links=None, a row of the data is just a list of strings, and the TextFileReader infers their types:

... ['1)', 'Already Happened', 'Series', '2023', '0 Rating disabledRating disabled']...

With extract_links set to anything other than None, a row of the data is a sequence of tuples of strings/None, and TextFileReader does not recursively infer types:

... ('1)', None), ('Already Happened', '/en/work/2075121/'), ('Series', None), ('2023', None), ('0 Rating disabledRating disabled', None)]...

So I guess fundamentally the question is whether TextFileReader should recursively infer types inside container values such as these tuples.
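To make that question concrete, here is a rough sketch of what "recursive" inference over a tuple column could look like: unzip the column into one Series per tuple position, infer each position independently, and zip the results back into tuples. It is purely illustrative and does not reflect how TextFileReader works internally; the function name `infer_tuple_column` is hypothetical, and `pd.to_numeric` stands in for the parser's real inference.

```python
import pandas as pd


def infer_tuple_column(col):
    """Infer a type per tuple position, then rebuild the tuples.

    Hypothetical illustration, not pandas internals.
    """
    # Unzip: one DataFrame column per tuple position.
    parts = pd.DataFrame(col.tolist(), index=col.index)
    for pos in parts.columns:
        if parts[pos].isna().all():
            continue  # keep all-None positions (e.g. no links) as None
        try:
            parts[pos] = pd.to_numeric(parts[pos])
        except (ValueError, TypeError):
            pass  # position holds non-numeric text; leave it as-is
    # Zip the per-position values back into one tuple per row.
    rebuilt = list(zip(*(parts[pos] for pos in parts.columns)))
    return pd.Series(rebuilt, index=col.index, name=col.name)


# Sample cells shaped like the rows quoted above.
years = pd.Series([("2023", None), ("2022", None)], name="Release Year")
titles = pd.Series([("Already Happened", "/en/work/2075121/")], name="Title")

inferred_years = infer_tuple_column(years)
inferred_titles = infer_tuple_column(titles)
```

Under these assumptions the year cells become tuples holding the number 2023 rather than the string '2023', while the title cells pass through unchanged. Even this toy version has to decide per position what counts as "all missing", which hints at why a general solution may be tricky.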

MarcoGorelli commented 1 year ago

moving off the 2.0 milestone as it's a regression from 1.5