pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.8k stars 17.98k forks

DOC: Hexbin creates inconsistent size of hexagons for nearly the same data, depending on distance from axis limits #44528

Open stelios-c opened 2 years ago

stelios-c commented 2 years ago

Reproducible Example

import pandas as pd
from io import StringIO

# Two nearly identical datasets; only the two outlying points differ (9 vs. 3).
csv_data = "x,y\n1,1\n1,2\n2,1\n1,9\n9,1"
csv_data2 = "x,y\n1,1\n1,2\n2,1\n1,3\n3,1"

# Both plots use the same axis limits and the same gridsize.
pd.read_csv(StringIO(csv_data)).plot(kind='hexbin', x='x', y='y',
                   xlim=[-0.5, 10.5], ylim=[0, 10.5],
                   gridsize=[10, 10], sharex=False)
pd.read_csv(StringIO(csv_data2)).plot(kind='hexbin', x='x', y='y',
                   xlim=[-0.5, 10.5], ylim=[0, 10.5],
                   gridsize=[10, 10], sharex=False)

Issue Description

csv_data gets plotted with big hexagons touching each other, as expected. csv_data2 gets plotted with much smaller hexagons, probably because there is no data near the right xlim or the upper ylim.

Expected Behavior

I was expecting the same hexagon size for both datasets.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-18-cloud-amd64
Version : #1 SMP Debian 4.19.208-1 (2021-09-29)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.4
numpy : 1.19.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3
setuptools : 58.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 2021.10.1
fastparquet : None
gcsfs : 2021.10.1
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.26
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.54.1

stelios-c commented 2 years ago

[Image: hexbin_inconsistent]

Lstsk commented 2 years ago

Hello, I don't think this is a bug. The hexagons get smaller because the grid adjusts its size to fit data points that are so close to each other. Perhaps you could adjust gridsize, xlim, and ylim to make the graph look better?
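
To make the size difference concrete, here is a rough sketch of the arithmetic. It assumes the grid is laid over the min/max of the data when no explicit extent is given, which is how matplotlib's hexbin appears to behave; the helper name hexagon_width is made up for illustration, and the small padding matplotlib adds is ignored.

```python
def hexagon_width(values, n_hexagons):
    """Approximate hexagon width if the grid spans only the data range."""
    return (max(values) - min(values)) / n_hexagons


# x values from csv_data and csv_data2, with gridsize 10 in the x-direction.
wide = hexagon_width([1, 1, 2, 1, 9], 10)    # data range 1..9
narrow = hexagon_width([1, 1, 2, 1, 3], 10)  # data range 1..3

# For comparison: a grid spanning the axis limits -0.5..10.5 instead.
fixed = hexagon_width([-0.5, 10.5], 10)

print(wide, narrow, fixed)
```

Under that assumption the second dataset gets hexagons about a quarter as wide as the first, even though the axis limits and gridsize are identical.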

stelios-c commented 2 years ago

Thanks @Copastr. I'm trying to do an A/B type comparison, so I want xlim and ylim to be the same. But I agree, it looks like gridsize needs to be adjusted to the range of the data. Maybe this should be treated as a documentation bug, because https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hexbin.html only says:

gridsize : int or tuple of (int, int), default 100
The number of hexagons in the x-direction. The corresponding number of hexagons in the y-direction is chosen in a way that the hexagons are approximately regular. Alternatively, gridsize can be a tuple with two elements specifying the number of hexagons in the x-direction and the y-direction.

This does not make clear that the hexagon grid does not extend all the way to xlim/ylim, but only covers the extent of the data.
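
One possible workaround for an A/B comparison, sketched here with matplotlib directly: Axes.hexbin accepts an extent=(xmin, xmax, ymin, ymax) argument that fixes the area the grid covers. The sketch below assumes the grid should span the same rectangle as the axis limits from the report. The names points_a, points_b, LIMITS and hexagon_width are illustrative only, and reading the hexagon shape from the returned collection relies on how matplotlib builds it for linear axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Same points as csv_data and csv_data2 in the report, as (x, y) lists.
points_a = ([1, 1, 2, 1, 9], [1, 2, 1, 9, 1])
points_b = ([1, 1, 2, 1, 3], [1, 2, 1, 3, 1])

# Assumption: the grid should span the axis limits used in the report.
LIMITS = (-0.5, 10.5, 0, 10.5)  # (xmin, xmax, ymin, ymax)


def hexagon_width(points, extent=None):
    """Draw one hexbin plot and return the width of a hexagon in data units."""
    x, y = points
    fig, ax = plt.subplots()
    collection = ax.hexbin(x, y, gridsize=(10, 10), extent=extent)
    ax.set_xlim(LIMITS[0], LIMITS[1])
    ax.set_ylim(LIMITS[2], LIMITS[3])
    # On linear axes, hexbin draws one hexagon shape repeated at many offsets.
    vertices = collection.get_paths()[0].vertices
    plt.close(fig)
    return vertices[:, 0].max() - vertices[:, 0].min()


# Without extent, the width follows the data range and differs per dataset.
print(hexagon_width(points_a), hexagon_width(points_b))
# With a shared extent, both datasets get hexagons of the same width.
print(hexagon_width(points_a, LIMITS), hexagon_width(points_b, LIMITS))
```

The pandas documentation describes extra keyword arguments to DataFrame.plot as options passed to the matplotlib plotting method, so passing extent=... alongside xlim, ylim and gridsize in the original .plot(kind='hexbin', ...) call should have the same effect, though that is not verified here against pandas 1.3.4.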