Transparent markers misbehaving in plotly.express.scatter

lgi1sgm commented 2 months ago

Description

I create a scatter plot using the color and size arguments. One subset gets transparent markers and is visible in neither the plot nor the legend.

However, if I hover over an area where a marker should be, the tooltip appears (see figure below):

Expected Behavior

The markers should be visible in the plot and the legend.

Reproduction

Running the code below produces the figure above.

# %%
# Imports

import sys
import plotly
import pandas as pd
import plotly.express as px

print(f'Python version: {sys.version}')  # Mine is: 3.11.9
print(f'Pandas version: {pd.__version__}')  # Mine is: 2.2.2
print(f'Plotly version: {plotly.__version__}')  # Mine is: 5.22.0

# %%
# Create input frame

df = pd.DataFrame(
  [
    [11739,21.329416,10.010795,2,1],
    [20500,21.860714,12.238669,2,2],
    [1504,21.927166,10.314574,2,1],
    [28194,21.257576,12.823945,2,3],
    [9008,21.886381,9.579169,2,1],
    [17073,21.57327,11.087076,2,1],
    [40734,21.069445,11.887547,3,0],
    [36405,22.397081,11.608735,3,0],
    [36919,21.95463,12.856195,3,0],
    [9867,20.893126,10.761697,2,1]
  ],
  columns=['id' ,'x', 'y', 'loop_number', 'repetition']
)

df.set_index('id', inplace=True)

df.x = df.x.astype('float32')
df.y = df.y.astype('float32')
df.loop_number = df.loop_number.astype('category')  # As category to use color labels, not a color bar.
df.repetition = df.repetition.astype('int32')

df.head()

# %%
# Create Scatter Plot

px.scatter(
  df,
  x='x',
  y='y',
  color='loop_number',
  size='repetition',
  labels={  # keys must be column names to take effect
    'loop_number': 'Type',
    'repetition': 'Size'
  },
  hover_name=df.index
)

Rachmanichou commented 2 months ago

Hi, the problem is with your data. The missing data points have repetition set to zero, so their marker size is zero and they are invisible.
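To see this in the data from the reproduction, here is a small plain-Python check that groups the repetition values by loop_number. The tuple layout and the name sizes_by_group are only for illustration and are not part of the original code.

```python
# Rows copied from the reproduction: (id, x, y, loop_number, repetition)
rows = [
    (11739, 21.329416, 10.010795, 2, 1),
    (20500, 21.860714, 12.238669, 2, 2),
    (1504, 21.927166, 10.314574, 2, 1),
    (28194, 21.257576, 12.823945, 2, 3),
    (9008, 21.886381, 9.579169, 2, 1),
    (17073, 21.57327, 11.087076, 2, 1),
    (40734, 21.069445, 11.887547, 3, 0),
    (36405, 22.397081, 11.608735, 3, 0),
    (36919, 21.95463, 12.856195, 3, 0),
    (9867, 20.893126, 10.761697, 2, 1),
]

# Collect the size values for each value of the color column.
sizes_by_group = {}
for _id, _x, _y, loop_number, repetition in rows:
    sizes_by_group.setdefault(loop_number, []).append(repetition)

for loop_number, sizes in sorted(sizes_by_group.items()):
    status = "all sizes are zero" if not any(sizes) else "has non-zero sizes"
    print(f"loop_number={loop_number}: repetition={sizes} ({status})")
```

Every row with loop_number 3 has repetition 0, which would explain why that whole group is missing from the plot while the hover tooltips still work.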

lgi1sgm commented 2 months ago

OK, thanks for that.

But is that the intended behavior? I would expect plotly to calculate some reasonable sizes.

What if I wanted to plot very large or very small values, such as city populations or bacteria diameters, represented as marker sizes?

lgi1sgm commented 2 months ago

Workaround for the example code above:

# %%
# Imports

import sys
import plotly
import pandas as pd
import plotly.express as px

print(f'Python version: {sys.version}')  # Mine is: 3.11.9
print(f'Pandas version: {pd.__version__}')  # Mine is: 2.2.2
print(f'Plotly version: {plotly.__version__}')  # Mine is: 5.22.0

# %%
# Create input frame

df = pd.DataFrame(
  [
    [11739,21.329416,10.010795,2,1],
    [20500,21.860714,12.238669,2,2],
    [1504,21.927166,10.314574,2,1],
    [28194,21.257576,12.823945,2,3],
    [9008,21.886381,9.579169,2,1],
    [17073,21.57327,11.087076,2,1],
    [40734,21.069445,11.887547,3,0],
    [36405,22.397081,11.608735,3,0],
    [36919,21.95463,12.856195,3,0],
    [9867,20.893126,10.761697,2,1]
  ],
  columns=['id' ,'x', 'y', 'loop_number', 'repetition']
)

df.set_index('id', inplace=True)

df.x = df.x.astype('float32')
df.y = df.y.astype('float32')
df.loop_number = df.loop_number.astype('category')  # As category to use color labels, not a color bar.
df.repetition = df.repetition.astype('int32')

# ============================================================================
# ----------------------------------------------------------------------------
# This is a workaround for the issue. It seems that the size is calculated
# directly from the value of the column, so a size of zero leads to a marker
# with a diameter or area of 0.
#
df.repetition = df.repetition + 1
#
# ----------------------------------------------------------------------------
# ============================================================================

df.head()

# %%
# Create Scatter Plot

px.scatter(
  df,
  x='x',
  y='y',
  color='loop_number',
  size='repetition',
  labels={  # keys must be column names to take effect
    'loop_number': 'Type',
    'repetition': 'Size'
  },
  hover_name=df.index
)

Rachmanichou commented 2 months ago

If you wanted to plot large values, or values with a large span, you would probably have to scale them beforehand. One option is to subtract the mean and divide by the standard deviation: (x - mean)/std. This centers the values on 0 with a standard deviation of 1. Note that the result contains negative values, which cannot be used as marker sizes directly. There are other methods, such as scaling with the minimum and maximum values, which keeps everything in a fixed positive range.
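A minimal sketch of the minimum/maximum variant, in plain Python. The function name min_max_scale and the target range of 1 to 10 are assumptions chosen for illustration; the lower bound is kept above zero so that no marker ends up with size 0.

```python
def min_max_scale(values, lo=1.0, hi=10.0):
    """Linearly map values onto [lo, hi]; lo > 0 keeps every marker visible."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # All values are identical: give every point the same mid-range size.
        return [(lo + hi) / 2.0 for _ in values]
    span = vmax - vmin
    return [lo + (v - vmin) * (hi - lo) / span for v in values]


print(min_max_scale([0, 1, 2, 3]))  # [1.0, 4.0, 7.0, 10.0]
```

With the frame from this thread, one could store the result in a new column, for example df['repetition_scaled'] = min_max_scale(df['repetition'].tolist()), and pass size='repetition_scaled' to px.scatter.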

lgi1sgm commented 2 months ago

Yes, I understand.

The only open question for me is whether this behavior is intended. I'm not convinced, because the documentation states:

size (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign mark sizes.

The last sentence, "Values from this column (...) are used to assign mark sizes", tells me that if I pass huge size values, the markers should become huge, but that is not the case.

This is the image I get when I use values from 1'000'000 to 1'000'003.

Now the markers are similar in size, as I expected, but they are not huge, which I also expected based on the documentation.
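A rough model of why this happens, as a sketch only: as far as I understand, px.scatter does not use the values as pixel sizes but scales marker area relative to the largest value, so that the largest marker gets a diameter of about size_max (default 20). The function approx_marker_diameter below is a hypothetical helper that mimics this assumption; it is not part of plotly.

```python
import math


def approx_marker_diameter(value, max_value, size_max=20):
    """Assumed model: marker area is proportional to value / max_value."""
    return size_max * math.sqrt(value / max_value)


for values in ([0, 1, 2, 3], [1_000_000, 1_000_001, 1_000_002, 1_000_003]):
    largest = max(values)
    print([round(approx_marker_diameter(v, largest), 2) for v in values])
```

Under this model a value of 0 gives a diameter of 0, and values between 1'000'000 and 1'000'003 all end up at roughly 20 pixels, which matches both observations in this thread. The size_max argument of px.scatter changes the upper bound.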

In comparison, this is the image I get if I use seaborn instead. Seaborn somehow calculates marker sizes internally and that is actually the behavior I expected:

Long story short, this issue is resolved for me. The only remaining question is whether the maintainers want to adapt the documentation to better reflect what the size argument does.

Example code:

# %%
# Imports

import sys
import seaborn as sns
import plotly
import pandas as pd
import plotly.express as px

print(f'Python version: {sys.version}')  # Mine is: 3.11.9
print(f'Pandas version: {pd.__version__}')  # Mine is: 2.2.2
print(f'Plotly version: {plotly.__version__}')  # Mine is: 5.22.0

# %%
# Create input frame

df = pd.DataFrame(
  [
    [11739,21.329416,10.010795,2,1],
    [20500,21.860714,12.238669,2,2],
    [1504,21.927166,10.314574,2,1],
    [28194,21.257576,12.823945,2,3],
    [9008,21.886381,9.579169,2,1],
    [17073,21.57327,11.087076,2,1],
    [40734,21.069445,11.887547,3,0],
    [36405,22.397081,11.608735,3,0],
    [36919,21.95463,12.856195,3,0],
    [9867,20.893126,10.761697,2,1]
  ],
  columns=['id' ,'x', 'y', 'loop_number', 'repetition']
)

df.set_index('id', inplace=True)

df.x = df.x.astype('float32')
df.y = df.y.astype('float32')
df.loop_number = df.loop_number.astype('category')  # As category to use color labels, not a color bar.
df.repetition = df.repetition.astype('int32')

# ============================================================================
# ----------------------------------------------------------------------------
# Shift the sizes to 1'000'000..1'000'003 to test how plotly handles large
# values. If the column value were used directly as the marker size, the
# markers should now be huge.
#
df.repetition = df.repetition + 1000000
#
# ----------------------------------------------------------------------------
# ============================================================================

df.head()

# %%
# Create Scatter Plot

px.scatter(
  df,
  x='x',
  y='y',
  color='loop_number',
  size='repetition',
  labels={  # keys must be column names to take effect
    'loop_number': 'Type',
    'repetition': 'Size'
  },
  hover_name=df.index
)

# %%
# Compare the result to Seaborn

sns.scatterplot(
  df,
  x='x',
  y='y',
  hue='loop_number',
  size='repetition'
)

plotly / plotly.py

Transparent markers misbehaving in plotly.express.scatter #4664