vega / altair

Declarative statistical visualization library for Python
https://altair-viz.github.io/
BSD 3-Clause "New" or "Revised" License

[SchemaValidationError] 'selection' was unexpected #2251

Closed abhinavsingh closed 3 years ago

abhinavsingh commented 4 years ago

Hi,

I am following a Google Colab which uses altair for visualization.

I don't plan to use an interactive notebook, so I am using altair-viewer to render charts from the Python terminal. Unfortunately, I run into the following exception:

    raise SchemaValidationError(self, err)
altair.utils.schemapi.SchemaValidationError: Invalid specification

        altair.vegalite.v4.schema.channels.Color, validating 'additionalProperties'

        Additional properties are not allowed ('selection' was unexpected)

Here is my sample code (same as one found on colab):

import altair as alt

alt.data_transformers.enable('default', max_rows=None)
alt.renderers.enable('altair_viewer')

# Load data (skipped for brevity)

occupation_filter = alt.selection_multi(fields=["occupation"])
occupation_chart = alt.Chart().mark_bar().encode(
        x="count()",
        y=alt.Y("occupation:N"),
        color=alt.condition(
            occupation_filter,
            alt.Color("occupation:N", scale=alt.Scale(scheme='category20')),
            alt.value("lightgray")),
    ).properties(width=300, height=300, selection=occupation_filter)

# Explicitly call show
occupation_chart.show()

I have tried both pip install altair and pip install git+git://github.com/altair-viz/altair.git

The following example from the Altair documentation works fine from the terminal:

>>> import altair as alt
>>> 
>>> # load a simple dataset as a pandas DataFrame
... from vega_datasets import data
>>> cars = data.cars()
>>> chart = alt.Chart(cars).mark_point().encode(
...     x='Horsepower',
...     y='Miles_per_Gallon',
...     color='Origin',
... ).interactive()
>>> chart.show()
Displaying chart at http://localhost:21098/

Please advise on how to unblock myself. Apologies if this is a beginner question; this is my first time trying Altair.

Thank you!!!

jakevdp commented 4 years ago

You cannot pass a selection directly as a chart property. Instead of

chart.properties(selection=occupation_filter)

try this:

chart.add_selection(occupation_filter)

abhinavsingh commented 4 years ago

You cannot pass a selection directly as a chart property. Instead of

chart.properties(selection=occupation_filter)

try this:

chart.add_selection(occupation_filter)

Thank you for the prompt reply.

  1. Unfortunately, that didn't work; I still run into the same issue.
  2. Below is the full source code. Note that it is taken mostly as is from the Google Colab, where .properties(selection=occupation_filter) seems to work well.

import os
import pandas as pd
import numpy as np
import altair as alt

DATA_DIR_PATH = '/Users/abhinavsingh/Downloads/ml-100k'

def mask(df, key, function):
    """Returns a filtered dataframe, by applying function to key"""
    return df[function(df[key])]

def flatten_cols(df):
    df.columns = [' '.join(col).strip() for col in df.columns.values]
    return df

pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.3f}'.format
pd.DataFrame.mask = mask
pd.DataFrame.flatten_cols = flatten_cols

alt.data_transformers.enable('default', max_rows=None)
alt.renderers.enable('altair_viewer')

# Since some movies can belong to more than one genre, we create different
# 'genre' columns as follows:
# - all_genres: all the active genres of the movie.
# - genre: randomly sampled from the active genres.
def mark_genres(movies, genres):
    def get_random_genre(gs):
        active = [genre for genre, g in zip(genres, gs) if g == 1]
        if len(active) == 0:
            return 'Other'
        return np.random.choice(active)

    def get_all_genres(gs):
        active = [genre for genre, g in zip(genres, gs) if g == 1]
        if len(active) == 0:
            return 'Other'
        return '-'.join(active)
    movies['genre'] = [
        get_random_genre(gs) for gs in zip(*[movies[genre] for genre in genres])]
    movies['all_genres'] = [
        get_all_genres(gs) for gs in zip(*[movies[genre] for genre in genres])]

# A function that generates a histogram of filtered data.
def filtered_hist(field, label, filter):
    """Creates a layered chart of histograms.
    The first layer (light gray) contains the histogram of the full data, and the
    second contains the histogram of the filtered data.
    Args:
      field: the field for which to generate the histogram.
      label: String label of the histogram.
      filter: an alt.Selection object to be used to filter the data.
    """
    base = alt.Chart().mark_bar().encode(
        x=alt.X(field, bin=alt.Bin(maxbins=10), title=label),
        y="count()",
    ).properties(
        width=300,
    )
    return alt.layer(
        base.transform_filter(filter),
        base.encode(color=alt.value('lightgray'), opacity=alt.value(.7)),
    ).resolve_scale(y='independent')

def main() -> None:
    # Load users
    users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
    users = pd.read_csv(
        os.path.join(DATA_DIR_PATH, 'u.user'),
        sep='|',
        names=users_cols,
        encoding='latin-1')

    # Load ratings
    ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
    ratings = pd.read_csv(
        os.path.join(DATA_DIR_PATH, 'u.data'),
        sep='\t',
        names=ratings_cols,
        encoding='latin-1')

    # The movies file contains a binary feature for each genre.
    genre_cols = [
        "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
        "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
        "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
    ]
    movies_cols = [
        'movie_id', 'title', 'release_date', "video_release_date", "imdb_url"
    ] + genre_cols
    movies = pd.read_csv(
        os.path.join(DATA_DIR_PATH, 'u.item'),
        sep='|',
        names=movies_cols,
        encoding='latin-1')

    # Since the ids start at 1, we shift them to start at 0.
    users["user_id"] = users["user_id"].apply(lambda x: str(x-1))
    ratings["movie_id"] = ratings["movie_id"].apply(lambda x: str(x-1))
    ratings["user_id"] = ratings["user_id"].apply(lambda x: str(x-1))
    ratings["rating"] = ratings["rating"].apply(lambda x: float(x))
    movies["movie_id"] = movies["movie_id"].apply(lambda x: str(x-1))
    movies["year"] = movies['release_date'].apply(
        lambda x: str(x).split('-')[-1])

    # Compute the number of movies to which a genre is assigned.
    genre_occurences = movies[genre_cols].sum().to_dict()

    # Create all_genres and genre columns
    mark_genres(movies, genre_cols)

    # Create one merged DataFrame containing all the movielens data.
    movielens = ratings.merge(movies, on='movie_id').merge(users, on='user_id')

    # The following functions are used to generate interactive Altair charts.
    # We will display histograms of the data, sliced by a given attribute.
    # Create filters to be used to slice the data.
    occupation_filter = alt.selection_multi(fields=["occupation"])
    occupation_chart = alt.Chart().mark_bar().encode(
        x="count()",
        y=alt.Y("occupation:N"),
        color=alt.condition(
            occupation_filter,
            alt.Color("occupation:N", scale=alt.Scale(scheme='category20')),
            alt.value("lightgray")),
    ).properties(width=300, height=300)
    occupation_chart.add_selection(occupation_filter)

    '''
    users_ratings = (
        ratings
        .groupby('user_id', as_index=False)
        .agg({'rating': ['count', 'mean']})
        .flatten_cols()
        .merge(users, on='user_id')
    )

    # Create a chart for the count, and one for the mean.
    alt.hconcat(
        filtered_hist('rating count', '# ratings / user', occupation_filter),
        filtered_hist('rating mean', 'mean user rating', occupation_filter),
        occupation_chart,
        data=users_ratings)
    '''

if __name__ == '__main__':
    main()

jakevdp commented 4 years ago

I can't run your code, because your data does not exist on my system. Can you create a minimal example that reproduces the error?

jakevdp commented 4 years ago

Oh, I see the issue. When you run

occupation_chart.add_selection(occupation_filter)

it does not modify the chart in place, but rather returns a modified chart (this is true of all Altair Chart methods). So you should write

occupation_chart = occupation_chart.add_selection(occupation_filter)

or, better, chain the add_selection call onto the existing chart specification, as you did with the encode(), properties(), and other method calls.
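
The return-a-copy behaviour described above can be illustrated without Altair at all. The sketch below uses a hypothetical `ToyChart` class (it is not part of Altair and is not how Altair is implemented) whose `add_selection` method returns a modified copy, which is enough to show why discarding the return value leaves the original object unchanged.

```python
import copy


class ToyChart:
    """Hypothetical stand-in for a chart whose methods return modified copies."""

    def __init__(self):
        self.selections = []

    def add_selection(self, name):
        # Work on a deep copy so the original object is left untouched.
        new = copy.deepcopy(self)
        new.selections.append(name)
        return new


chart = ToyChart()

# Return value discarded: `chart` itself is unchanged.
chart.add_selection("occupation_filter")
unchanged = list(chart.selections)

# Reassigning (or chaining) keeps the modified copy.
chart = chart.add_selection("occupation_filter")
updated = list(chart.selections)

print(unchanged)  # []
print(updated)    # ['occupation_filter']
```

The same reasoning applies to any method chain: each call hands back a new object, so the final result has to be bound to a name before calling `show()` on it.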

abhinavsingh commented 4 years ago

or, better, chain the add_selection call onto the existing chart specification, as you did with the encode(), properties(), and other method calls.

Thanks again, but I am still running into the same issue. I cannot attach the relevant data files to a GitHub issue, but you can download the sample set from http://files.grouplens.org/datasets/movielens/ml-100k.zip (4.9 MB) and update DATA_DIR_PATH to point to the extracted folder for a reproducible example.

I am also confused about why the same syntax works in Colab but fails from the terminal. Is there a difference in the renderer APIs? See the Google Colab screenshot below:

[Screenshot: Screen Shot 2020-07-28 at 8 37 27 AM]

^^^^ Works in Colab environment

For reference here is output of pip freeze:

altair==4.1.0
altair-data-server==0.4.1
altair-viewer==0.3.0
appnope==0.1.0
attrs==19.3.0
autopep8==1.5.3
backcall==0.2.0
decorator==4.4.2
entrypoints==0.3
importlib-metadata==1.7.0
ipython==7.16.1
ipython-genutils==0.2.0
jedi==0.17.2
Jinja2==2.11.2
jsonschema==3.2.0
MarkupSafe==1.1.1
numpy==1.19.1
pandas==1.0.5
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
portpicker==1.3.1
prompt-toolkit==3.0.5
ptyprocess==0.6.0
pycodestyle==2.6.0
Pygments==2.6.1
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
toml==0.10.1
toolz==0.10.0
tornado==6.0.4
traitlets==4.3.3
vega-datasets==0.8.0
wcwidth==0.2.5
zipp==3.1.0

jakevdp commented 4 years ago

The only reason you'd get a SchemaValidationError in one but not the other would be if different Altair versions are installed.

Try running python -m pip freeze instead of a simple pip freeze to make sure you're exporting the same environment you're using with your Python interpreter.
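
Another way to check this from inside the interpreter itself is sketched below, using only the standard library. The helper names `installed_version` and `describe_environment` are made up for this illustration; the sketch simply reports which interpreter is running and which package versions that interpreter can see.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the version of `package` visible to this interpreter, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(packages):
    """Map each package name to the version this interpreter can see."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # The path shows which Python (and therefore which environment) is in use.
    print("interpreter:", sys.executable)
    for name, version in describe_environment(["altair", "altair-viewer"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this with the same `python` command used to launch the script shows whether the versions match what `pip freeze` reported.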

abhinavsingh commented 4 years ago

Yep, the environment looks perfectly fine to me.

> python -m pip freeze
altair==4.1.0
altair-data-server==0.4.1
altair-viewer==0.3.0
appnope==0.1.0
attrs==19.3.0
autopep8==1.5.3
backcall==0.2.0
decorator==4.4.2
entrypoints==0.3
importlib-metadata==1.7.0
ipython==7.16.1
ipython-genutils==0.2.0
jedi==0.17.2
Jinja2==2.11.2
jsonschema==3.2.0
MarkupSafe==1.1.1
numpy==1.19.1
pandas==1.0.5
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
portpicker==1.3.1
prompt-toolkit==3.0.5
ptyprocess==0.6.0
pycodestyle==2.6.0
Pygments==2.6.1
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
toml==0.10.1
toolz==0.10.0
tornado==6.0.4
traitlets==4.3.3
vega-datasets==0.8.0
wcwidth==0.2.5
zipp==3.1.0

FWIW, Google Colab installs altair via pip install git+git://github.com/altair-viz/altair.git. I tried the same, with the same result. Beyond that, the only difference is the renderer:

  1. alt.renderers.enable('colab')
  2. alt.renderers.enable('altair_viewer') -- One that I am using locally.

Below is a screenshot of the version that Google Colab installs, 4.2.0.dev0:

[Screenshot: Screen Shot 2020-07-29 at 7 44 29 PM]

I nuked my virtual environment and followed these steps (this time installing directly from the GitHub repo), and reproduced the same result:

> python3 -m venv venv
> source venv/bin/activate
> pip install git+git://github.com/altair-viz/altair.git 
> pip install git+git://github.com/altair-viz/altair_viewer.git
> .... run the script ...
    raise SchemaValidationError(self, err)
altair.utils.schemapi.SchemaValidationError: Invalid specification

        altair.vegalite.v4.schema.channels.Color, validating 'additionalProperties'

        Additional properties are not allowed ('selection' was unexpected)

Result of freeze on new environment:

> python -m pip freeze
altair==4.2.0.dev0
altair-data-server==0.4.1
altair-viewer==0.4.0.dev0
attrs==19.3.0
entrypoints==0.3
importlib-metadata==1.7.0
Jinja2==2.11.2
jsonschema==3.2.0
MarkupSafe==1.1.1
numpy==1.19.1
pandas==1.1.0
portpicker==1.3.1
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
toolz==0.10.0
tornado==6.0.4
zipp==3.1.0

jakevdp commented 4 years ago

Ok - if you could create a minimal reproducible example (complete code) that demonstrates the issue, that would be helpful. I’ve tried to piece one together from what you provided, but I’m unable to reproduce the error.

abhinavsingh commented 4 years ago

Ok - if you could create a minimal reproducible example (complete code) that demonstrates the issue, that would be helpful. I’ve tried to piece one together from what you provided, but I’m unable to reproduce the error.

Does it open up charts at your end? I can surely put the entire thing into a repo for you to take a look. Will share soon.

abhinavsingh commented 4 years ago

@jakevdp To reproduce, you probably also need to add occupation_chart.show(), which I left out of the code above. As posted, the script must be finishing at your end without any visible output. The error is triggered only when show() is called. Sorry, I somehow missed this line in the code above.

jakevdp commented 4 years ago

Does it open up charts at your end?

Does what open up charts? None of your code includes the data, so I cannot run it. Try to create a short, complete snippet, with no reference to data files on your computer, that I can copy and paste into a terminal to see the error you're seeing.
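
One way to drop the dependency on local data files is to inline a handful of rows in the script itself. The sketch below does this with the standard library only; the rows are made up for illustration and merely mimic the pipe-separated layout and column names of the `u.user` file used in the script above, and `load_sample_users` is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the same pipe-separated layout as u.user (illustrative only).
SAMPLE_USERS = """\
1|24|M|technician|00001
2|53|F|other|00002
3|23|M|writer|00003
4|31|F|technician|00004
"""

users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']


def load_sample_users(text=SAMPLE_USERS):
    """Parse the inline sample into a list of dicts, one per user."""
    reader = csv.DictReader(
        io.StringIO(text), fieldnames=users_cols, delimiter='|')
    return list(reader)


users = load_sample_users()
print(len(users), users[0]['occupation'])  # 4 technician
```

The resulting list of dicts can be turned into a DataFrame with `pd.DataFrame(users)`, so the chart code can be pasted into a terminal and run without any file on disk.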

joelostblom commented 3 years ago

@abhinavsingh I am going through Altair issues to find those that have been resolved and can be closed. Would you be able to close this issue, or add a comment with a short reproducible example if you are still encountering it?

abhinavsingh commented 3 years ago

I encountered this issue while going through a Colab, and I am unsure whether it is still an issue. But it is fine to close this now.

vidhant commented 1 year ago

or, better, chain the add_selection call onto the existing chart specification, as you did with the encode(), properties(), and other method calls.

For the record, this worked for me earlier today. Thanks, Jake!