googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

Pandas dataframe summarization doesn't handle custom extension arrays #4615

Open softwaredoug opened 1 month ago

softwaredoug commented 1 month ago

Describe the current behavior

I'm using a Pandas extension array called SearchArray. When I create a column of this type and then have Colab display the dataframe (just evaluating the bare dataframe in a notebook cell and expecting HTML output), Colab takes upwards of 6-7 minutes to execute.

What seems to be happening is that Colab calls _summarize_dataframe, which in turn calls nunique / unique, and those visit every value. For extension arrays, it's not a safe assumption that you can cheaply visit every value (in my case the data is stored in an inverted index, and uninverting is costly: it can be done for a few rows, but is hard on the entire DF).

    # Add additional properties to the output dictionary
    try:
      nunique = column.nunique()
      properties["num_unique_values"] = nunique
    except TypeError:
      pass
    if "samples" not in properties:
      try:
        non_null_values = column[column.notnull()].unique()
        n_samples = min(n_samples, len(non_null_values))
        samples = (
            pd.Series(non_null_values)
            .sample(n_samples, random_state=42)
            .tolist()
        )
        properties["samples"] = samples

Describe the expected behavior

Only visit the rows that will actually be displayed when serializing to string. Perhaps fall back to just _repr_html_ if possible for extension types.
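To make the request concrete, here is a rough sketch of what I have in mind. It is not the actual colabtools code: `summarize_column` and its `max_rows` parameter are hypothetical names I made up for illustration, and the only assumption is that extension-typed columns can be detected via `pd.api.extensions.ExtensionDtype`.

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def summarize_column(column: pd.Series, n_samples: int = 5, max_rows: int = 20) -> dict:
    """Hypothetical per-column summary that avoids full scans of extension arrays."""
    properties = {"dtype": str(column.dtype)}

    if isinstance(column.dtype, ExtensionDtype):
        # Extension arrays may be expensive to materialize, so only look at
        # the first `max_rows` rows and skip nunique()/unique() entirely.
        head = column.iloc[:max_rows]
        non_null = head[head.notnull()]
        properties["samples"] = non_null.iloc[:n_samples].tolist()
        return properties

    # Plain NumPy-backed columns keep the existing full-column behavior.
    properties["num_unique_values"] = int(column.nunique())
    non_null_values = column[column.notnull()].unique()
    k = min(n_samples, len(non_null_values))
    properties["samples"] = (
        pd.Series(non_null_values).sample(k, random_state=42).tolist()
    )
    return properties
```

Note that this check is deliberately broad: pandas' own extension dtypes (categorical, nullable `Int64`, etc.) would also take the cheap path, so a real fix might want to special-case only third-party dtypes.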

What web browser you are using (Chrome, Firefox, Safari, etc.)

Chrome

Additional context

Notebook https://colab.research.google.com/drive/12B5K2Kb4o8djZQV54afRjPPSd9vTiNNs?authuser=1#scrollTo=H6mWXxGhxNFg

softwaredoug commented 1 month ago

For now I am creating a stub implementation of "unique" that returns the current (non-unique) array as a hacky workaround, since unique is practically unsupported in my case.
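Roughly, the workaround looks like the sketch below. This is a toy array, not SearchArray itself: `TokensDtype` and `TokensArray` are hypothetical names, and the array just wraps an object ndarray of strings so the example stays self-contained. The only relevant part is the `unique` override at the end.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype, take
from pandas.api.indexers import check_array_indexer


class TokensDtype(ExtensionDtype):
    """Hypothetical dtype for the toy array below."""
    name = "tokens"
    type = str

    @classmethod
    def construct_array_type(cls):
        return TokensArray


class TokensArray(ExtensionArray):
    """Toy extension array backed by an object ndarray (stand-in for SearchArray)."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(list(scalars))

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))

    @property
    def dtype(self):
        return TokensDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item]
        item = check_array_indexer(self, item)
        return type(self)(self._data[item])

    def isna(self):
        return np.array([v is None for v in self._data], dtype=bool)

    def take(self, indices, allow_fill=False, fill_value=None):
        result = take(self._data, indices, allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(result)

    def copy(self):
        return type(self)(self._data.copy())

    def unique(self):
        # Hacky stub: skip deduplication and hand back the array unchanged,
        # so callers such as Series.unique() never trigger a full scan.
        return self
```

The obvious downside is that anything relying on `unique` (including `nunique`) now reports duplicates as distinct values, so this only makes sense as a stopgap.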

cperry-goog commented 4 weeks ago

Thanks for this - tracking internally at b/345484881