michelle123lam / lloom

Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM (CHI 2024 paper). LLooM automatically surfaces high-level concepts to analyze unstructured text.
https://stanfordhci.github.io/lloom
BSD 3-Clause "New" or "Revised" License

ValueError: The column label 'ID' is not unique. #12

shya-me closed this issue 2 days ago

shya-me commented 3 months ago

I have a dataset with two columns, ID and Text, which I specified when constructing the LLooM instance:

l = wb.lloom(
    df=dataset,
    text_col="Text",
    id_col="ID",  # Optional
)

After the concept generation and scoring steps, I try to visualize the data with:

l.vis(slice_col="Text")

and I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-fcb9df7f7948> in <cell line: 1>()
----> 1 l.vis(slice_col="Text")

8 frames
/usr/local/lib/python3.10/dist-packages/text_lloom/workbench.py in vis(self, cols_to_show, slice_col, max_slice_bins, slice_bounds, show_highlights, norm_by, export_df, include_outliers)
    648         score_df = self.get_score_df()
    649 
--> 650         widget, matrix_df, item_df, item_df_wide = visualize(
    651             in_df=self.in_df,
    652             score_df=score_df,

/usr/local/lib/python3.10/dist-packages/text_lloom/concept_induction.py in visualize(in_df, score_df, doc_col, doc_id_col, score_col, df_filtered, df_bullets, concepts, cols_to_show, slice_col, max_slice_bins, slice_bounds, show_highlights, norm_by, debug)
   1404 # - debug: boolean (whether to print debug statements)
   1405 def visualize(in_df, score_df, doc_col, doc_id_col, score_col, df_filtered, df_bullets, concepts, cols_to_show=[], slice_col=None, max_slice_bins=None, slice_bounds=None, show_highlights=False, norm_by=None, debug=False):
-> 1406     matrix_df, item_df, item_df_wide, metadata_dict = prep_vis_dfs(in_df, score_df, doc_id_col, doc_col, score_col, df_filtered, df_bullets, concepts, cols_to_show=cols_to_show, slice_col=slice_col, max_slice_bins=max_slice_bins, slice_bounds=slice_bounds,show_highlights=show_highlights, norm_by=norm_by, debug=debug)
   1407 
   1408     data = matrix_df.to_json(orient='records')

/usr/local/lib/python3.10/dist-packages/text_lloom/concept_induction.py in prep_vis_dfs(df, score_df, doc_id_col, doc_col, score_col, df_filtered, df_bullets, concepts, cols_to_show, slice_col, max_slice_bins, slice_bounds, show_highlights, norm_by, debug, threshold, outlier_threshold)
   1204 
   1205     # Fetch the results table
-> 1206     df = get_concept_col_df(df, score_df, concepts, doc_id_col, doc_col, score_col, cols_to_show)
   1207     df[doc_id_col] = df[doc_id_col].astype(str)  # Ensure doc_id_col is string type
   1208 

/usr/local/lib/python3.10/dist-packages/text_lloom/concept_induction.py in get_concept_col_df(df, score_df, concepts, doc_id_col, doc_col, score_col, cols_to_show)
   1127         cur_df[doc_id_col] = cur_df[doc_id_col].astype(str)
   1128         c_df[doc_id_col] = c_df[doc_id_col].astype(str)
-> 1129         cur_df = cur_df.merge(c_df, on=doc_id_col, how="left")
   1130     return cur_df
   1131 

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
   9841         from pandas.core.reshape.merge import merge
   9842 
-> 9843         return merge(
   9844             self,
   9845             right,

/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    146     validate: str | None = None,
    147 ) -> DataFrame:
--> 148     op = _MergeOperation(
    149         left,
    150         right,

/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    735             self.right_join_keys,
    736             self.join_names,
--> 737         ) = self._get_merge_keys()
    738 
    739         # validate the merge keys dtypes. We may need to coerce

/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/merge.py in _get_merge_keys(self)
   1219                         #  the latter of which will raise
   1220                         lk = cast(Hashable, lk)
-> 1221                         left_keys.append(left._get_label_or_level_values(lk))
   1222                         join_names.append(lk)
   1223                     else:

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in _get_label_or_level_values(self, key, axis)
   1790 
   1791             label_axis_name = "column" if axis == 0 else "index"
-> 1792             raise ValueError(
   1793                 f"The {label_axis_name} label '{key}' is not unique.{multi_message}"
   1794             )

ValueError: The column label 'ID' is not unique.
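
For reference, pandas raises this message from `merge` when the key passed to `on` matches more than one column in the left frame. The sketch below reproduces that mechanism with made-up data. The frames and the `try_merge` helper are hypothetical, and it is only an assumption that a duplicated 'ID' label is what happened inside LLooM here; the traceback alone does not confirm it.

```python
import pandas as pd

# Hypothetical data: the left frame carries the label "ID" twice.
left = pd.DataFrame(
    [["1", "first doc", "1"], ["2", "second doc", "2"]],
    columns=["ID", "Text", "ID"],
)
right = pd.DataFrame({"ID": ["1", "2"], "score": [0.9, 0.1]})


def try_merge(left_df, right_df, key):
    """Return the merged frame, or the error message if pandas rejects the key."""
    try:
        return left_df.merge(right_df, on=key, how="left")
    except ValueError as err:
        return str(err)


result = try_merge(left, right, "ID")
print(result)
```

If the left frame held a single 'ID' column, the same call would return a merged DataFrame instead of an error string.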

michelle123lam commented 3 months ago

Hi, I'm trying to reproduce the issue on my end, but I haven't been able to. Does it work if you run the following without the slice column?

l.vis()

The slice column is meant for columns that contain categorical labels or quantitative values (such as metadata), not for the original text itself, which appears to be what you passed, given that "Text" was the text_col argument to the LLooM instance. Let me know if you still run into issues, though; something else about your dataset or dataframe may be causing this error.
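
As a rough way to follow up on both points, the sketch below checks a dataframe for repeated column labels and repeated ID values, then derives a categorical column that would suit slicing. The `dataset` frame, the `find_duplicate_labels` helper, and the `length_bucket` column are all hypothetical stand-ins, not part of LLooM; only pandas calls are used.

```python
import pandas as pd


def find_duplicate_labels(df):
    """Return the column labels that appear more than once in df."""
    return df.columns[df.columns.duplicated()].unique().tolist()


# Hypothetical stand-in for the dataset described in this issue.
dataset = pd.DataFrame({
    "ID": [1, 2, 3],
    "Text": ["short", "a somewhat longer document", "mid length text"],
})

dup_labels = find_duplicate_labels(dataset)  # empty list when every label is unique
ids_unique = dataset["ID"].is_unique         # True when no ID value repeats

# Derive a categorical metadata column from text length (bin edges are arbitrary).
dataset["length_bucket"] = pd.cut(
    dataset["Text"].str.len(),
    bins=[0, 10, 20, float("inf")],
    labels=["short", "medium", "long"],
)

print(dup_labels, ids_unique)
print(dataset[["ID", "length_bucket"]])
```

With a column like this in the dataframe passed to the LLooM instance, the visualization call would take the form `l.vis(slice_col="length_bucket")` rather than pointing at the text column.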