
LLooM

PROJECT PAGE | Paper | Demo Examples


LLooM is an interactive text analysis tool introduced as part of an ACM CHI 2024 paper:

Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM. Michelle S. Lam, Janice Teoh, James Landay, Jeffrey Heer, Michael S. Bernstein. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24).

LLooM splash figure

What is LLooM?

LLooM is an interactive data analysis tool for unstructured text data, such as social media posts, paper abstracts, and articles. Manual text analysis is laborious and challenging to scale to large datasets, and automated approaches like topic modeling and clustering tend to focus on lower-level keywords that can be difficult for analysts to interpret.

By contrast, the LLooM algorithm turns unstructured text into meaningful high-level concepts that are defined by explicit inclusion criteria in natural language. For example, on a dataset of toxic online comments, while a BERTopic model outputs "women, power, female", LLooM produces concepts such as "Criticism of gender roles" and "Dismissal of women's concerns". We call this process concept induction: a computational process that produces high-level concepts from unstructured text.

The LLooM Workbench is an interactive text analysis tool that visualizes data in terms of the concepts that LLooM surfaces. With the LLooM Workbench, data analysts can inspect the automatically-generated concepts and author their own custom concepts to explore the data.

What can I do with LLooM?

LLooM can assist with a range of data analysis goals, from preliminary exploratory analysis to theory-driven confirmatory analysis. Analysts can review LLooM concepts to interpret emergent trends in the data, but they can also author their own concepts to actively seek out particular phenomena. Concepts can be compared with existing metadata or with other concepts to perform statistical analyses, generate plots, or train a model.
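For instance, because concept scores are tabular, they can be joined with existing metadata using ordinary dataframe operations. The snippet below is a minimal sketch of that kind of comparison using pandas; the `score_df` layout, column names, and metadata table are all hypothetical illustrations, not the output format of the LLooM package.

```python
import pandas as pd

# Hypothetical LLooM output: one row per (document, concept) with a 0-1 score.
score_df = pd.DataFrame({
    "doc_id":  [1, 1, 2, 2],
    "concept": ["Criticism of gender roles", "Dismissal of women's concerns"] * 2,
    "score":   [0.9, 0.1, 0.2, 0.8],
})

# Hypothetical existing metadata for the same documents.
meta_df = pd.DataFrame({
    "doc_id":   [1, 2],
    "platform": ["forum", "social"],
})

# Join concept scores with metadata, then compare concept prevalence across groups.
merged = score_df.merge(meta_df, on="doc_id")
prevalence = (
    merged.groupby(["platform", "concept"])["score"]
    .mean()
    .unstack("concept")
)
print(prevalence)
```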

LLooM pull figure

Example notebooks

Check out the Examples section of our documentation to walk through case studies using LLooM.

Workbench visualization

LLooM Workbench UI

After running concept induction, the Workbench can display an interactive visualization like the one above, summarizing the dataset in terms of the generated concepts.

How does LLooM work?

LLooM is a concept induction algorithm that extracts and applies concepts to make sense of unstructured text datasets. LLooM leverages large language models (specifically GPT-3.5 and GPT-4 in the current implementation) to synthesize sampled text spans, generate concepts defined by explicit criteria, apply concepts back to data, and iteratively generalize to higher-level concepts.
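In rough pseudocode, the loop alternates between proposing concepts from samples of the data and scoring every document against each concept's inclusion criteria. The sketch below is only an illustration of that loop; the `Concept` class and the `llm_summarize`, `llm_propose_concepts`, and `llm_matches_criteria` helpers are hypothetical stand-ins for LLM calls, not part of the text_lloom package.

```python
import random
from dataclasses import dataclass

@dataclass
class Concept:
    name: str        # e.g., "Criticism of gender roles"
    criteria: str    # natural-language inclusion criteria used for scoring

def concept_induction(docs, llm_summarize, llm_propose_concepts, llm_matches_criteria,
                      sample_size=20, n_rounds=2):
    """Illustrative concept induction loop; NOT the text_lloom implementation.

    Each round: distill a sample of documents into short spans, synthesize
    concepts (with explicit criteria) from those spans, score every document
    against each concept, then generalize from the matching documents.
    """
    concepts, scores = [], {}
    pool = list(docs)
    for _ in range(n_rounds):
        sample = random.sample(pool, min(sample_size, len(pool)))
        spans = [llm_summarize(doc) for doc in sample]          # distill
        concepts = llm_propose_concepts(spans, concepts)        # synthesize / generalize
        scores = {
            c.name: [llm_matches_criteria(doc, c.criteria) for doc in docs]
            for c in concepts
        }                                                       # score
        # Seed the next round with documents that matched at least one concept,
        # so later rounds can generalize to higher-level concepts.
        matched = [d for i, d in enumerate(docs)
                   if any(scores[c.name][i] for c in concepts)]
        pool = matched or pool
    return concepts, scores
```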

LLooM splash figure

Get Started

Follow the Get Started instructions in our documentation for a walkthrough of the main LLooM functions you can run on your own dataset. We suggest starting with this template Colab Notebook.

This involves installing our Python package, available on PyPI as text_lloom. We recommend setting up a virtual environment with venv or conda.

pip install text_lloom
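
Once installed, a typical session looks roughly like the following sketch, intended to be run in a notebook (e.g., the template Colab) since the generation and scoring calls are async. Function and parameter names here follow the documentation's quickstart but may differ across text_lloom versions; treat them as assumptions and consult the Get Started guide for the exact API.

```python
import os
import pandas as pd
import text_lloom.workbench as wb

# LLooM calls the OpenAI API, so an API key needs to be configured first.
os.environ["OPENAI_API_KEY"] = "sk-..."  # your key here

# Any dataframe with a text column works; the file and column names are placeholders.
df = pd.read_csv("your_data.csv")

l = wb.lloom(
    df=df,
    text_col="text",    # name of the column containing your documents
    id_col="doc_id",    # optional unique ID column
)

# Generate concepts, review and select them, score the documents, and visualize.
await l.gen()                # concept generation (async; run in a notebook)
l.select()                   # interactively review/select concepts
score_df = await l.score()   # apply selected concepts back to every document
l.vis()                      # open the interactive Workbench visualization
```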

Contact

LLooM is a research prototype and still under active development! Feel free to reach out to Michelle Lam at mlam4@cs.stanford.edu if you have questions, run into issues, or want to contribute.

Citation

If you find this work useful, we'd appreciate you citing our paper!

@inproceedings{lam2024conceptInduction,
    author = {Lam, Michelle S. and Teoh, Janice and Landay, James and Heer, Jeffrey and Bernstein, Michael S.},
    title = {Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM},
    year = {2024},
    isbn = {9798400703300},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3613904.3642830},
    doi = {10.1145/3613904.3642830},
    booktitle = {Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems},
    articleno = {933},
    numpages = {28},
    location = {Honolulu, HI, USA},
    series = {CHI '24}
}