Here are some notes I have for writing this up:
The goal of this project is to demonstrate that a small-scale model can effectively handle both Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks while running efficiently on lower-resource hardware. This work aims to fill the gap between traditional approaches, which require multiple independent models and substantial manual intervention, and large language models (LLMs), which are computationally expensive and raise privacy concerns.
Currently, voice assistants commonly rely on traditional NLU/NLG methods that include separate models for tasks like intent tagging (using random forest or logistic regression), entity tagging (e.g., Conditional Random Fields (CRF)), and template-based NLG for slot filling. Tools like Snips or Mycroft require these traditional techniques to work cohesively, often resulting in complex pipelines and increased resource usage.
Some more recent approaches, such as those using LLMs, have gained traction for handling NLU and NLG tasks through zero-shot or few-shot capabilities. For example, Home Assistant offers integration with LLMs (e.g., OpenAI API) for these purposes. However, LLM-based solutions pose significant challenges for privacy-focused users and those who want to avoid recurring costs or do not have powerful GPU hardware for self-hosting.
The aim of this experiment is to develop an alternative that combines the best aspects of both approaches. Specifically, this project focuses on using a single small language model to achieve effective NLU and NLG, circumventing the limitations of traditional methods while avoiding the resource requirements of LLMs. The desired outcome is a proof-of-concept that could benefit open-source, privacy-conscious voice assistant users.
The prevalent state of NLU and NLG in voice assistants involves multiple models, often making use of machine learning algorithms like random forests, CRFs, and hand-written NLG templates. These methods provide robustness but are limited in flexibility and require significant manual effort to adapt to new intents or domains.
There is also increasing usage of LLMs for voice assistant applications, with LLMs capable of achieving higher levels of generalization. However, they come with significant drawbacks, such as heavy computational demands and privacy risks.
Recent research indicates that encoder-decoder models can outperform decoder-only models for tasks like intent and entity tagging, due to their efficient sequence-to-sequence handling. Research papers such as "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (T5 Paper) provide evidence that encoder-decoder architectures are well-suited for zero-shot tasks in NLU and NLG contexts.
A major challenge with NLU is annotation quality. Inconsistent annotations, caused by ambiguous intent or entity definitions, are difficult for models to overcome: if human annotators show low agreement on labels, models are likely to face similar difficulties, leading to lower performance on tasks like entity extraction and intent classification.
For this experiment, a T5-based model, specifically Flan-T5, was selected for its encoder-decoder architecture and its instruction-based fine-tuning. This setup offers a promising balance between capability and resource efficiency, enabling zero-shot NLU and NLG while remaining practical on low-resource hardware. A single model instance can perform both intent and entity tagging, reducing computational requirements, which is particularly suitable for hardware like a Raspberry Pi 4.
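As a rough illustration of the single-instance idea, the sketch below loads one Flan-T5 checkpoint with the Hugging Face `transformers` library and sends it both an NLU-style and an NLG-style prompt. The checkpoint name, prompt wording, and generation settings are assumptions for illustration, not the exact configuration used in this project.

```python
# Minimal sketch: one Flan-T5 instance handling both an NLU-style and an NLG-style prompt.
# The checkpoint, prompts, and generation settings are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # assumed small checkpoint suitable for constrained hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def run(prompt: str) -> str:
    """Tokenize a prompt, generate a reply, and decode it back to text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# NLU-style prompt (intent tagging) and NLG-style prompt, served by the same model instance.
print(run("Classify the intent of this utterance: wake me up at 7am"))
print(run("Write a short spoken reply confirming that an alarm was set for 7am."))
```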
The NLU evaluation dataset (https://github.com/xliuhw/NLU-Evaluation-Data) was used as the foundation for training and testing. This dataset contains inconsistencies in annotation, making it a realistic representation of the challenges faced when creating voice assistants for diverse intents and domains. Additionally, a separate repository was developed to benchmark and refine the dataset before using it for this experiment.
The data was filtered to focus on intents and entities that are suitable for proof-of-concept testing. The overlap between intents/entities was analyzed, and examples of inconsistent annotations were identified and addressed where feasible. This step ensured that the model was fine-tuned on meaningful examples, allowing for a better evaluation of its zero-shot capabilities.
The dataset was formatted into columns for `utterance`, `domain`, `intent`, and `annotated_utterance`. The `domain` represents the skill being used (e.g., "alarm"), while the `intent` describes the specific action (e.g., "set_alarm"). The `annotated_utterance` was used for entity tagging, structured as phrases with tagged entities (e.g., "wake me up at [time : 7am]"). Additional open-source datasets, such as the Snips dataset (https://github.com/snipsco/snips-nlu-metrics/tree/master/samples), were also used for testing.
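To make the column layout concrete, a couple of rows in this format might look like the following; the values are invented for illustration and are not taken from the dataset.

```python
# Hypothetical example rows in the described column format; values are invented for illustration.
example_rows = [
    {
        "utterance": "wake me up at 7am",
        "domain": "alarm",
        "intent": "set_alarm",
        "annotated_utterance": "wake me up at [time : 7am]",
    },
    {
        "utterance": "what is the weather like tomorrow",
        "domain": "weather",
        "intent": "query_weather",
        "annotated_utterance": "what is the weather like [date : tomorrow]",
    },
]
```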
An earlier phase of this project involved benchmarking and refining the dataset using a prototype NLU engine built with a simple intent classifier and CRFs for entity extraction. This phase aimed to improve the dataset quality and understand the performance of traditional NLU methods before moving to more advanced encoder-decoder models. The refinement and benchmarking efforts helped identify problematic areas in the dataset, leading to cleaner, more reliable training data for subsequent experiments.
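For context, baselines of this kind are commonly built with `sklearn-crfsuite`; the sketch below shows the general shape of such a CRF entity tagger with a toy feature set, not the exact features or hyperparameters used in the earlier benchmarking phase.

```python
# Rough sketch of a CRF entity tagger baseline using sklearn-crfsuite.
# The feature set, hyperparameters, and training example are illustrative only.
import sklearn_crfsuite

def word_features(tokens, i):
    """Simple per-token features: the lowercased word, a digit flag, and its neighbors."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Tiny invented training example: tokens with BIO-style entity labels.
sentences = [["wake", "me", "up", "at", "7am"]]
labels = [["O", "O", "O", "O", "B-time"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```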
The training data was processed into prompted tasks for the model to learn. This process included:

- Formatting entity annotations either as tagged spans (e.g., `[time : 7am]`) or as slot representations (e.g., `0 0 0 0 time`).
- Including the `domain` and `intent` data.

The data processing configuration was specified in a configuration file (`config/training_data_processing_config.toml`), which allowed precise control over how data was transformed into model-compatible prompts.
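A minimal sketch of such a transformation, assuming hypothetical prompt templates (the real templates are controlled through `config/training_data_processing_config.toml`):

```python
# Hypothetical transformation of a dataset row into prompted (input, target) pairs.
# The prompt templates here are assumptions, not what the processing script actually emits.
def build_prompted_tasks(row: dict) -> list[tuple[str, str]]:
    """Turn one annotated row into intent-tagging and entity-tagging training pairs."""
    utterance = row["utterance"]
    return [
        (f"Classify the domain and intent: {utterance}",
         f"{row['domain']} {row['intent']}"),
        (f"Tag the entities: {utterance}",
         row["annotated_utterance"]),
    ]

print(build_prompted_tasks({
    "utterance": "wake me up at 7am",
    "domain": "alarm",
    "intent": "set_alarm",
    "annotated_utterance": "wake me up at [time : 7am]",
}))
```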
The model was fine-tuned using a combination of open-source NLU and NLG data. Details of the training setup include hardware specifications, the number of epochs, the learning rate, and the optimization strategies employed to make the model suitable for lower-resource hardware. Training was configured through `config/training_config.toml` and executed with a Python script (`trainer.py`). The training aimed to balance NLU tasks (intent/entity tagging) with NLG response generation, transforming the input data into effective prompted tasks for fine-tuning the T5-based model.
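For reference, a stripped-down fine-tuning setup with Hugging Face's `Seq2SeqTrainer` might look roughly like the sketch below; the checkpoint, hyperparameters, and the single toy training example are placeholders, not the values from `config/training_config.toml` or the logic in `trainer.py`.

```python
# Rough sketch of fine-tuning a Flan-T5 checkpoint on prompted (input, target) pairs.
# Checkpoint, hyperparameters, and the toy example are placeholders, not the project's config.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def tokenize(example):
    """Tokenize one prompted pair into model inputs and labels."""
    model_inputs = tokenizer(example["input"], truncation=True, max_length=128)
    model_inputs["labels"] = tokenizer(example["target"], truncation=True,
                                       max_length=64)["input_ids"]
    return model_inputs

# Toy single-example dataset; the real data would come from the processing step above.
train_dataset = Dataset.from_list([
    {"input": "Tag the entities: wake me up at 7am",
     "target": "wake me up at [time : 7am]"},
]).map(tokenize, remove_columns=["input", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    learning_rate=3e-4,   # placeholder hyperparameters
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```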
The model's zero-shot capability was evaluated for intent tagging, entity tagging, and generating appropriate NLG responses without specific training for the datasets used. The evaluation involved assessing how accurately the model handled overlapping intents and ambiguous entity types.
The model performed well in distinguishing between different intents, particularly when there was minimal overlap. Examples of correctly predicted intents, even in ambiguous contexts, are provided.
The model faced challenges in tagging entities that had overlapping definitions, such as "date" vs. "timeofday". Examples are included to illustrate specific cases where the model struggled to disambiguate between similar entities.
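Where concrete numbers are reported, intent predictions can be scored against the gold labels in the usual way; the labels below are invented purely to show the mechanics.

```python
# Illustrative scoring of predicted vs. gold intents; the label values are invented.
from sklearn.metrics import classification_report

gold = ["set_alarm", "query_weather", "set_alarm", "remove_alarm"]
pred = ["set_alarm", "query_weather", "remove_alarm", "remove_alarm"]
print(classification_report(gold, pred, zero_division=0))
```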
Examples of generated responses are provided to evaluate the model's performance in generating coherent and contextually accurate outputs. These examples highlight both successful and problematic cases, particularly when ambiguity in the input led to unclear responses. The dataset for NLG responses consisted of columns including `domain`, `intent`, `annotated_utterance`, `api_call`, `api_response`, and `nlg_response`, allowing the model to understand and generate appropriate voice responses based on the given context.
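A hypothetical sketch of how these columns might be combined into a single NLG prompt, with the `nlg_response` column as the training target; the template wording and example values are assumptions, not the project's actual prompt format.

```python
# Hypothetical NLG prompt built from the described columns; template and values are invented.
def build_nlg_prompt(row: dict) -> str:
    """Combine the user request, the API call, and its response into one NLG prompt."""
    return (
        f"domain: {row['domain']} intent: {row['intent']}\n"
        f"user said: {row['annotated_utterance']}\n"
        f"api call: {row['api_call']}\n"
        f"api response: {row['api_response']}\n"
        "Write a short spoken reply:"
    )

example = {
    "domain": "alarm",
    "intent": "set_alarm",
    "annotated_utterance": "wake me up at [time : 7am]",
    "api_call": "set_alarm(time='7am')",          # invented example values
    "api_response": "alarm set for 7:00 AM",
    "nlg_response": "Okay, your alarm is set for 7 AM.",  # would serve as the target
}
print(build_nlg_prompt(example))
```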
The model’s struggles in ambiguous cases reflect the broader challenges associated with low annotator agreement. This reinforces that model performance is inherently constrained by data quality and consistency.
The strengths of the experiment include demonstrating the feasibility of a single encoder-decoder model for both NLU and NLG, offering a privacy-friendly, low-resource alternative to LLMs. However, its weaknesses lie in the handling of ambiguous cases, particularly when intents and entities overlap significantly.
The performance of the T5-based model is compared against traditional models like CRFs and template-based NLG. The benefits of reduced complexity and integration are highlighted, as well as the ability to achieve zero-shot results without extensive training on domain-specific data.
The benchmarking and data refinement work performed prior to this experiment, which included traditional classifiers such as Naive Bayes, Decision Trees, Random Forests, and CRFs, served as an insightful baseline. This provided context for understanding how well the advanced encoder-decoder model could perform compared to more conventional techniques.
These findings have important implications for open-source voice assistant projects. Specifically, they indicate that a single small model can achieve competitive results, which could lead to simpler, more efficient architectures that respect privacy and are feasible for home use.
The experiment successfully demonstrated that a small encoder-decoder model can perform zero-shot NLU and NLG, providing a viable solution for resource-constrained environments while offering reasonable performance.
Areas for future research include refining the model to better handle overlapping entities and intents, potentially through improved data annotation techniques or domain-specific fine-tuning. Entity tagging could also draw on CRF feature engineering, for example by experimenting with feature sets such as Brown clustering.
Improvements in data quality, especially around clearer annotation guidelines, could significantly enhance model performance. Additionally, exploring architectural modifications to better handle ambiguity could further improve outcomes in challenging scenarios. Building on the benchmarking of traditional methods, integrating additional features or leveraging hybrid models may also boost performance.
- **Data Cleaning and Benchmarking Results**: Data Benchmarking and Cleaning under Methodology, and Comparison to Traditional Methods under Discussion.
- **Literature Citations**: Encoder-Decoder Models vs. Decoder-Only Models under Background and Literature Review.
- **Model Examples**: Performance Analysis under Experimentation and Results.
- **Data Annotation Quality**: Challenges with Annotation and Domain Overlap under Background and Literature Review, and Human Agreement Challenges under Experimentation and Results.
- **Paper Suggestions**: Encoder-Decoder Models vs. Decoder-Only Models and Model Selection.
- **Future Work Improvements**: Future Directions and Potential Improvements under Conclusion and Future Work.
- **Benchmark Snips and Other Open-Source Taggers**: Comparison to Traditional Methods under Discussion, and Zero-Shot Evaluation under Experimentation and Results.

Here is a rough plan for doing all of this:
The plan is to finalize the analysis of the zero-shot NLU-NLG engine by using generative AI tools efficiently, breaking the work into manageable chunks, and prioritizing key tasks. The aim is to complete this within two weekends, each with approximately 4-8 hours of dedicated work, while potentially leveraging smaller time slots during weekdays.
Weekend 1 (8 hours total):
Weekend 2 (8 hours total):
The issues are: #5, #6, #7, #8, #9, #10, #11.
Note to self: I am using a notebook in my local repo called `report.ipynb` to combine it all together. I have very fine-grained todos there as well.
Once I am done, I will also need to delete my ovos intent benchmarking notebook.
Description
We created a model using mostly test data. We should document the results, including an analysis of them. For:
we will attempt to answer the following questions:
DoD