Here are some notes I have for writing this up:
The goal of this project is to demonstrate that a small-scale model can effectively handle both Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks while running efficiently on lower-resource hardware. This work aims to fill the gap between traditional approaches, which require multiple independent models and substantial manual intervention, and large language models (LLMs), which are computationally expensive and raise privacy concerns.
Currently, voice assistants commonly rely on traditional NLU/NLG methods that include separate models for tasks like intent tagging (using random forest or logistic regression), entity tagging (e.g., Conditional Random Fields (CRF)), and template-based NLG for slot filling. Tools like Snips or Mycroft require these traditional techniques to work cohesively, often resulting in complex pipelines and increased resource usage.
Some more recent approaches, such as those using LLMs, have gained traction for handling NLU and NLG tasks through zero-shot or few-shot capabilities. For example, Home Assistant offers integration with LLMs (e.g., OpenAI API) for these purposes. However, LLM-based solutions pose significant challenges for privacy-focused users and those who want to avoid recurring costs or do not have powerful GPU hardware for self-hosting.
The aim of this experiment is to develop an alternative that combines the best aspects of both approaches. Specifically, this project focuses on using a single small language model to achieve effective NLU and NLG, circumventing the limitations of traditional methods while avoiding the resource requirements of LLMs. The desired outcome is a proof-of-concept that could benefit open-source, privacy-conscious voice assistant users.
The prevalent state of NLU and NLG in voice assistants involves multiple models, often making use of machine learning algorithms like random forests, CRFs, and hand-written NLG templates. These methods provide robustness but are limited in flexibility and require significant manual effort to adapt to new intents or domains.
There is also increasing usage of LLMs for voice assistant applications, with LLMs capable of achieving higher levels of generalization. However, they come with significant drawbacks, such as heavy computational demands and privacy risks.
Recent research indicates that encoder-decoder models can outperform decoder-only models for tasks like intent and entity tagging, due to their efficient sequence-to-sequence handling. Research papers such as "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (T5 Paper) provide evidence that encoder-decoder architectures are well-suited for zero-shot tasks in NLU and NLG contexts.
A major challenge with NLU is annotation quality. Inconsistent annotations, caused by ambiguous intent or entity definitions, are difficult for models to overcome: if human annotators show low agreement on labels, models are likely to face similar difficulties, leading to lower performance on tasks like entity extraction and intent classification.
For this experiment, a T5-based model, specifically Flan-T5, was selected for its encoder-decoder architecture and its instruction-based fine-tuning. This setup offers a promising balance between capability and resource efficiency, enabling zero-shot NLU and NLG while remaining practical on low-resource hardware. A single model instance can perform both intent and entity tagging, reducing computational requirements, which is particularly suitable for hardware like a Raspberry Pi 4.
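As a rough illustration of the single-instance idea, the sketch below loads one Flan-T5 checkpoint with the Hugging Face `transformers` library and sends it both an NLU-style and an NLG-style prompt. The checkpoint name, prompt wording, and generation settings are assumptions for illustration, not the exact configuration used in this project.

```python
# Minimal sketch: one Flan-T5 instance handling both an NLU-style and an NLG-style prompt.
# The checkpoint, prompts, and generation settings are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # assumed small checkpoint suitable for constrained hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def run(prompt: str) -> str:
    """Tokenize a prompt, generate a reply, and decode it back to text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# NLU-style prompt (intent tagging) and NLG-style prompt, served by the same model instance.
print(run("Classify the intent of this utterance: wake me up at 7am"))
print(run("Write a short spoken reply confirming that an alarm was set for 7am."))
```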
The NLU evaluation dataset (https://github.com/xliuhw/NLU-Evaluation-Data) was used as the foundation for training and testing. This dataset contains inconsistencies in annotation, making it a realistic representation of the challenges faced when creating voice assistants for diverse intents and domains. Additionally, a separate repository was developed to benchmark and refine the dataset before using it for this experiment.
The data was filtered to focus on intents and entities that are suitable for proof-of-concept testing. The overlap between intents/entities was analyzed, and examples of inconsistent annotations were identified and addressed where feasible. This step ensured that the model was fine-tuned on meaningful examples, allowing for a better evaluation of its zero-shot capabilities.
The dataset was formatted into columns for `utterance`, `domain`, `intent`, and `annotated_utterance`. The `domain` represents the skill being used (e.g., "alarm"), while the `intent` describes the specific action (e.g., "set_alarm"). The `annotated_utterance` was used for entity tagging, structured as phrases with tagged entities (e.g., "wake me up at [time : 7am]"). Additional open-source datasets, such as the Snips dataset (https://github.com/snipsco/snips-nlu-metrics/tree/master/samples), were also used for testing.
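To make the column layout concrete, a couple of rows in this format might look like the following; the values are invented for illustration and are not taken from the dataset.

```python
# Hypothetical example rows in the described column format; values are invented for illustration.
example_rows = [
    {
        "utterance": "wake me up at 7am",
        "domain": "alarm",
        "intent": "set_alarm",
        "annotated_utterance": "wake me up at [time : 7am]",
    },
    {
        "utterance": "what is the weather like tomorrow",
        "domain": "weather",
        "intent": "query_weather",
        "annotated_utterance": "what is the weather like [date : tomorrow]",
    },
]
```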
An earlier phase of this project involved benchmarking and refining the dataset using a prototype NLU engine built with a simple intent classifier and CRFs for entity extraction. This phase aimed to improve the dataset quality and understand the performance of traditional NLU methods before moving to more advanced encoder-decoder models. The refinement and benchmarking efforts helped identify problematic areas in the dataset, leading to cleaner, more reliable training data for subsequent experiments.
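For context, baselines of this kind are commonly built with `sklearn-crfsuite`; the sketch below shows the general shape of such a CRF entity tagger with a toy feature set, not the exact features or hyperparameters used in the earlier benchmarking phase.

```python
# Rough sketch of a CRF entity tagger baseline using sklearn-crfsuite.
# The feature set, hyperparameters, and training example are illustrative only.
import sklearn_crfsuite

def word_features(tokens, i):
    """Simple per-token features: the lowercased word, a digit flag, and its neighbors."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.isdigit": word.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Tiny invented training example: tokens with BIO-style entity labels.
sentences = [["wake", "me", "up", "at", "7am"]]
labels = [["O", "O", "O", "O", "B-time"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```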
The training data was processed into prompted tasks for the model to learn. This process included:

- Formatting entity annotations either as tagged spans (e.g., `[time : 7am]`) or as slot representations (e.g., `0 0 0 0 time`).
- Including the `domain` and `intent` data.

The data processing configuration was specified in a configuration file (`config/training_data_processing_config.toml`), which allowed precise control over how data was transformed into model-compatible prompts.
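A minimal sketch of such a transformation, assuming hypothetical prompt templates (the real templates are controlled through `config/training_data_processing_config.toml`):

```python
# Hypothetical transformation of a dataset row into prompted (input, target) pairs.
# The prompt templates here are assumptions, not what the processing script actually emits.
def build_prompted_tasks(row: dict) -> list[tuple[str, str]]:
    """Turn one annotated row into intent-tagging and entity-tagging training pairs."""
    utterance = row["utterance"]
    return [
        (f"Classify the domain and intent: {utterance}",
         f"{row['domain']} {row['intent']}"),
        (f"Tag the entities: {utterance}",
         row["annotated_utterance"]),
    ]

print(build_prompted_tasks({
    "utterance": "wake me up at 7am",
    "domain": "alarm",
    "intent": "set_alarm",
    "annotated_utterance": "wake me up at [time : 7am]",
}))
```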
The model was fine-tuned using a combination of open-source NLU and NLG data. Details of the training setup include hardware specifications, the number of epochs, the learning rate, and the optimization strategies employed to make the model suitable for lower-resource hardware. Training was configured through `config/training_config.toml` and executed with a Python script (`trainer.py`). The training aimed to balance NLU tasks (intent/entity tagging) with NLG response generation, transforming the input data into effective prompted tasks for fine-tuning the T5-based model.
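For reference, a stripped-down fine-tuning setup with Hugging Face's `Seq2SeqTrainer` might look roughly like the sketch below; the checkpoint, hyperparameters, and the single toy training example are placeholders, not the values from `config/training_config.toml` or the logic in `trainer.py`.

```python
# Rough sketch of fine-tuning a Flan-T5 checkpoint on prompted (input, target) pairs.
# Checkpoint, hyperparameters, and the toy example are placeholders, not the project's config.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def tokenize(example):
    """Tokenize one prompted pair into model inputs and labels."""
    model_inputs = tokenizer(example["input"], truncation=True, max_length=128)
    model_inputs["labels"] = tokenizer(example["target"], truncation=True,
                                       max_length=64)["input_ids"]
    return model_inputs

# Toy single-example dataset; the real data would come from the processing step above.
train_dataset = Dataset.from_list([
    {"input": "Tag the entities: wake me up at 7am",
     "target": "wake me up at [time : 7am]"},
]).map(tokenize, remove_columns=["input", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    learning_rate=3e-4,   # placeholder hyperparameters
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```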
The model's zero-shot capability was evaluated for intent tagging, entity tagging, and generating appropriate NLG responses without specific training for the datasets used. The evaluation involved assessing how accurately the model handled overlapping intents and ambiguous entity types.
The model performed well in distinguishing between different intents, particularly when there was minimal overlap. Examples of correctly predicted intents, even in ambiguous contexts, are provided.
The model faced challenges in tagging entities that had overlapping definitions, such as "date" vs. "timeofday". Examples are included to illustrate specific cases where the model struggled to disambiguate between similar entities.
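Where concrete numbers are reported, intent predictions can be scored against the gold labels in the usual way; the labels below are invented purely to show the mechanics.

```python
# Illustrative scoring of predicted vs. gold intents; the label values are invented.
from sklearn.metrics import classification_report

gold = ["set_alarm", "query_weather", "set_alarm", "remove_alarm"]
pred = ["set_alarm", "query_weather", "remove_alarm", "remove_alarm"]
print(classification_report(gold, pred, zero_division=0))
```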
Examples of generated responses are provided to evaluate the model's performance in generating coherent and contextually accurate outputs. These examples highlight both successful and problematic cases, particularly when ambiguity in the input led to unclear responses. The dataset for NLG responses consisted of columns including `domain`, `intent`, `annotated_utterance`, `api_call`, `api_response`, and `nlg_response`, allowing the model to understand and generate appropriate voice responses based on the given context.
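A hypothetical sketch of how these columns might be combined into a single NLG prompt, with the `nlg_response` column as the training target; the template wording and example values are assumptions, not the project's actual prompt format.

```python
# Hypothetical NLG prompt built from the described columns; template and values are invented.
def build_nlg_prompt(row: dict) -> str:
    """Combine the user request, the API call, and its response into one NLG prompt."""
    return (
        f"domain: {row['domain']} intent: {row['intent']}\n"
        f"user said: {row['annotated_utterance']}\n"
        f"api call: {row['api_call']}\n"
        f"api response: {row['api_response']}\n"
        "Write a short spoken reply:"
    )

example = {
    "domain": "alarm",
    "intent": "set_alarm",
    "annotated_utterance": "wake me up at [time : 7am]",
    "api_call": "set_alarm(time='7am')",          # invented example values
    "api_response": "alarm set for 7:00 AM",
    "nlg_response": "Okay, your alarm is set for 7 AM.",  # would serve as the target
}
print(build_nlg_prompt(example))
```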
The model’s struggles in ambiguous cases reflect the broader challenges associated with low annotator agreement. This reinforces that model performance is inherently constrained by data quality and consistency.
The strengths of the experiment include demonstrating the feasibility of a single encoder-decoder model for both NLU and NLG, offering a privacy-friendly, low-resource alternative to LLMs. However, its weaknesses lie in the handling of ambiguous cases, particularly when intents and entities overlap significantly.
The performance of the T5-based model is compared against traditional models like CRFs and template-based NLG. The benefits of reduced complexity and integration are highlighted, as well as the ability to achieve zero-shot results without extensive training on domain-specific data.
The benchmarking and data refinement work performed prior to this experiment, which included traditional classifiers such as Naive Bayes, Decision Trees, Random Forests, and CRFs, served as an insightful baseline. This provided context for understanding how well the advanced encoder-decoder model could perform compared to more conventional techniques.
These findings have important implications for open-source voice assistant projects. Specifically, they indicate that a single small model can achieve competitive results, which could lead to simpler, more efficient architectures that respect privacy and are feasible for home use.
The experiment successfully demonstrated that a small encoder-decoder model can perform zero-shot NLU and NLG, providing a viable solution for resource-constrained environments while offering reasonable performance.
Areas for future research include refining the model to better handle overlapping entities and intents, potentially through improved data annotation techniques or domain-specific fine-tuning. Entity tagging could also draw on CRF feature engineering, for example by experimenting with feature sets such as Brown clustering.
Improvements in data quality, especially around clearer annotation guidelines, could significantly enhance model performance. Additionally, exploring architectural modifications to better handle ambiguity could further improve outcomes in challenging scenarios. Building on the benchmarking of traditional methods, integrating additional features or leveraging hybrid models may also boost performance.
- **Data Cleaning and Benchmarking Results**: Data Benchmarking and Cleaning under Methodology, and Comparison to Traditional Methods under Discussion.
- **Literature Citations**: Encoder-Decoder Models vs. Decoder-Only Models under Background and Literature Review.
- **Model Examples**: Performance Analysis under Experimentation and Results.
- **Data Annotation Quality**: Challenges with Annotation and Domain Overlap under Background and Literature Review, and Human Agreement Challenges under Experimentation and Results.
- **Paper Suggestions**: Encoder-Decoder Models vs. Decoder-Only Models and Model Selection.
- **Future Work Improvements**: Future Directions and Potential Improvements under Conclusion and Future Work.
- **Benchmark Snips and Other Open-Source Taggers**: Comparison to Traditional Methods under Discussion, and Zero-Shot Evaluation under Experimentation and Results.

Here is a rough plan for doing all of this:
The plan is to finalize the analysis of the zero-shot NLU-NLG engine by using generative AI tools efficiently, breaking the work into manageable chunks, and prioritizing key tasks. The aim is to complete this within two weekends, each with approximately 4-8 hours of dedicated work, while potentially leveraging smaller time slots during weekdays.
Weekend 1 (8 hours total):
Weekend 2 (8 hours total):
The issues are: #5, #6, #7, #8, #9, #10, #11.
Note to self: I am using a notebook in my local repo called `report.ipynb` to combine it all together. I have very fine-grained todos there as well.
Once I am done, I will also need to delete my ovos intent benchmarking notebook.
Description
We created a model using mostly test data. We should document the results, including an analysis of them. For:
we will attempt to answer the following questions:
DoD