Varshan2003 closed 4 months ago
To address the challenge of classifying Jira tickets into categories like FP, Owner reassignments, and others with a limited dataset, consider the following strategies:
Enhance Data Representation: Use LlamaIndex to build a sophisticated representation of your dataset, improving semantic understanding and classification accuracy.
Data Augmentation: Generate synthetic data that is contextually richer to augment your dataset, leading to a more robust training set.
Fine-tune Pre-trained Models: Fine-tune a model such as GPT-3.5-turbo on your labeled tickets. Starting from a pre-trained model helps when labeled data is limited.
Few-shot Learning: Utilize few-shot learning with models like GPT-3.5-turbo, providing a few examples of each category to improve classification accuracy.
Active Learning: Incorporate an active learning loop where the model's predictions are reviewed and the most informative examples are added to the training set.
Semi-supervised Learning: Apply techniques such as pseudo-labeling to use the model's predictions on unlabeled data to expand your training dataset.
Optimize Ensemble Methods: Create an ensemble of different model architectures to improve robustness and accuracy.
Hyperparameter Tuning and Evaluation: Adjust parameters and rigorously evaluate the model's performance to find the optimal configuration.
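The active learning and pseudo-labeling items above both come down to splitting unlabeled tickets by model confidence. Here is a minimal sketch of that split, assuming you already have per-class probabilities from some classifier. The function name, the label names, the 0.9 threshold, and the sample probabilities are all illustrative assumptions, not part of LlamaIndex.

```python
from typing import Dict, List, Tuple

Probs = Dict[str, float]  # label -> predicted probability for one ticket


def split_by_confidence(
    predictions: Dict[str, Probs], threshold: float = 0.9
) -> Tuple[Dict[str, str], List[str]]:
    """Split unlabeled tickets into pseudo-labeled ones and a human review queue.

    Tickets whose top probability is >= threshold receive a pseudo-label.
    The rest are returned sorted from least to most confident, so the most
    informative tickets are reviewed first.
    """
    pseudo_labels: Dict[str, str] = {}
    uncertain: List[Tuple[float, str]] = []
    for ticket_id, probs in predictions.items():
        label, confidence = max(probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            pseudo_labels[ticket_id] = label
        else:
            uncertain.append((confidence, ticket_id))
    review_queue = [ticket_id for _, ticket_id in sorted(uncertain)]
    return pseudo_labels, review_queue


# Hypothetical model outputs for three unlabeled tickets.
predictions = {
    "T-1": {"FP": 0.95, "owner_reassignment": 0.03, "other": 0.02},
    "T-2": {"FP": 0.40, "owner_reassignment": 0.35, "other": 0.25},
    "T-3": {"FP": 0.10, "owner_reassignment": 0.60, "other": 0.30},
}
pseudo_labels, review_queue = split_by_confidence(predictions)
print(pseudo_labels)  # tickets confident enough to pseudo-label
print(review_queue)   # tickets to send to a human, least confident first
```

With only about 200 examples per category, a high threshold is probably the safer choice, since wrong pseudo-labels get baked into the next training round.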
For detailed implementation, refer to the LlamaIndex documentation and examples in the repository. These strategies, combined with the capabilities of LlamaIndex and advanced models, can significantly enhance your ability to classify Jira tickets accurately, even with a limited dataset.
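For the few-shot suggestion, the prompt can be assembled from a handful of labeled tickets per category. The sketch below only builds the prompt string and does not call any model. The label names, the helper name, and the example tickets are placeholders assumed for illustration.

```python
from typing import List, Tuple

# Assumed label names for the three categories mentioned in the question.
LABELS = ["FP", "owner_reassignment", "other"]


def build_few_shot_prompt(examples: List[Tuple[str, str]], ticket_text: str) -> str:
    """Build a classification prompt from (ticket_text, label) examples."""
    lines = [
        "Classify the Jira ticket into exactly one of: " + ", ".join(LABELS) + ".",
        "",
    ]
    for text, label in examples:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        lines += [f"Ticket: {text}", f"Label: {label}", ""]
    # End with the unlabeled ticket so the model completes the label.
    lines += [f"Ticket: {ticket_text}", "Label:"]
    return "\n".join(lines)


# Placeholder examples; in practice, pick a few real tickets per category.
examples = [
    ("Alert fired but the service was healthy.", "FP"),
    ("Please move this to the storage team.", "owner_reassignment"),
]
prompt = build_few_shot_prompt(examples, "Scanner flagged a test file by mistake.")
print(prompt)
```

The resulting string can be sent to whichever LLM client you already use. Restricting the model to a fixed label list makes its answers easier to parse and score.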
An LLM is maybe overkill, but could definitely work. I would just use a pydantic class with a prompt:
from typing import Literal

from llama_index.program.openai import OpenAIPydanticProgram
from pydantic.v1 import BaseModel, Field


class TicketClassification(BaseModel):
    """Structured output the LLM is asked to fill in for one ticket."""

    owner: str = Field(description="The most likely owner of the ticket based on the ticket data.")
    # Swap in your own categories here.
    ticket_type: Literal["feature", "bug", "documentation"] = Field(description="The type of the ticket.")


prompt_template_str = """\
Given the ticket data below, classify the ticket and provide an owner.
Ticket Data:
{ticket_data}
"""

program = OpenAIPydanticProgram.from_defaults(
    output_cls=TicketClassification, prompt_template_str=prompt_template_str, verbose=True
)

# Placeholder: replace with the text of the ticket you want to classify.
ticket_data = "<ticket summary, description, and comments>"
output = program(ticket_data=ticket_data)
print(output.owner)
print(output.ticket_type)
More examples here: https://docs.llamaindex.ai/en/stable/module_guides/querying/structured_outputs/pydantic_program/
As for ML models, have you looked into models like Gliner or NuNER? https://huggingface.co/urchade/gliner_base https://huggingface.co/models?sort=trending&search=NuNER
Going to close this out though, as the above are really two of the best suggestions
Question
Hello all, we are having trouble classifying Jira tickets as FP, owner reassignment, or other. Our data is also limited (about 200 comments per category). We tried different ML models and a RAG approach using llama-index. How can we solve this issue?