[Question]: How to classify the data using llama-index(using LLM)

Varshan2003 commented 4 months ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

Hello all, we are having trouble classifying Jira tickets as FP, Owner reassignments, and Other. Our data is also limited (about 200 comments per category). We tried different ML models and a RAG approach using llama-index. How can we solve this issue?

dosubot[bot] commented 4 months ago

To address the challenge of classifying Jira tickets into categories like FP, Owner reassignments, and others with a limited dataset, consider the following strategies:

Enhance Data Representation: Use LlamaIndex to build a sophisticated representation of your dataset, improving semantic understanding and classification accuracy.
Data Augmentation: Generate synthetic data that is contextually richer to augment your dataset, leading to a more robust training set.
Fine-tune Pre-trained Models: Leverage models like GPT-3.5-turbo for fine-tuning on your specific dataset, which is beneficial for limited datasets.
Few-shot Learning: Utilize few-shot learning with models like GPT-3.5-turbo, providing a few examples of each category to improve classification accuracy.
Active Learning: Incorporate an active learning loop where the model's predictions are reviewed and the most informative examples are added to the training set.
Semi-supervised Learning: Apply techniques such as pseudo-labeling to use the model's predictions on unlabeled data to expand your training dataset.
Optimize Ensemble Methods: Create an ensemble of different model architectures to improve robustness and accuracy.
Hyperparameter Tuning and Evaluation: Adjust parameters and rigorously evaluate the model's performance to find the optimal configuration.
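
As a rough illustration of the few-shot item in the list above, the sketch below only assembles the prompt text; it does not call any LLM, so the resulting string would still have to be sent to whichever model is in use. The category names come from the question, while the example comments and the function name `build_few_shot_prompt` are invented for illustration.

```python
from typing import Dict, List

# Category names are taken from the question; everything else is hypothetical.
LABELS = ["FP", "Owner reassignment", "Other"]


def build_few_shot_prompt(examples: Dict[str, List[str]], new_comment: str) -> str:
    """Assemble a classification prompt with labeled examples for each category."""
    lines = [
        "Classify the Jira comment into exactly one of these categories: "
        + ", ".join(LABELS)
        + ".",
        "Answer with the category name only.",
        "",
    ]
    # Emit every labeled example as a Comment/Category pair.
    for label in LABELS:
        for text in examples.get(label, []):
            lines.append(f"Comment: {text}")
            lines.append(f"Category: {label}")
            lines.append("")
    # Leave the final category open for the model to complete.
    lines.append(f"Comment: {new_comment}")
    lines.append("Category:")
    return "\n".join(lines)


# Invented sample comments, not real ticket data.
EXAMPLES = {
    "FP": ["Scanner flagged this, but the code path is unreachable."],
    "Owner reassignment": ["Moving this to the platform team, it is not ours."],
    "Other": ["Any update on the release date?"],
}

prompt = build_few_shot_prompt(EXAMPLES, "This alert is a false alarm.")
print(prompt)
```

With about 200 labeled comments per category, a handful of examples per label in the prompt leaves most of the data free for checking how well the prompt performs.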

For detailed implementation, refer to the LlamaIndex documentation and examples in the repository. These strategies, combined with the capabilities of LlamaIndex and advanced models, can significantly enhance your ability to classify Jira tickets accurately, even with a limited dataset.
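
For the evaluation step mentioned in the list, one possible way to get comparable numbers across approaches is a stratified hold-out split plus per-class precision and recall. This is a stdlib-only sketch; the helper names `stratified_split` and `per_class_metrics`, the 20% test fraction, and the dummy data are assumptions, not part of LlamaIndex.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (comment text, label)


def stratified_split(
    items: Sequence[Example], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Split so that each label keeps roughly the same share in train and test."""
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for text, label in items:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train: List[Example] = []
    test: List[Example] = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def per_class_metrics(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall separately for every label."""
    report: Dict[str, Dict[str, float]] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        true_pos = sum(t == label and p == label for t, p in pairs)
        false_pos = sum(t != label and p == label for t, p in pairs)
        false_neg = sum(t == label and p != label for t, p in pairs)
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        report[label] = {"precision": precision, "recall": recall}
    return report


# Dummy dataset shaped like the one in the question: 200 comments per category.
data = [
    (f"comment {i}", label)
    for label in ("FP", "Owner reassignment", "Other")
    for i in range(200)
]
train_set, test_set = stratified_split(data)
print(len(train_set), len(test_set))
```

Per-class numbers matter here because overall accuracy can hide a category that the classifier rarely gets right.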

#### Details

**Sources**
- [I want to use Llama-index for document classification. How can I use multiple indexes with it?](https://github.com/run-llama/llama_index/issues/1986)
- [Custom "ponderation" or "bias" of information to better reflect "information value"](https://github.com/run-llama/llama_index/issues/264)
- [llama-index-integrations/readers/llama-index-readers-jira/llama_index/readers/jira/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-jira/llama_index/readers/jira/base.py)
- [llama-datasets/docugami_kg_rag/sec_10_q/llamaindex_baseline.py](https://github.com/run-llama/llama_index/blob/main/llama-datasets/docugami_kg_rag/sec_10_q/llamaindex_baseline.py)
- [llama-datasets/eval_llm_survey_paper/llamaindex_baseline.py](https://github.com/run-llama/llama_index/blob/main/llama-datasets/eval_llm_survey_paper/llamaindex_baseline.py)
- [docs/docs/optimizing/evaluation/evaluation.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/optimizing/evaluation/evaluation.md)
- [docs/docs/optimizing/evaluation/e2e_evaluation.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/optimizing/evaluation/e2e_evaluation.md)
- [llama-index-integrations/evaluation/llama-index-evaluation-tonic-validate/llama_index/evaluation/tonic_validate/retrieval_precision.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/evaluation/llama-index-evaluation-tonic-validate/llama_index/evaluation/tonic_validate/retrieval_precision.py)
- [llama-index-integrations/evaluation/llama-index-evaluation-tonic-validate/llama_index/evaluation/tonic_validate/augmentation_precision.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/evaluation/llama-index-evaluation-tonic-validate/llama_index/evaluation/tonic_validate/augmentation_precision.py)
- [llama-index-core/llama_index/core/evaluation/benchmarks/hotpotqa.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/benchmarks/hotpotqa.py)
- [llama-index-core/llama_index/core/llama_dataset/rag.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/llama_dataset/rag.py)
- [llama-index-networks/examples/demo/contributor-2/contributor_2/app_query_engine.py](https://github.com/run-llama/llama_index/blob/main/llama-index-networks/examples/demo/contributor-2/contributor_2/app_query_engine.py)
- [llama-index-legacy/llama_index/legacy/llama_dataset/rag.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llama_dataset/rag.py)

logan-markewich commented 4 months ago

An LLM is maybe overkill, but it could definitely work. I would just use a pydantic class with a prompt:

from typing import Literal

from llama_index.program.openai import OpenAIPydanticProgram
from pydantic.v1 import BaseModel, Field


class TicketClassification(BaseModel):
    """Structured output that the LLM is asked to fill in."""

    owner: str = Field(description="The most likely owner of the ticket based on the ticket data.")
    ticket_type: Literal["feature", "bug", "documentation"] = Field(description="The type of the ticket.")


prompt_template_str = """\
Given the ticket data below, classify the ticket and provide an owner.

Ticket Data:
{ticket_data}
"""

# Build a program that prompts the LLM and parses its reply into TicketClassification.
program = OpenAIPydanticProgram.from_defaults(
    output_cls=TicketClassification, prompt_template_str=prompt_template_str, verbose=True
)

# Hypothetical sample input; replace with the text of a real ticket.
ticket_data = "Login page returns a 500 error after the latest deploy."

output = program(ticket_data=ticket_data)
print(output.owner)
print(output.ticket_type)

More examples here: https://docs.llamaindex.ai/en/stable/module_guides/querying/structured_outputs/pydantic_program/
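
The ticket types in the snippet above are generic examples. For the categories in the original question, the allowed values would presumably change to something like the `Literal` below. The sketch is a stdlib-only stand-in that shows the kind of check a `Literal` field performs on a model's JSON reply, so it runs without an API key; `TicketLabel`, `parse_reply`, the `reason` field, and the exact label spellings are all assumptions.

```python
import json
from dataclasses import dataclass
from typing import Literal, get_args

# Categories from the original question; the exact spelling is an assumption.
Category = Literal["FP", "Owner reassignment", "Other"]


@dataclass
class TicketLabel:
    category: str
    reason: str  # hypothetical extra field, handy when reviewing predictions


def parse_reply(raw: str) -> TicketLabel:
    """Parse a JSON reply and reject any category outside the allowed set."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    if category not in get_args(Category):
        raise ValueError(f"unexpected category: {category!r}")
    return TicketLabel(category=category, reason=str(data.get("reason", "")))


# Hand-written reply standing in for real model output.
label = parse_reply('{"category": "FP", "reason": "Alert on unreachable code."}')
print(label.category)
```

Rejecting out-of-set labels early keeps a misbehaving prompt from silently adding a fourth category to the results.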

As for ML models, have you looked into models like GLiNER or NuNER? https://huggingface.co/urchade/gliner_base https://huggingface.co/models?sort=trending&search=NuNER

logan-markewich commented 4 months ago

Going to close this out though, as the above are really two of the best suggestions

run-llama / llama_index

[Question]: How to classify the data using llama-index(using LLM) #13355