manisnesan / fastchai

Repository capturing deep learning & NLP experiments using fastai & PyTorch
Apache License 2.0

Addressing Lack of Data issues in ML projects #66

Open manisnesan opened 4 months ago

manisnesan commented 4 months ago

Below is the initial user prompt, optimized using the Prompt Engineering GPT.

Initial User Prompt

Act as a machine learning expert practitioner. I am working on improving the dataset for a text classification problem with more than 100 targets. Some of the targets don't have enough data samples for the model to learn from. Think step by step and provide techniques to explore as alternatives. For example, creating synthetic data using an LLM is one such alternative. Provide the output as a bulleted list.

Revised User Prompt

Assume the role of a machine learning expert practitioner. My project involves enhancing dataset quality for a text classification challenge, targeting over 100 categories. Some categories suffer from insufficient data samples, impacting the model's learning efficiency. Considering this, please outline a step-by-step strategy to address the data imbalance. Focus on methods like data augmentation, including synthetic data generation with LLMs, and consider other innovative techniques suitable for text data. Present your recommendations as a bulleted list, detailing each approach's advantages and potential limitations.

System Generated

When addressing the issue of insufficient data samples for certain targets in a text classification problem with more than 100 targets, it's crucial to adopt a strategic approach to dataset improvement. Several techniques are worth considering, starting with creating synthetic data using large language models (LLMs).
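As a concrete illustration of the synthetic-data approach, here is a minimal sketch that prompts an LLM to generate extra training examples for an underrepresented label. It assumes the `openai>=1.0` Python client with an API key in the environment; the model name, label, and seed examples are hypothetical placeholders, not part of the original discussion.

```python
# Sketch: generate synthetic training examples for a low-resource label with an LLM.
# Assumes the openai>=1.0 Python client and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def synthesize_examples(label: str, seed_examples: list[str], n: int = 20) -> list[str]:
    """Ask the model for n new texts belonging to `label`, seeded with real examples."""
    prompt = (
        "You are generating training data for a text classifier.\n"
        f"Label: {label}\n"
        "Real examples:\n" + "\n".join(f"- {ex}" for ex in seed_examples) + "\n"
        f"Write {n} new, diverse examples of this label, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-completion model could be used here
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,      # higher temperature encourages more varied outputs
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("- ").strip() for line in lines if line.strip()]

# Example usage (hypothetical label and seed texts):
# new_texts = synthesize_examples("billing_dispute",
#                                 ["I was charged twice", "Refund not issued"], n=20)
```

Generated texts would still need deduplication and spot-checking before being mixed into the training set, which is exactly the diversity and accuracy caveat noted below.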

Each of these techniques has its strengths and considerations. For example, while synthetic data generation with LLMs can quickly increase your dataset size, it's essential to ensure the synthetic data's diversity and accuracy. Similarly, data augmentation methods can introduce variability but must be implemented carefully to avoid altering the meaning of the text in a way that misleads the model. It's often effective to combine several of these strategies, carefully monitoring model performance and data quality throughout the process.
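To make the augmentation caveat concrete, here is a small, library-free sketch of two EDA-style, label-preserving perturbations (random word swap and random word deletion). The perturbation rates are illustrative assumptions; in practice they are tuned so the augmented text keeps its original meaning and label.

```python
# Sketch: simple EDA-style text augmentation (random swap / random deletion).
# Perturbations are kept small so the label of the text is unlikely to change,
# reflecting the caution above about not altering the meaning of the text.
import random

def random_swap(words: list[str], n_swaps: int = 1) -> list[str]:
    """Swap two random word positions n_swaps times."""
    words = words.copy()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words: list[str], p: float = 0.1) -> list[str]:
    """Drop each word with probability p, always keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def augment(text: str, n_copies: int = 2) -> list[str]:
    """Produce n_copies lightly perturbed variants of `text`."""
    words = text.split()
    return [
        " ".join(random_deletion(random_swap(words, n_swaps=1), p=0.1))
        for _ in range(n_copies)
    ]

# Example usage:
# augment("the customer was charged twice for the same order")
```

Such perturbations are cheap, but aggressive swapping or deletion can change the meaning of short texts, so the number of augmented copies per original example is usually capped and the effect validated on held-out performance.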