Dataset: Collection and Description

Team-thedatatribune commented 2 years ago

Dataset Requirements 📦📋

TL;DR 🥱

This issue is a great starting point for beginners in the open-source community. Here you can:

share authentic data
contribute new APIs for collecting such data
provide original scripts (in any language you prefer) for data collection and/or preparation, such as cleaning

Issue Description:

In the context of the dyPixa project, this task revolves around the crucial need to gather and comprehensively document datasets for training and testing the machine learning models. This issue addresses the following key aspects:

Data Collection Scripts: Define a systematic approach, with Python (preferred) code, for sourcing diverse datasets. This may include acquiring text data from social media, product reviews, and news articles, as well as images with associated sentiments from public image repositories.
Dataset Documentation: Document each dataset (in #7, once uploaded), or raise concerns about existing or required datasets, covering dataset specifics, source, size, language distribution, and preprocessing. Refer to Issue #7 for detailed documentation guidelines.
Data Quality Assurance: Ensure dataset integrity and consistency, and confirm that the data is taken from authentic sources.
Multilingual Considerations: Explore strategies for multilingual datasets.
Collaboration with Contributors: Engage contributors in dataset sourcing and curation.
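
As a starting point for the documentation step, here is a minimal sketch of how a contributor might summarize a dataset's size, language distribution, and sources before writing it up. The record fields `text`, `source`, and `lang` are assumptions made for illustration, not a schema the project has defined, and the sample records are invented.

```python
from collections import Counter


def summarize_dataset(records):
    """Return size, per-language counts, and per-source counts for a list of records."""
    # Records missing a field are counted under "unknown" rather than dropped.
    langs = Counter(r.get("lang", "unknown") for r in records)
    sources = Counter(r.get("source", "unknown") for r in records)
    return {
        "size": len(records),
        "language_distribution": dict(langs),
        "source_distribution": dict(sources),
    }


# Hypothetical sample records, invented for this example only.
records = [
    {"text": "Great phone, love the camera", "source": "product_review", "lang": "en"},
    {"text": "यह फिल्म बहुत अच्छी थी", "source": "social_media", "lang": "hi"},
    {"text": "Markets closed higher today", "source": "news", "lang": "en"},
]

summary = summarize_dataset(records)
print(summary)
```

The resulting dictionary could be pasted into the dataset description, whatever format Issue #7 ends up asking for.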

Types of Data Needed:

For the NLP and color suggestion models to be highly usable and effective, the following types of data should be considered:

Text Data:
- Social media posts
- Product reviews
- News articles
- Sentiment-labeled text in English and Hindi
- Multilingual text data to enhance language support
Image Data:
- Images with associated sentiment labels
- Diverse images representing a wide range of emotions
- Abstract images showcasing various color combinations

By addressing these components and collecting the appropriate types of data, this issue will lay the foundation for robust machine learning model development and further enhancements in the dyPixa project. Your contributions here will greatly advance the project's capabilities. 🚀🌈

dharmraj617 commented 9 months ago

Hey, I am currently working on ML applications. I have some experience in data collection. Please assign this issue to me.

Addy0000 commented 8 months ago

heya, i'd like to work on writing python scripts for collecting data.

Team-thedatatribune commented 8 months ago

@dharmraj617, we require a diverse dataset of poetic content gathered from various platforms, including:

Social media platforms such as Twitter.
News editorial sections.
Haiku poetry, and more.

Your assistance in creating this dataset would be greatly appreciated, with the following key considerations in mind:

Each data point (in this case, poems) should be concise, consisting of no more than 3-4 lines.
We are primarily focused on English poems.
Ensure proper data cleaning, such as removing emojis and extraneous characters.
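
The considerations above could be sketched roughly as follows. This is only an illustration under two assumptions: that "extraneous characters" means anything outside basic ASCII letters, digits, whitespace, and common punctuation, and that the upper bound on length is 4 lines. The names `clean_poem` and `MAX_LINES` are hypothetical.

```python
import re

# Assumption: keep only ASCII letters, digits, whitespace, and common punctuation.
# Everything else (including emojis and variation selectors) is removed.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?'\"()\-]")

# Assumption: "no more than 3-4 lines" is read as a hard cap of 4 lines.
MAX_LINES = 4


def clean_poem(raw):
    """Return the cleaned poem, or None if it is empty or longer than MAX_LINES."""
    lines = []
    for line in raw.splitlines():
        line = _DISALLOWED.sub("", line)
        # Collapse runs of spaces/tabs left behind by removed characters.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    if not lines or len(lines) > MAX_LINES:
        return None
    return "\n".join(lines)


cleaned = clean_poem("rain on the window 🌧️\nsoft and slow")
print(cleaned)
```

Note that an ASCII-only filter like this also strips accented letters and curly quotes, so it would need loosening if such characters should be kept.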

For further discussion and information, please join the dyPixa Discord server. We look forward to your valuable contributions! 🙌

ravi-prakash1907 commented 8 months ago

@Addy0000, we currently have a program (here) that's been trained on go_emotions and can classify any given (English) text into one of 28 different emotions.

Now, we're on an exciting new mission. We need a dataset to generate and recommend colors for each of these sentiments. It would be fantastic if you could contribute by providing:

Images/Thumbnails corresponding to each of the 28 emotions.
The finest color sets corresponding to each emotion (at least 5 for each emotion).
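
To make the "at least 5 for each emotion" requirement easy to check, a contribution could be validated with something like the sketch below. The data layout (a dict mapping each emotion to a list of color sets, with colors as `#RRGGBB` hex strings) is an assumption for illustration, not a format the project has fixed, and the sample palettes are placeholders. Only two labels are used here for brevity.

```python
import re

# Assumption: colors are stored as 6-digit hex strings such as "#FFD700".
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")
MIN_SETS_PER_EMOTION = 5


def validate_color_sets(color_sets, emotions):
    """Return a list of problems found; an empty list means the data passes."""
    problems = []
    for emotion in emotions:
        sets = color_sets.get(emotion, [])
        if len(sets) < MIN_SETS_PER_EMOTION:
            problems.append(
                f"{emotion}: {len(sets)} color sets, need at least {MIN_SETS_PER_EMOTION}"
            )
        for i, palette in enumerate(sets):
            bad = [c for c in palette if not HEX_COLOR.fullmatch(c)]
            if bad:
                problems.append(f"{emotion}: set {i} has invalid colors {bad}")
    return problems


# Placeholder data: "joy" has five sets, "anger" has one set with a bad value.
sample = {
    "joy": [["#FFD700", "#FFA500"]] * 5,
    "anger": [["#8B0000", "red"]],
}

problems = validate_color_sets(sample, ["joy", "anger"])
for p in problems:
    print(p)
```

Running the same check over the full list of 28 labels would flag any emotion that is missing or under-covered.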

For a detailed description, I recommend visiting issue #58.

You can find the complete list of all 28 emotions at https://huggingface.co/SamLowe/roberta-base-go_emotions. 🎨

I'll assign you the issue if you're interested.

Addy0000 commented 8 months ago

@ravi-prakash1907 i went through it, would like to work on it

thedatatribune / dyPixa