neokd / DataStorehouse

DataStoreHouse is an open-source project that aims to create a collaborative platform for gathering and sharing a wide variety of datasets. It provides a centralised repository where individuals and organisations can contribute, discover, and collaborate on diverse datasets for various domains.
https://datash.vercel.app
MIT License
18 stars, 22 forks

Datasets for LLMs #125

Closed (neokd closed this issue 1 year ago)

neokd commented 1 year ago

Description

Add datasets that can be used to fine-tune LLMs. This issue is open to any type of dataset for the respective LLMs (LLAMA2, MISTRAL, etc.).

Expected Behaviour

The datasets should be large and should follow the formats expected by the respective LLMs.

VigneshRamanathan101 commented 1 year ago

@neokd do we need our own datasets, or can we refer to external datasets?

I found some external datasets on Kaggle and other sources.

neokd commented 1 year ago

We can include datasets from external sources too.

VigneshRamanathan101 commented 1 year ago

> We can include datasets from external sources too.

I have a few external links. How do you want me to add them?

  1. Add a README in a new folder, /Datastore/StoreHouse/LLM, and put all the references in it.
  2. Download the datasets and add them to this repo (which could make the repo larger and may also cause licensing conflicts).

I would suggest the first approach.

neokd commented 1 year ago

Share them in this issue; we'll check them and add them to the repo.

VigneshRamanathan101 commented 1 year ago

https://huggingface.co/datasets/Hello-SimpleAI/HC3
https://huggingface.co/datasets/MohamedRashad/ChatGPT-prompts/blob/main/train.csv
https://github.com/Zjh-819/LLMDataHub (a GitHub repo with large LLM datasets)
https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam

VigneshRamanathan101 commented 1 year ago

Can you assign it to me and close the issue?

VigneshRamanathan101 commented 1 year ago

@neokd can we close this issue, or do we need anything more?

neokd commented 1 year ago

https://huggingface.co/datasets/Hello-SimpleAI/HC3

You can add it as a README or some other file referencing the links, similar to the https://github.com/Zjh-819/LLMDataHub repo that you shared.
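
For illustration, a reference file like this could be generated from the links shared in this thread. This is only a sketch: `render_readme`, the dataset labels (taken from the URL paths), and the table layout are assumptions, not something the repo defines.

```python
# Hypothetical helper: names and layout are illustrative, not part of DataStoreHouse.
# Labels are derived from the URL paths of the links shared in this issue.
DATASETS = [
    ("HC3", "https://huggingface.co/datasets/Hello-SimpleAI/HC3"),
    ("ChatGPT-prompts", "https://huggingface.co/datasets/MohamedRashad/ChatGPT-prompts/blob/main/train.csv"),
    ("LLMDataHub", "https://github.com/Zjh-819/LLMDataHub"),
    ("additional-train-data-for-llm-science-exam", "https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam"),
]


def render_readme(datasets, title="LLM Datasets"):
    """Return Markdown text listing each (label, url) pair in a two-column table."""
    lines = [f"# {title}", "", "| Dataset | Source |", "| --- | --- |"]
    for name, url in datasets:
        lines.append(f"| {name} | {url} |")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the README text; redirect to a file in the folder of your choice.
    print(render_readme(DATASETS))
```

Keeping only links in the file, rather than copies of the data, avoids growing the repo and sidesteps the licensing concern raised earlier.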

VigneshRamanathan101 commented 1 year ago

Sure, @neokd. I will create the README.

VigneshRamanathan101 commented 1 year ago

Created a new PR for this issue.