redhat-intel-ai-hackathon-raft-rag / data-processing

Repository for the hackathon ( https://redhat-intel.devpost.com )
Apache License 2.0

collect raw datasets for fine tuning #1

Closed. EichiUehara closed 2 days ago

EichiUehara commented 3 weeks ago

requirement

We can use any text dataset that can be converted to question-and-answer format.

task

Define a script to download the datasets for training.

medical question and answer dataset

https://huggingface.co/datasets/lavita/ChatDoctor-iCliniq
https://huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k
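A minimal sketch of what such a download script could look like, assuming the Hugging Face `datasets` package is installed and that each dataset exposes a `train` split (neither is confirmed in this issue). The helper names `write_jsonl` and `download_dataset` are hypothetical, and the default output directory reuses the `dataset/rag` path given in this issue.

```python
import json
from pathlib import Path

# Hugging Face dataset IDs taken from the two links above.
DATASET_IDS = [
    "lavita/ChatDoctor-iCliniq",
    "lavita/ChatDoctor-HealthCareMagic-100k",
]


def write_jsonl(rows, path):
    """Write an iterable of dicts to `path`, one JSON object per line.

    Returns the number of rows written.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


def download_dataset(dataset_id, out_dir="dataset/rag", split="train"):
    """Download one dataset from the Hugging Face Hub and save it as JSONL.

    Assumption: the dataset has a split named `split` ("train" by default).
    """
    # Imported lazily so write_jsonl can be used without the package.
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset(dataset_id, split=split)
    # "lavita/ChatDoctor-iCliniq" -> "lavita__ChatDoctor-iCliniq.jsonl"
    out_path = Path(out_dir) / (dataset_id.replace("/", "__") + ".jsonl")
    write_jsonl(ds, out_path)
    return out_path
```

To fetch both datasets, call `download_dataset(dataset_id)` for each entry in `DATASET_IDS`. The raw columns are written as-is; mapping them to a common question/answer schema would be a separate step, since the column names differ between datasets.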

medical announcement

https://www.cdc.gov/floods/about/index.html
https://www.who.int/

medical news

https://www.healthline.com/
https://www.npr.org/sections/health/
https://www.ama-assn.org/
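The announcement and news sources above are web pages rather than packaged datasets, so they would need to be reduced to plain text before they can be converted to question-and-answer format. The following is only a sketch using the Python standard library; the names `TextExtractor`, `html_to_text`, and `fetch_text`, as well as the `User-Agent` string, are hypothetical and not part of the repository.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script/style/noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_text(url, timeout=30):
    """Download a page and return its visible text."""
    req = Request(url, headers={"User-Agent": "dataset-collector-sketch"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return html_to_text(resp.read().decode(charset, errors="replace"))
```

For example, `fetch_text("https://www.cdc.gov/floods/about/index.html")` would return the page text, including navigation and footer fragments; a real pipeline would likely need per-site filtering on top of this.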

more general LLM datasets for fine-tuning

https://github.com/Zjh-819/LLMDataHub

path

dataset/rag

related issue

#2

commits

https://github.com/redhat-intel-ai-hackathon-raft-rag/monorepo/commit/d9bc83c3715cc947bd5f2338fc6ea6c94580d531
https://github.com/redhat-intel-ai-hackathon-raft-rag/monorepo/commit/616452ab09544925b164f7ff91c80c60b4b71312

kevinroshann commented 2 weeks ago

Hey, can you explain more clearly? Is it just the Python query to download the dataset?

EichiUehara commented 2 weeks ago

Your task is to come up with how to do the task so that it meets the requirement.