h2oai / h2o-llmstudio

H2O LLM Studio - a framework and no-code GUI for fine-tuning LLMs. Documentation: https://docs.h2o.ai/h2o-llmstudio/
https://h2o.ai
Apache License 2.0

[CODE IMPROVEMENT] Import dataset from Hugging Face #335

Closed pascal-pfeiffer closed 2 months ago

pascal-pfeiffer commented 1 year ago

🔧 Proposed code refactoring

Allow simple import of dataset from Hugging Face

from datasets import load_dataset
import pandas as pd

# Load your dataset
dataset = load_dataset('fka/awesome-chatgpt-prompts')  # replace with the name of your desired dataset

# Convert to pandas DataFrame
df = pd.DataFrame(dataset['train'])

# Save DataFrame to csv
df.to_csv('my_dataset.csv', index=False)

Motivation

Raised by Maher on Discord (https://discord.com/channels/1097462770674438174/1100718594809147402/1136414136495001680)

pritthakkar commented 11 months ago

@pascal-pfeiffer I would like to work on this issue. Could you give me more details on exactly what changes I need to make? Could you also add the hacktoberfest label to this issue?

pascal-pfeiffer commented 11 months ago

Thank you @pritthakkar, this would be great!

What we had in mind is to add another data connector to the "Import dataset" page: [image]

Code-wise, this would need to be added to https://github.com/h2oai/h2o-llmstudio/blob/main/llm_studio/app_utils/sections/dataset.py#L75 ff.

The import page could borrow the style of the Kaggle import page: [image]
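
As a rough illustration of the backend step such a connector might call, here is a minimal sketch that wraps the snippet from the issue description in a reusable function. It is not the LLM Studio implementation: the function name `hf_dataset_to_csv`, its parameters, and the output file naming are all hypothetical. It assumes `load_dataset(name)` returns a mapping of split names to datasets whose rows iterate as dicts, and it writes the CSV with the standard-library `csv` module instead of pandas so the sketch has fewer dependencies. The `loader` argument is an assumption added only so the function can be tried without network access.

```python
import csv
from pathlib import Path


def _default_loader(name):
    # Imported lazily so this module can be loaded without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(name)


def hf_dataset_to_csv(dataset_name, split="train", output_dir=".", loader=None):
    """Download one split of a Hugging Face dataset and save it as a CSV file.

    Hypothetical helper: names and file layout are illustrative assumptions.
    Returns the path of the written CSV file.
    """
    loader = loader or _default_loader
    dataset = loader(dataset_name)

    if split not in dataset:
        raise ValueError(
            f"Split {split!r} not found; available splits: {sorted(dataset)}"
        )

    # Each row is a dict mapping column name to value.
    rows = list(dataset[split])
    if not rows:
        raise ValueError(f"Split {split!r} of {dataset_name!r} is empty")

    # 'user/name' becomes 'user_name' so the dataset id is a valid file name.
    file_name = f"{dataset_name.replace('/', '_')}_{split}.csv"
    out_path = Path(output_dir) / file_name
    out_path.parent.mkdir(parents=True, exist_ok=True)

    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    return out_path
```

Calling `hf_dataset_to_csv('fka/awesome-chatgpt-prompts')` would then produce a CSV in the current directory that can be imported like any uploaded file. Validating the split up front gives the import page a clear error message to show when a dataset has no `train` split.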

I'd suggest 2 fields:

Things to consider: