telexyz / GPT4VN

Ai cũng có thể tự tạo chatbot bằng huấn luyện chỉ dẫn, với 12G GPU (RTX 3060) và khoảng vài chục MB dữ liệu
108 stars 35 forks source link

Các nguồn dữ liệu (GPT4, ShareGPT, Dolly 2.0 ...) #1

Open tiendung opened 1 year ago

tiendung commented 1 year ago

Kho dữ liệu huấn luyện chỉ dẫn, QnA, hội thoại ...

Dữ liệu nổi bật

Khác

tiendung commented 1 year ago

https://github.com/lm-sys/FastChat/issues/90#issuecomment-1493250773

ShareGPT Dataset:

Zipped jsons with 90 000 conversations from sharegpt. Split in two files with 45k each: part 1: https://files.catbox.moe/bhtp9i.zip part 2: https://files.catbox.moe/ahoivx.zip

Format should work as is for training. Use clean tool to remove html markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md

(Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.)

The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset

A pre-cleaned, English only, "unfiltered," and 2048 token split version of the ShareGPT dataset ready for finetuning is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

trinhdoduyhungss commented 1 year ago

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

tiendung commented 1 year ago

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

Nice, em có thể cung cấp dữ liệu dưới dạng nguồn riêng lẻ để mix & match cho dễ đc ko?

trinhdoduyhungss commented 1 year ago

ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link

Nice, em có thể cung cấp dữ liệu dưới dạng nguồn riêng lẻ để mix & match cho dễ đc ko?

Dạ tại đây ạ: https://drive.google.com/drive/folders/156yLw2lZHMu6W4rnEIMNhXx5BLqRz_Yu?usp=sharing

tiendung commented 1 year ago

image https://www.chatorg.ai/blog/chat-language-models-tracker