Open tiendung opened 1 year ago
https://github.com/lm-sys/FastChat/issues/90#issuecomment-1493250773
ShareGPT Dataset:
Zipped JSON files with 90,000 conversations from ShareGPT, split into two files of 45k conversations each: part 1: https://files.catbox.moe/bhtp9i.zip part 2: https://files.catbox.moe/ahoivx.zip
The format should work as-is for training. Use the cleaning tool to remove HTML markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md
(Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.)
The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset
A pre-cleaned, English-only, "unfiltered" version of the ShareGPT dataset, split into 2048-token chunks and ready for finetuning, is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
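Since the raw dump comes as two parts, a minimal sketch for combining them into one training file might look like the following. It assumes each zip has already been extracted to a single JSON file holding a top-level list of conversation records; the file names `part1.json`, `part2.json`, and `sharegpt_merged.json` are hypothetical placeholders, not names taken from the archives.

```python
import json


def merge_sharegpt_parts(part_paths, out_path):
    """Concatenate several JSON list files into one JSON list file.

    Assumption: every input file contains a top-level JSON list of
    conversation records. Returns the number of records written.
    """
    merged = []
    for path in part_paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a JSON list")
        merged.extend(records)
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-English text readable in the output
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)


if __name__ == "__main__":
    import os

    # Hypothetical file names; adjust to whatever the archives extract to.
    parts = ["part1.json", "part2.json"]
    if all(os.path.exists(p) for p in parts):
        total = merge_sharegpt_parts(parts, "sharegpt_merged.json")
        print(f"wrote {total} conversations")
```

The merged file would still need the HTML cleaning step mentioned above if it was built from the raw dump rather than the pre-cleaned version.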
ViQuAD + Alpaca (Vietnamese) + daily_conversation (Vietnamese) + GPT4ALL in Alpaca format (400K+): v1: https://drive.google.com/file/d/1F121M9f2LNy6RgXWlFBSSm626LWzfAvR/view?usp=sharing v2: https://drive.google.com/file/d/1bE0B3Q86uG26A540acrysF-La07E29f4/view?usp=share_link
Nice, could you provide the data as separate individual sources so it's easier to mix & match?
Here you go: https://drive.google.com/drive/folders/156yLw2lZHMu6W4rnEIMNhXx5BLqRz_Yu?usp=sharing
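For anyone who wants to mix & match the individual sources, here is a rough sketch of one way to do it. It assumes each source has been downloaded as a JSON file containing a list of Alpaca-style records (`instruction`, `input`, `output`); the actual layout of the files in the Drive folder has not been checked, and the function name `mix_sources` and the `max_per_source` cap are illustrative choices, not part of any existing tool.

```python
import json
import random


def mix_sources(source_paths, max_per_source=None, seed=0):
    """Combine several JSON list files into one shuffled list.

    Assumption: each file holds a top-level JSON list of records.
    If max_per_source is set, larger sources are randomly downsampled
    to that many records so that no single source dominates the mix.
    """
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = []
    for path in source_paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        if max_per_source is not None and len(records) > max_per_source:
            records = rng.sample(records, max_per_source)
        mixed.extend(records)
    rng.shuffle(mixed)
    return mixed
```

Capping each source is only one possible mixing policy; per-source weights or language filters would slot into the same loop.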
Repository of training data for instruction following, QnA, conversation ...
Featured data
Other