modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Question: does 'load_dataset_from_local' call 'preprocess_func' twice? #2396

Closed: lluo-Desktop closed this issue 1 day ago

lluo-Desktop commented 6 days ago

Hi, I found that the function 'load_dataset_from_local' (ms-swift-main/swift/llm/utils/dataset.py) calls 'preprocess_func' twice:

    dataset_list = []
    for dataset_path in dataset_path_list:
        assert isinstance(dataset_path, str)
        df: DataFrame
        if dataset_path.endswith('.csv'):
            dataset = HfDataset.from_csv(dataset_path, na_filter=False)
        elif dataset_path.endswith('.jsonl') or dataset_path.endswith('.json'):
            dataset = HfDataset.from_json(dataset_path)
        else:
            raise ValueError('The custom dataset only supports CSV, JSONL or JSON format.')
        dataset = preprocess_func(dataset)
        if streaming:
            dataset = dataset.to_iterable_dataset()
        dataset_list.append(preprocess_func(dataset))

Is the preprocess_func call in 'dataset_list.append(preprocess_func(dataset))' redundant? Can I change it to dataset_list.append(dataset)?
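Whether the second call is harmless depends on whether the preprocessing function is idempotent. The following is only a minimal sketch of that distinction: it uses plain lists of dicts instead of HfDataset, and both `rename_query` and `add_prefix` are hypothetical stand-ins, not functions from ms-swift.

```python
def rename_query(rows):
    """Hypothetical idempotent preprocessor: rename 'instruction' to 'query' if present."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        if 'instruction' in row:
            row['query'] = row.pop('instruction')
        out.append(row)
    return out


def add_prefix(rows):
    """Hypothetical non-idempotent preprocessor: prepends a prefix on every call."""
    return [{**row, 'query': 'Q: ' + row['query']} for row in rows]


rows = [{'instruction': 'hi', 'response': 'hello'}]

# Applying the idempotent preprocessor twice gives the same rows as applying it once.
once = rename_query(rows)
twice = rename_query(once)
idempotent_ok = once == twice

# Applying the non-idempotent preprocessor twice changes the data a second time.
p_once = add_prefix(once)
p_twice = add_prefix(p_once)
prefix_changed = p_once != p_twice

print(idempotent_ok, prefix_changed, p_twice[0]['query'])
```

If the real preprocessing function behaves like `rename_query`, the duplicate call would cost time but not change the result; if it behaves like `add_prefix`, the duplicate call would alter the data.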

lluo-Desktop commented 6 days ago

I also found that in '_post_preprocess', preprocess_func is called on both train_dataset and val_dataset:

    res = []
    for dataset in [train_dataset, val_dataset]:
        if dataset is not None and preprocess_func is not None:
            dataset = preprocess_func(dataset)
        if dataset is not None and (streaming or len(dataset) > 0) and remove_useless_columns:
            dataset = _remove_useless_columns(dataset)
        res.append(dataset)

So, is preprocess_func called 3 times when I load a local dataset?
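One way to check a count like this is to wrap the function in a call counter. The sketch below mirrors only the control flow of the two quoted snippets, with plain lists in place of HfDataset. `count_calls`, `load_local`, and `post_preprocess` are hypothetical stand-ins. It also assumes that the same function reaches both places and that there is no validation set, which the thread does not confirm for the real code path.

```python
import functools


def count_calls(func):
    """Wrap func so that the number of invocations is recorded on wrapper.calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def preprocess_func(dataset):
    # Stand-in preprocessor: returns the dataset unchanged.
    return dataset


def load_local(paths, preprocess_func):
    # Mirrors the quoted loop: one call before the append, one inside it.
    dataset_list = []
    for path in paths:
        dataset = [{'path': path}]  # stand-in for HfDataset.from_csv / from_json
        dataset = preprocess_func(dataset)
        dataset_list.append(preprocess_func(dataset))
    return dataset_list


def post_preprocess(train_dataset, val_dataset, preprocess_func):
    # Mirrors the quoted '_post_preprocess' loop, without the column cleanup.
    res = []
    for dataset in [train_dataset, val_dataset]:
        if dataset is not None and preprocess_func is not None:
            dataset = preprocess_func(dataset)
        res.append(dataset)
    return res


loaded = load_local(['train.jsonl'], preprocess_func)
calls_after_load = preprocess_func.calls

post_preprocess(loaded[0], None, preprocess_func)
calls_total = preprocess_func.calls

print(calls_after_load, calls_total)
```

Under these assumptions the counter reads 2 after loading and 3 after post-processing, matching the count in the question. Wrapping the real function the same way in an installed copy would show whether the actual code path behaves identically.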

Jintao-Huang commented 6 days ago

please pip install ms-swift -U

lluo-Desktop commented 6 days ago

> please pip install ms-swift -U

I built ms-swift from source on branch v2.5.0.dev0. Should I update the code to v2.5.2?

Jintao-Huang commented 5 days ago

The version of the main branch is 2.6.0.dev0.
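To confirm which version is actually installed before and after upgrading, the package metadata can be queried from Python. This is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is made up for illustration, and the distribution name 'ms-swift' is taken from the pip command earlier in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


swift_version = installed_version('ms-swift')
print(swift_version or 'ms-swift is not installed in this environment')
```

For a source checkout installed into the environment, the reported value comes from that checkout's package metadata, so it should reflect the branch that was built rather than the latest release.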