Closed lluo-Desktop closed 1 day ago
I also found that in '_post_preprocess', preprocess_func is called on train_dataset and val_dataset:

    res = []
    for dataset in [train_dataset, val_dataset]:
        if dataset is not None and preprocess_func is not None:
            dataset = preprocess_func(dataset)
        if dataset is not None and (streaming or len(dataset) > 0) and remove_useless_columns:
            dataset = _remove_useless_columns(dataset)
        res.append(dataset)
So, is preprocess_func called 3 times when I load a local dataset?
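To make the count concrete, here is a minimal sketch that mirrors only the call pattern of the two quoted snippets. `CountingPreprocess`, `load_stage` and `post_stage` are hypothetical stand-ins, not ms-swift code, and the sketch assumes that the same preprocess_func is passed to both stages and that no validation dataset is given.

```python
class CountingPreprocess:
    """Hypothetical stand-in for preprocess_func that records how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, rows):
        self.calls += 1
        return rows


def load_stage(rows, preprocess_func):
    # Mirrors the two calls quoted from load_dataset_from_local.
    dataset = preprocess_func(rows)
    return preprocess_func(dataset)


def post_stage(train_dataset, val_dataset, preprocess_func):
    # Mirrors the loop quoted from _post_preprocess (column removal omitted).
    res = []
    for dataset in [train_dataset, val_dataset]:
        if dataset is not None and preprocess_func is not None:
            dataset = preprocess_func(dataset)
        res.append(dataset)
    return res


counter = CountingPreprocess()
train = load_stage([{"query": "hi"}], counter)  # two calls
post_stage(train, None, counter)                # one more call for train only
print(counter.calls)
```

Under these assumptions the counter ends at 3 for the training data: two calls in the loading stage and one in the post-processing stage.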
Please run: pip install ms-swift -U
I built ms-swift from source on branch v2.5.0.dev0. Should I update the code to v2.5.2?
The version of the main branch is 2.6.0.dev0.
Hi, I found that the function 'load_dataset_from_local' (ms-swift-main/swift/llm/utils/dataset.py) calls 'preprocess_func' twice:

    dataset_list = []
    for dataset_path in dataset_path_list:
        assert isinstance(dataset_path, str)
        df: DataFrame
        if dataset_path.endswith('.csv'):
            dataset = HfDataset.from_csv(dataset_path, na_filter=False)
        elif dataset_path.endswith('.jsonl') or dataset_path.endswith('.json'):
            dataset = HfDataset.from_json(dataset_path)
        else:
            raise ValueError('The custom dataset only supports CSV, JSONL or JSON format.')
        dataset = preprocess_func(dataset)
        if streaming:
            dataset = dataset.to_iterable_dataset()
        dataset_list.append(preprocess_func(dataset))

Is the second call in 'dataset_list.append(preprocess_func(dataset))' redundant? Can I change it to dataset_list.append(dataset)?
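A small sketch of the proposed change, to show the difference in call count. `load_local`, `count_preprocess_calls` and the toy rows are hypothetical and stand in for the real loader; the file reading and streaming conversion are left out. Whether dropping the second call changes the resulting data depends on whether the real preprocess_func is idempotent, which this sketch does not establish.

```python
from typing import Callable, List

Rows = List[dict]


def load_local(files: List[Rows],
               preprocess_func: Callable[[Rows], Rows],
               apply_fix: bool) -> List[Rows]:
    """Hypothetical reduction of the quoted loop to its preprocess calls."""
    dataset_list = []
    for rows in files:
        dataset = preprocess_func(rows)
        if apply_fix:
            # Proposed change: append the already preprocessed dataset.
            dataset_list.append(dataset)
        else:
            # Quoted behaviour: preprocess a second time before appending.
            dataset_list.append(preprocess_func(dataset))
    return dataset_list


def count_preprocess_calls(apply_fix: bool) -> int:
    calls = 0

    def preprocess_func(rows: Rows) -> Rows:
        nonlocal calls
        calls += 1
        # Toy, idempotent preprocessing: strip whitespace from each text field.
        return [dict(r, text=r["text"].strip()) for r in rows]

    load_local([[{"text": " a "}]], preprocess_func, apply_fix)
    return calls


print(count_preprocess_calls(apply_fix=False))
print(count_preprocess_calls(apply_fix=True))
```

With one input file, the quoted pattern runs the preprocessing twice and the changed pattern runs it once.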