modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 250+ LLMs or 35+ MLLMs. (Qwen2, GLM4, Internlm2, Yi, Llama3, Llava, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://github.com/modelscope/swift/blob/main/docs/source/LLM/index.md
Apache License 2.0

Fix dataset concatenation #1193

Closed tastelikefeet closed 1 week ago

tastelikefeet commented 1 week ago

PR type

PR information

Dataset concatenation may raise the following error:

_check_if_features_can_be_aligned
    raise ValueError(
ValueError: The features can't be aligned because the key history of features {'system': Value(dtype='null', id=None), 'history': Sequence(feature=Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), length=-1, id=None), 'query': Value(dtype='string', id=None), 'response': Value(dtype='string', id=None)} has unexpected type - Sequence(feature=Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), length=-1, id=None) (expected either Sequence(feature=Value(dtype='null', id=None), length=-1, id=None) or Value("null").

This happens because one dataset contains only empty or None values in a column (here, history), while another dataset contains normal values in the same column, so arrow_dataset infers different feature types for that column and cannot align them.

How to solve:

Reduce the columns of each dataset after it is instantiated and before the concatenation.
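
The idea can be sketched without any library, using plain dicts of columns. This is only an illustration of one plausible reading of "reduce column" (dropping columns that hold no real values), not the code changed in this PR. The names is_empty, reduce_columns and concat are hypothetical, and the sample rows are made up to mirror the features in the error above.

```python
from typing import Any, Dict, List

Columns = Dict[str, List[Any]]  # column name -> list of row values


def is_empty(value: Any) -> bool:
    """Treat None and empty strings/containers as 'no value'."""
    return value is None or (
        isinstance(value, (list, tuple, str, dict)) and len(value) == 0
    )


def reduce_columns(columns: Columns) -> Columns:
    """Drop every column whose values are all empty (hypothetical helper)."""
    return {
        name: values
        for name, values in columns.items()
        if not all(is_empty(v) for v in values)
    }


def concat(datasets: List[Columns]) -> Columns:
    """Concatenate row-wise, padding columns a dataset lacks with None."""
    names: List[str] = []
    for ds in datasets:
        for name in ds:
            if name not in names:
                names.append(name)
    merged: Columns = {name: [] for name in names}
    for ds in datasets:
        n_rows = len(next(iter(ds.values()), []))
        for name in names:
            merged[name].extend(ds.get(name, [None] * n_rows))
    return merged


# Dataset A: 'system' and 'history' carry no real values.
ds_a = {"system": [None], "history": [[]], "query": ["hi"], "response": ["hello"]}
# Dataset B: 'history' holds normal nested values.
ds_b = {"system": [None], "history": [[["q1", "r1"]]], "query": ["q2"], "response": ["r2"]}

# Reduce each dataset first, then concatenate.
reduced = [reduce_columns(ds) for ds in (ds_a, ds_b)]
merged = concat(reduced)
```

With Hugging Face datasets, the corresponding step would presumably be Dataset.remove_columns on each dataset before calling concatenate_datasets. That mapping is an assumption here, not something stated in this PR.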

Experiment results

Paste your experiment results here (if needed).