windspirit95 closed 1 year ago
Hi, you can load multiple data subsets at the same time by replacing the data_dir argument in the load_dataset call with a list passed to data_files:
dataset = load_dataset(
    'bigcode/the-stack',
    # one glob pattern per language subset
    data_files=["data/c-sharp/*", "data/python/*"],
    split=args.split,
    use_auth_token=True,
    # num_proc is not used in streaming mode
    num_proc=args.num_workers if not args.streaming else None,
    streaming=args.streaming,
)
But note that the model might have trouble generalizing to multiple languages at the same time (and it was already pre-trained on Python).
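If the set of languages changes often, the list passed to data_files can be built from the language names instead of being written by hand. This is only a rough sketch: build_data_files and its root parameter are hypothetical names, not part of the datasets library, and it assumes the data/<language>/* layout used in the call above.

```python
def build_data_files(languages, root="data"):
    """Return one glob pattern per language sub-directory, e.g. data/python/*.

    Hypothetical helper: assumes each language lives under <root>/<language>/.
    """
    if not languages:
        raise ValueError("at least one language is required")
    # Drop duplicates while keeping the order the caller gave.
    unique = list(dict.fromkeys(languages))
    return [f"{root}/{lang}/*" for lang in unique]


data_files = build_data_files(["c-sharp", "python"])
print(data_files)  # ['data/c-sharp/*', 'data/python/*']

# The result can then replace the hard-coded list, for example:
# dataset = load_dataset("bigcode/the-stack", data_files=data_files, split=args.split, ...)
```

The load_dataset call is left as a comment because it needs network access and an authenticated Hugging Face account.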
Thank you for your help 👍
Hi, I am wondering if it is possible to load a dataset covering multiple languages (c-sharp, python) for finetuning? Do I need to modify the code to do that? Thank you ^^