loubnabnl / santacoder-finetuning

Fine-tune SantaCoder for Code/Text Generation.
Apache License 2.0
184 stars 23 forks

Finetuning on multiple language #4

Closed: windspirit95 closed this issue 1 year ago

windspirit95 commented 1 year ago

Hi, I am wondering if it is possible to load data for multiple languages (c-sharp, python) for finetuning. Do I need to modify the code to do that? Thank you ^^

loubnabnl commented 1 year ago

Hi, you can load multiple data subsets at the same time by replacing the data_dir argument of load_dataset with a list of glob patterns passed to data_files:

# load_dataset comes from the Hugging Face `datasets` library
dataset = load_dataset(
    "bigcode/the-stack",
    # one glob pattern per language subset
    data_files=["data/c-sharp/*", "data/python/*"],
    split=args.split,
    use_auth_token=True,
    num_proc=args.num_workers if not args.streaming else None,
    streaming=args.streaming,
)

But note that the model might have trouble generalizing to multiple languages at the same time (and it was already pre-trained on Python).
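
If you want to switch languages without editing the list by hand, the data_files patterns from the snippet above can be built from language names. This is only a minimal sketch: `build_data_files` and `load_languages` are hypothetical helpers, not part of this repo, and they assume each language sits under `data/<language>/` in the dataset, as in the two patterns above.

```python
from typing import List


def build_data_files(languages: List[str]) -> List[str]:
    """Turn language names into glob patterns such as 'data/python/*'."""
    # Assumption: every language subset lives under data/<language>/.
    return [f"data/{lang}/*" for lang in languages]


def load_languages(languages: List[str], split: str = "train", streaming: bool = True):
    """Load several language subsets of bigcode/the-stack in one call."""
    # Imported lazily so build_data_files works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "bigcode/the-stack",
        data_files=build_data_files(languages),
        split=split,
        streaming=streaming,
        # Same flag as the snippet above; I believe newer `datasets`
        # releases expect `token=True` instead, so adjust if this errors.
        use_auth_token=True,
    )


print(build_data_files(["c-sharp", "python"]))
# → ['data/c-sharp/*', 'data/python/*']
```

Calling `load_languages(["c-sharp", "python"])` should then behave like the call above, provided you are logged in to the Hub and have access to the dataset.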

windspirit95 commented 1 year ago

Thank you for your help 👍