hezarai / hezar

The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
https://hezarai.github.io/hezar/
Apache License 2.0
817 stars 44 forks

Using a dataset built through builder script for training purposes. #154

Closed mahdiyehebrahimi closed 2 weeks ago

mahdiyehebrahimi commented 2 months ago

Hi, first I would like to thank you for your tremendous work with this library.

The problem I'm having right now is that I would like to fine-tune one of your BERT models on a custom dataset. The training example that you've provided here uses one of your already-prepared datasets, which is served on the HF Hub with an accompanying dataset config file. However, I've built my own dataset with the help of a dataset script from one of your templates provided here.

Now the problem is specifically in the following lines from the train_text_classification.py script:

train_dataset = Dataset.load(dataset_path, split="train", tokenizer_path=base_model_path)
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)
model = BertTextClassification(BertTextClassificationConfig(id2label=train_dataset.config.id2label))

As I have created my datasets using the load_dataset method, I'm facing two issues. First, I can't pass a tokenizer_path to load_dataset, and second, datasets built using load_dataset don't seem to have a config attribute, so train_dataset.config.id2label doesn't exist.

I was wondering if there was a way to convert a dataset built using a script with load_dataset to a Hezar dataset or if there's a way to modify the training script such that it works with a custom dataset.

arxyzan commented 2 months ago

Hi @mahdiyehebrahimi. Glad to hear your feedback! You are right, there should be some sort of function to convert datasets or create a config file for them for better flexibility. What I've been doing so far is creating a dataset_config.yaml file and uploading it to the repo. A dataset config file for text classification is as simple as this:

name: text_classification
config_type: dataset
task: text_classification
label_field: label
text_field: text

Save this as a dataset_config.yaml file and upload it to the root of your repo.
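As for the missing train_dataset.config.id2label, you can also construct that mapping yourself from the label names. A dependency-free sketch (the label names below are illustrative; with an HF dataset you would typically read them from the ClassLabel feature, e.g. dataset.features["label"].names):

```python
# Build an `id2label` mapping from a list of label names.
# The names here are illustrative; in practice they'd come from your dataset,
# e.g. `dataset.features["label"].names` for a HF dataset with a ClassLabel column.
label_names = ["negative", "neutral", "positive"]
id2label = {i: name for i, name in enumerate(label_names)}
print(id2label)  # {0: 'negative', 1: 'neutral', 2: 'positive'}
```

You can then pass this dict to the model config directly instead of reading it from the dataset object.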

It is worth mentioning that you don't need to create datasets in Hezar's specific format and upload them to the Hub to be able to fine-tune a model. You can always subclass one of the Dataset modules like TextClassificationDataset or ImageCaptioningDataset, or even the Dataset class itself (for full control). Then you can easily fine-tune your model on your data like below:

from datasets import load_dataset  # needed by `_load` below

from hezar.data import TextClassificationDataset, TextClassificationDatasetConfig

class CustomClassificationDataset(TextClassificationDataset):
    def __init__(self, config: TextClassificationDatasetConfig, split, **kwargs):
        super().__init__(config=config, split=split, **kwargs)
        # Customize the tokenizer, dataset properties, etc.

    def _load(self, split):
        # Implement your own dataset loading; it can be offline too.
        data = load_dataset(self.config.path, split=split)
        # Convert to a dataframe, preprocess rows and columns, anything you want
        return data

    # You can also customize the `__getitem__` method
    def __getitem__(self, index):
        text, label = self.data[index]  # This syntax only works if `self._load()` returns a HF Dataset
        # Use `text, label = self.data.iloc[index]` instead if `self.data` is a pandas dataframe
        # Just make sure the output of this method is a dictionary containing `token_ids`,
        # `labels`, etc., depending on what your model expects.

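To make the __getitem__ contract above concrete, here is a dependency-free toy (the token ids are fake stand-ins; in a real subclass they come from the tokenizer):

```python
# Toy dataset illustrating the dict contract `__getitem__` should satisfy:
# each item is a dictionary with `token_ids`, `labels`, etc.
class ToyDataset:
    def __init__(self, samples):
        self.data = samples  # list of (text, label) pairs

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        text, label = self.data[index]
        fake_token_ids = [ord(c) for c in text]  # stand-in for real tokenization
        return {"token_ids": fake_token_ids, "labels": label}

ds = ToyDataset([("hi", 1), ("no", 0)])
print(ds[0])  # {'token_ids': [104, 105], 'labels': 1}
```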
Let me know if this helps or there's anything else I can help you with.

arxyzan commented 2 weeks ago

For anyone crossing here, there are now tutorials on creating custom datasets for Hezar in the docs. See https://hezarai.github.io/hezar/tutorial/datasets.html