hezarai / hezar

The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
https://hezarai.github.io/hezar/
Apache License 2.0

Problem in loading the dataset using the pre-trained model #152

Closed Mostafa79modaqeq closed 2 weeks ago

Mostafa79modaqeq commented 3 months ago

Hello, thanks for your efforts in building this powerful library. I built a dataset completely similar to the "hezarai/persian-license-plate-v1" dataset and also changed the related settings (path, etc.) in the config files. When I try to load this dataset with the pre-trained model "hezarai/crnn-fa-64x256-license-plate-recognition" as tokenizer_path, a problem occurs. Thanks.

eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)

Downloading data: 100%|██████████| 14.5k/14.5k [00:00<00:00, 39.3kB/s]
Downloading data: 100%|██████████| 14.5k/14.5k [00:00<00:00, 31.2kB/s]
Downloading data: 100%|██████████| 14.5k/14.5k [00:00<00:00, 37.4kB/s]
Generating train split: 2 examples [00:00, 352.12 examples/s]
Generating validation split: 2 examples [00:00, 215.20 examples/s]
Generating test split: 2 examples [00:00, 502.10 examples/s]

.../myenv/Lib/site-packages/hezar/data/datasets/ocr_dataset.py:139

    135 for i, sample in enumerate(list(iter(data))):
    136     path, text = sample.values()
--> 137     if len(text) <= self.config.max_length and is_text_valid(text, self.config.id2label.values()):
    138         valid_indices.append(i)
    139     else:

TypeError: object of type 'int' has no len()

arxyzan commented 3 months ago

Hello @Mostafa79modaqeq, thanks for the feedback ❤ As far as I can tell, this error can only be caused by the column order in path, text = sample.values() being reversed, so that len(text) raises this error (text ends up holding the sample index rather than the actual text). This code can help you check the order of the columns:

from datasets import load_dataset

data = load_dataset(dataset_path, split="test")
print(data[0])

The output should look something like this:

{'image_path': 'path/to/image.jpg', 'label': 'label_of_image'}

But yours is probably in reverse order or completely different.
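If it is just the column names or order that differ, a quick sketch like the one below can help you confirm that against the reference dataset (this assumes the reference dataset also loads with load_dataset and that select_columns is available in your datasets version; the actual fix would then be to rename/reorder the columns in your own CSV files to match):

from datasets import load_dataset

# Compare your dataset's columns against the reference dataset
reference = load_dataset("hezarai/persian-license-plate-v1", split="test")
mine = load_dataset(dataset_path, split="test")  # dataset_path is your own dataset path

print(reference.column_names)  # expected, e.g. ['image_path', 'label']
print(mine.column_names)       # yours

# If only the order differs, reordering in memory is enough to verify the hypothesis
mine = mine.select_columns(reference.column_names)
print(mine[0])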

Note that you can also use your own custom dataset class so that everything is under your control. See the example below:

import pandas as pd

from hezar.models import CRNNImage2TextConfig, CRNNImage2Text
from hezar.preprocessors import ImageProcessor
from hezar.trainer import Trainer, TrainerConfig

from hezar.data import OCRDataset, OCRDatasetConfig

class PersianOCRDataset(OCRDataset):
    def __init__(self, config: OCRDatasetConfig, split=None, **kwargs):
        super().__init__(config=config, split=split, **kwargs)

    def _load(self, split=None):
        # Load a dataframe here and make sure the split is fetched
        data = pd.read_csv(self.config.path)
        # preprocess if needed
        return data

    def __getitem__(self, index):
        # Do anything you want with your data, just make sure the output is a dictionary of "pixel_values" and "labels"
        sample = self.data.iloc[index]
        path, text = sample[self.config.images_paths_column], sample[self.config.text_column]
        pixel_values = self.image_processor(path, return_tensors="pt")["pixel_values"][0]
        labels = self._text_to_tensor(text)
        inputs = {
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return inputs

dataset_config = OCRDatasetConfig(
    path="path/to/csv",
    text_split_type="char_split",
    text_column="label",
    images_paths_column="image_path",
    reverse_digits=True,
)

train_dataset = PersianOCRDataset(dataset_config, split="train")
eval_dataset = PersianOCRDataset(dataset_config, split="test")

model = CRNNImage2Text(
    CRNNImage2TextConfig(
        id2label=train_dataset.config.id2label,
        map2seq_in_dim=1024,
        map2seq_out_dim=96
    )
)
preprocessor = ImageProcessor(train_dataset.config.image_processor_config)

train_config = TrainerConfig(
    output_dir="crnn-plate-fa-v1",
    task="image2text",
    device="cuda",
    batch_size=8,
    num_epochs=20,
    metrics=["cer"],
    metric_for_best_model="cer"
)

trainer = Trainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
    preprocessor=preprocessor,
)
trainer.train()
Mostafa79modaqeq commented 3 months ago

Thanks for the advice. Now I have another question. Since my data has a structure completely similar to the Hezar dataset ("hezarai/persian-license-plate-v1"), in order to define separate sets for training and evaluation I have to give each split its own dedicated CSV file. In that case I would implement this in the _load() of the PersianOCRDataset class, something similar to the following:

csv_files = {
    "train": "path-of-persian_license_plate_train.csv",
    "test": "path-of-persian_license_plate_test.csv",
    "val": "path-of-persian_license_plate_val.csv"
}
csv_file_path = csv_files.get(split)
data = pd.read_csv(csv_file_path)

So what is the role of the path argument of OCRDatasetConfig? When I read the CSV files this way (without passing path when initializing OCRDatasetConfig), this error occurs:

preprocessor = ImageProcessor(train_dataset.config.image_processor_config)

.../hezar/preprocessors/image_processor.py:81

     81     Initializes the ImageProcessor.

.../hezar/preprocessors/preprocessor.py:28

     27 def __init__(self, config: PreprocessorConfig, **kwargs):
     28     verify_dependencies(self, self.required_backends)  # Check if all the required dependencies are installed
--->        self.config = config.update(kwargs)

AttributeError: 'NoneType' object has no attribute 'update'

It's probably because it can't find a .yaml file for the ImageProcessor config, or an object to initialize the parameters from. How can I solve this problem?

arxyzan commented 3 months ago

Hi @Mostafa79modaqeq. Sorry for my late response (GitHub does not notify me if I'm not @mentioned in the issue). Your method is actually pretty solid. The only thing is that your dataset config needs to receive an image_processor_config value, which is an instance of the ImageProcessorConfig dataclass, and the previous code I gave you actually misses it too! I don't know how you have defined the other parameters in your dataset config, but a sample like the one below should do the trick:

import pandas as pd

from hezar.models import CRNNImage2TextConfig, CRNNImage2Text
from hezar.preprocessors import ImageProcessor, ImageProcessorConfig
from hezar.trainer import Trainer, TrainerConfig

from hezar.data import OCRDataset, OCRDatasetConfig

class PersianOCRDataset(OCRDataset):
    def __init__(self, config: OCRDatasetConfig, split=None, **kwargs):
        super().__init__(config=config, split=split, **kwargs)

    def _load(self, split=None):
        # Load a dataframe here and make sure the split is fetched
        data = pd.read_csv(self.config.path)
        # preprocess if needed
        return data

    def __getitem__(self, index):
        # Do anything you want with your data, just make sure the output is a dictionary of "pixel_values" and "labels"
        sample = self.data.iloc[index]
        path, text = sample[self.config.images_paths_column], sample[self.config.text_column]
        pixel_values = self.image_processor(path, return_tensors="pt")["pixel_values"][0]
        labels = self._text_to_tensor(text)
        inputs = {
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return inputs

dataset_config = OCRDatasetConfig(
    path="path/to/csv",
    text_split_type="char_split",
    text_column="label",
    images_paths_column="image_path",
    reverse_digits=True,
    image_processor_config=ImageProcessorConfig(
        gray_scale=True,
        mean=[0.6595],
        std=[0.1501],
        mirror=True,
        rescale=1/255.0,
        size=(256, 64),
    )
)

train_dataset = PersianOCRDataset(dataset_config, split="train")
eval_dataset = PersianOCRDataset(dataset_config, split="test")

model = CRNNImage2Text(
    CRNNImage2TextConfig(
        id2label=train_dataset.config.id2label,
        map2seq_in_dim=1024,
        map2seq_out_dim=96
    )
)
model.preprocessor = train_dataset.image_processor

train_config = TrainerConfig(
    output_dir="crnn-plate-fa-v1",
    task="image2text",
    device="cuda",
    batch_size=8,
    num_epochs=20,
    metrics=["cer"],
    metric_for_best_model="cer"
)

trainer = Trainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
)
trainer.train()
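
(Note that this version attaches the image processor to the model directly via model.preprocessor = train_dataset.image_processor instead of passing preprocessor= to the Trainer as in the previous snippet.)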
Mostafa79modaqeq commented 3 months ago

Hello @arxyzan, I sincerely appreciate your responsiveness; I understand how valuable it is given your workload. I apologize for my frequent questions, which come from my lack of experience in programming. I have made the suggested changes and started training, and the following error occurs: KeyError.txt

Training info:

Output Directory: crnn-plate-fa-v1
Task: image2text
Model: CRNNImage2Text
Init Weights: N/A
Device(s): cpu
Batch Size: 8
Epochs: 20
Training Dataset: PersianOCRDataset(path=ocr['train'], size=7962)
Evaluation Dataset: PersianOCRDataset(path=ocr['test'], size=995)
Optimizer: adam
Scheduler: None
Initial Learning Rate: 2e-05
Learning Rate Decay: 0.0
Number of Parameters: 9269001
Number of Trainable Parameters: 9269001
Mixed Precision: Full (fp32)
Metrics: ['cer']
Checkpoints Path: crnn-plate-fa-v1\checkpoints
Logs Path: crnn-plate-fa-v1\logs\Mar17_12-05-35_DESKTOP-EL4M7VQ

ChatGPT suggests that I change the _text_to_tensor method of the OCRDataset class. Is that correct? Thanks a lot.

arxyzan commented 3 months ago

@Mostafa79modaqeq This error occurs because a character in your data (\u200d, the zero-width joiner) is not present in the list of available labels. You can inspect the id2label dictionary:

print(train_dataset.config.id2label)
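
To see exactly which characters appear in your labels but are missing from id2label, a quick check like this should work (assuming your training CSV and its label column from earlier):

import pandas as pd

# Find characters present in the labels but missing from id2label
df = pd.read_csv("path-of-persian_license_plate_train.csv")  # your training CSV
known_chars = set(train_dataset.config.id2label.values())
unknown_chars = set("".join(df["label"].astype(str))) - known_chars
print(unknown_chars)  # should reveal characters like '\u200d'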

You can also extract all the desired labels from your dataset and pass them to the dataset config like below:

...
# Extract id2label from your dataset (df is the dataframe read from your CSV file(s))
labels_set = list(set("".join(df["label"])))
id2label = {i: c for i, c in enumerate(labels_set)}

dataset_config = OCRDatasetConfig(
    path="path/to/csv",
    text_split_type="char_split",
    text_column="label",
    images_paths_column="image_path",
    # 
    id2label=id2label,  # PASS ID2LABEL SO THAT THE KEY ERROR DOES NOT HAPPEN ANYMORE
    # 
    reverse_digits=True,
    image_processor_config=ImageProcessorConfig(
        gray_scale=True,
        mean=[0.6595],
        std=[0.1501],
        mirror=True,
        rescale=1/255.0,
        size=(256, 64),
    )
)
...
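
Alternatively, if the zero-width joiner carries no meaning for your plates, a one-liner sketch like this (assuming the same df as above) removes it from the labels before you build id2label:

# Strip zero-width joiners from the label column before building id2label
df["label"] = df["label"].astype(str).str.replace("\u200d", "", regex=False)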