Generate OCR dataset using trdg for Persian

hezarai / hezar

The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!

https://hezarai.github.io/hezar/

Apache License 2.0


Generate OCR dataset using trdg for Persian #79

Closed by arxyzan 7 months ago

arxyzan commented 10 months ago

The code for generating the datasets is in this repo: https://github.com/hezarai/trdg-persian

arxyzan commented 9 months ago

I generated a dataset of 4 million samples for training CRNN. The dataset is huge, and unfortunately I haven't managed to upload it to the Hub yet.

arxyzan commented 7 months ago

I generated another dataset of 4 million samples to train our new CRNN model at https://huggingface.co/hezarai/crnn-fa-printed-96-long, but the zipped dataset is 12 GB. I have no idea how we can upload a dataset this large to the Hub given that our network speed is 2 MB/s at most! I'm labeling this issue as "community help required".
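For anyone who wants to help with the upload problem, one common workaround on a slow or unstable connection is to split the archive into smaller parts, so that a failed transfer only costs one part instead of the whole 12 GB. Below is a minimal sketch using only the Python standard library. The function names (`split_file`, `join_files`, `sha256_of`), the `.partNNNN` naming, and the 256 MiB part size are assumptions made for illustration; nothing in this thread says this approach was used, and the actual upload to the Hub is deliberately left out.

```python
import hashlib
from pathlib import Path
from typing import List


def split_file(src: Path, out_dir: Path, part_size: int = 256 * 1024 * 1024) -> List[Path]:
    """Split `src` into numbered parts of at most `part_size` bytes each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with src.open("rb") as f:
        index = 0
        while True:
            # Note: each part is read fully into memory before being written.
            chunk = f.read(part_size)
            if not chunk:
                break
            # Zero-padded index keeps the parts in order when sorted by name.
            part = out_dir / f"{src.name}.part{index:04d}"
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts: List[Path], dest: Path) -> None:
    """Concatenate the parts (sorted by name) back into a single file."""
    with dest.open("wb") as out:
        for part in sorted(parts):
            out.write(part.read_bytes())


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()
```

Each part can then be uploaded and retried independently, and whoever downloads them can call `join_files` and compare `sha256_of` against the digest of the original archive to confirm nothing was corrupted.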

arxyzan commented 7 months ago

I pushed a 200k-sample version of the dataset to https://huggingface.co/datasets/hezarai/parsynth-ocr-200k. The release of the full 4M version is not feasible right now, so I'm closing this.