add coco-text as a test/train set

mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

https://mindee.github.io/doctr/

Apache License 2.0

add coco-text as a test/train set #1131

Open Thomas-MMJ opened 1 year ago

Thomas-MMJ commented 1 year ago

🚀 The feature

You might consider adding COCO-Text as one of the supported datasets:

https://vision.cornell.edu/se3/coco-text-2/#download

Motivation, pitch

It is another high-quality dataset, with text on objects at various angles (sides of vehicles, signs, etc.).

Alternatives

No response

Additional context

No response

felixdittrich92 commented 1 year ago

Hey @Thomas-MMJ 👋 ,

Thanks for the request! Would you like to add it yourself? If so, I'm happy to guide you if you need any help :)

dvando commented 8 months ago

Hi @felixdittrich92, has anybody worked on this? I'd love to hop into the project and contribute to this issue. :)

felixdittrich92 commented 8 months ago

Hey @dvando 👋,

No, it's still open. Sure, feel free to work on it. If you have any questions or need some help, contact me :)

dvando commented 6 months ago

Hi @felixdittrich92, my apologies that it took me a while to actually start on this; I've been dealing with some issues at work.

I have a question about the download URLs. COCO-Text has two separate URLs, one for the images and one for the labels, but VisionDataset only accepts a single URL, which I believe is expected to point to one compressed archive containing both the images and their labels.

I also checked the other datasets (funsd, cord, synthtext, etc.), and all of them initialize VisionDataset with a single URL. I was thinking about merging the files myself, but I'm not sure that's the right thing to do. (Changing the base class should not be an option, I believe.)

Sorry, and thanks in advance. :)
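To make the "merge the files myself" idea concrete, here is a minimal sketch of bundling a locally downloaded image folder and label file into one archive that a single URL could then serve. The function name `bundle_dataset` and the `images/` and `labels/` layout inside the archive are assumptions for illustration only, not something docTR prescribes.

```python
import zipfile
from pathlib import Path


def bundle_dataset(img_folder: str, label_path: str, archive_path: str) -> int:
    """Hypothetical helper: pack images and the label file into one zip.

    Returns the number of files written. `archive_path` should live
    outside `img_folder` so the archive does not try to include itself.
    """
    img_dir, label_file = Path(img_folder), Path(label_path)
    count = 0
    # ZIP_STORED: JPEG images are already compressed, so skip recompression.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for img in sorted(img_dir.iterdir()):
            if img.is_file():
                zf.write(img, arcname=f"images/{img.name}")
                count += 1
        zf.write(label_file, arcname=f"labels/{label_file.name}")
        count += 1
    return count
```

The resulting archive would still have to be hosted somewhere, which is where the dataset's size and license start to matter.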

felixdittrich92 commented 6 months ago

Hi @dvando 😄 No stress ^^

Option 1: You could take a look at https://github.com/mindee/doctr/blob/main/doctr/datasets/imgur5k.py (here the user needs to provide the paths to the data, and we provide only the loader).

Option 2: What's the dataset size in MB / GB? What's the license? If neither is a problem, we could combine the dataset and upload it :)
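A minimal sketch of what Option 1 could look like: the user downloads the images and the label file and passes both paths to a loader. This is not docTR's implementation. The function name `load_cocotext` is hypothetical, and the annotation layout it reads (top-level `imgs` and `anns` maps, with `file_name`, `set`, `image_id`, `bbox` as `[x, y, width, height]`, and `utf8_string`) is an assumption about the COCO-Text JSON that should be checked against the downloaded file.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple

# A box is (xmin, ymin, xmax, ymax); a sample is (image path, [(box, text), ...]).
Box = Tuple[float, float, float, float]
Sample = Tuple[Path, List[Tuple[Box, str]]]


def load_cocotext(img_folder: str, label_path: str, train: bool = True) -> List[Sample]:
    """Hypothetical helper: pair locally downloaded images with COCO-Text labels."""
    img_dir, label_file = Path(img_folder), Path(label_path)
    if not img_dir.is_dir() or not label_file.is_file():
        raise FileNotFoundError(f"unable to locate {img_dir} or {label_file}")

    with label_file.open(encoding="utf-8") as f:
        labels = json.load(f)

    wanted = "train" if train else "val"
    # Assumed layout: "anns" maps annotation id -> annotation dict.
    by_image: Dict[str, List[Tuple[Box, str]]] = {}
    for ann in labels["anns"].values():
        text = ann.get("utf8_string", "")
        if not text:
            continue  # skip regions without a transcription
        x, y, w, h = ann["bbox"]  # assumed [x, y, width, height]
        box = (x, y, x + w, y + h)
        by_image.setdefault(str(ann["image_id"]), []).append((box, text))

    # Assumed layout: "imgs" maps image id -> metadata with "file_name" and "set".
    samples: List[Sample] = []
    for img_id, meta in labels["imgs"].items():
        if meta.get("set") != wanted or img_id not in by_image:
            continue
        samples.append((img_dir / meta["file_name"], by_image[img_id]))
    return samples
```

With this approach nothing is downloaded or re-hosted by the library, so the dataset's size and license only concern the user who fetches it.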

dvando commented 6 months ago

So with option 1, users would download the images and the labels themselves? That sounds okay. The dataset is about 13 GB and is licensed under CC BY 4.0.

Both sound fine to me, which one do you prefer @felixdittrich92 ? :)

felixdittrich92 commented 6 months ago

Option 1 👍

felixdittrich92 commented 6 months ago

For reference, here is the PR: https://github.com/mindee/doctr/pull/1359 :)