tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

[data request] Adobe Visual Font Recognition (VFR) #431

Open InduManimaran opened 5 years ago

InduManimaran commented 5 years ago

Folks who would also like to see this dataset in tensorflow/datasets, please thumbs-up so the developers can know which requests to prioritize.

And if you'd like to contribute the dataset (thank you!), see our guide to adding a dataset.

sklan commented 5 years ago

@Conchylicultor I'd like to contribute this data set.

Conchylicultor commented 5 years ago

Thank you. Added you as a collaborator.

SiavasFiroozbakht commented 5 years ago

@sklan thank you for contributing, did you have the chance to look into this already?

@Conchylicultor, I am keen to get this done over the next few days, so please add me to the list as well. Thank you!

sklan commented 5 years ago

@SiavasFiroozbakht I am almost done with adding the BCF version of the data. However, I have not worked with the raw images.

SiavasFiroozbakht commented 5 years ago

@sklan this is very good news. Should we put both versions in the same dataset? If so, would it be possible to share your current scripts so I can integrate the raw images too?

sklan commented 5 years ago

@SiavasFiroozbakht I'll share it once I am done with the BCF.

SiavasFiroozbakht commented 5 years ago

@sklan it would have been useful to see the current structure of your script, I am close to half-way done with the raw images but don't want to make it more difficult to merge after.

sklan commented 5 years ago

@SiavasFiroozbakht https://github.com/sklan/datasets/tree/vfr

SiavasFiroozbakht commented 5 years ago

Thanks @sklan. As AdobeVFR has two types of data (synthetic and real) and two formats (bcf, raw), how do you suggest we implement this so users could select each type?

I am thinking of creating a config for each type of data, but not for the formats. Users are unlikely to care how the data is stored.

Also, it seems from the folder names in BCF Format that one is for synthetic and the other for real-world images. From your testing so far were you able to see if this is a typo? The updated readme file in Dropbox specifies BCF format is used for synthetic only.

@Conchylicultor would be useful to hear your thoughts as well on how to split the dataset as above, should there be any conventions for this already.

sklan commented 5 years ago

@SiavasFiroozbakht You make a good point. Maybe we should go with a synthetic and real data format.

> Also, it seems from the folder names in BCF Format that one is for synthetic and the other for real-world images. From your testing so far were you able to see if this is a typo? The updated readme file in Dropbox specifies BCF format is used for synthetic only.

I believe it is a typo. We could try emailing the creator to clarify.

SiavasFiroozbakht commented 5 years ago

@sklan good to see we are agreeing on that.

Have you already tested the BCF code? I am running some tests and getting the error AttributeError: 'PngImageFile' object has no attribute 'read'. Is there an updated version on your side? I have refactored some parts, such as the _split_generators method, to make it reusable for the raw type as well, but it might be better to let you finish with BCF first and then merge.

Also sent an email regarding the folder naming issue. 👍

Conchylicultor commented 5 years ago

If the images are from the same domain (same size, format, labels, ...), it probably makes sense to have a single dataset with multiple splits ('real_train', 'real_test', 'synthetic_train', ...).

Otherwise, it would make more sense to have two separate configs, or even two different datasets.

It comes down to one question: can the same model be trained on the synthetic data and evaluated on the real data (without any change to the model)? If so, then the dataset should contain both synthetic and real data as different splits.

Concerning the format, I'm not familiar with BCF and raw. But if those are images, tfds.features.Image will store them as encoded PNG files in any case, so the original format does not matter. (The conversion to np.array still happens in the _generate_examples function.) The only thing that matters to the user is the decoded (height, width, channel) image that tfds returns.

I hope this helps. I'm not sure if this answers the questions.
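To make the single-dataset option concrete, here is a minimal plain-Python sketch of how the source folders named in this thread could map onto split names. The split names on the right are hypothetical, chosen only to follow the 'real_train' / 'synthetic_train' pattern suggested above, and nothing here is part of tensorflow_datasets.

```python
# Source folder names as they appear in this thread. The split names are
# hypothetical and only illustrate a "<domain>_<subset>" convention.
# Assumption: the "u" in VFR_real_u stands for "unlabeled".
FOLDER_TO_SPLIT = {
    "VFR_real_test": "real_test",
    "VFR_real_u": "real_unlabeled",
    "VFR_syn_train": "synthetic_train",
    "VFR_syn_eval": "synthetic_eval",
}


def splits_for_domain(domain):
    """Return the sorted split names belonging to one domain."""
    prefix = domain + "_"
    return sorted(s for s in FOLDER_TO_SPLIT.values() if s.startswith(prefix))


real_splits = splits_for_domain("real")
synthetic_splits = splits_for_domain("synthetic")
```

With a naming scheme like this, a user could train on the synthetic splits and evaluate on the real ones without switching datasets.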

SiavasFiroozbakht commented 5 years ago

Any updates on the BCF format, @sklan? I have the raw images working through tfds and was thinking of looking into BCF as well if you don't have time, but I also don't want to repeat good work already done.

@Conchylicultor thank you for the useful info. In regard to AdobeVFR / DeepFont, its architecture is a bit more special: both the synthetic and real world data are used in training some parts of the architecture either separately or in conjunction (i.e. the Stacked Convolutional Autoencoder, or the other higher-level CNNs). This entails that although the nature of the input is different, they are used complementarily to gain better results for both domains.

In fact, the paper explains 5 methods for training the SCAE: variations of synthetic training with different levels of augmentation, plus using real-world images only and a combination of the two. From this standpoint, it may be better to bundle all these options in one dataset with multiple configs, directly mapped to these variations (so 5 configs in total). What do you both think of this approach, given that image augmentation would be generated randomly at runtime with an additional dependency (e.g. imgaug)?

In regard to the possible typo @sklan, Atlas Wang (the author of DeepFont) has responded and confirmed that the VFR_real_test folder in BCF Format does have the correct name, which is a duplicate of the relevant Raw Images subfolder. This means the synthetic data is not split already and we would provide it to the user as-is, for them to segment it as they wish.

sklan commented 5 years ago

@SiavasFiroozbakht Just finished exams. I'll finish it by tomorrow I guess.

SiavasFiroozbakht commented 5 years ago

@sklan hope you smashed them all! Let me know if I can help with anything.

SiavasFiroozbakht commented 5 years ago

@sklan quick update: I managed to get a working version of BCF on my side as well. The errors that I mentioned above (invalid arguments for the .read() and .seek() operations) were caused by integer overflow. Make sure to use np.int64 instead of np.int32; I wanted to mention that so you don't have to spend too much time debugging it.
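The overflow described above can be illustrated with a minimal sketch. It is not the code from either branch: the blob sizes are hypothetical, and ctypes.c_int32 stands in for a 32-bit accumulator so that the example needs only the standard library. The point is that a running byte offset past 2**31 - 1 wraps to a negative number, which .seek() then rejects.

```python
import ctypes


def cumulative_offsets(sizes, wrap32=False):
    """Return the start offset of each blob, given the blob sizes in bytes."""
    offsets = []
    total = 0
    for size in sizes:
        offsets.append(total)
        total += size
        if wrap32:
            # Simulate keeping the running total in a signed 32-bit integer.
            total = ctypes.c_int32(total).value
    return offsets


# Hypothetical sizes: two blobs of 1.5 GB each, then a small one.
sizes = [1_500_000_000, 1_500_000_000, 10]

offsets_64 = cumulative_offsets(sizes)               # wide enough: correct
offsets_32 = cumulative_offsets(sizes, wrap32=True)  # third offset wraps
```

Here the third offset should be 3,000,000,000, but the 32-bit version goes negative, which matches the symptom of invalid .seek() arguments on a file larger than 2 GiB.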

@Conchylicultor would be great to still hear your thoughts on the questions above once you have the time. Also, would you mind adding me as a contributor as well so that I can commit these changes together with @sklan?

sklan commented 5 years ago

> @sklan hope you smashed them all! Let me know if I can help with anything.

I hope the same. :D

@SiavasFiroozbakht I finished the addition of the synthetic data. We'll open a PR once @Conchylicultor responds.

Conchylicultor commented 5 years ago

> In fact, the paper explains 5 methods for training the SCAE: variations of synthetic training with different levels of augmentation, plus using real-world images only and a combination of the two. From this standpoint, it may be better to bundle all these options in one dataset with multiple configs, directly mapped to these variations (so 5 configs in total). What do you both think of this approach, given that image augmentation would be generated randomly at runtime with an additional dependency (e.g. imgaug)?

@sklan In this case, having one config for each data augmentation (+1 default config without data augmentation) sounds good. Sorry for the late answer.
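The "one config per augmentation, plus a default" layout could look roughly like the sketch below. It uses a plain dataclass rather than any tensorflow_datasets class, and the class name, the build_configs helper, and the augmentation names are all hypothetical placeholders, not the variants described in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class VfrConfigSketch:
    """Hypothetical stand-in for a dataset config."""
    name: str
    augmentation: Optional[str] = None  # None means no augmentation


def build_configs(augmentations: Sequence[str]) -> List[VfrConfigSketch]:
    """Return the default config followed by one config per augmentation."""
    configs = [VfrConfigSketch(name="default")]
    configs += [VfrConfigSketch(name=a, augmentation=a) for a in augmentations]
    return configs


# Placeholder augmentation names, for illustration only.
configs = build_configs(["aug_a", "aug_b"])
```

Keeping the unaugmented data as the default config means a user who does not pick a config still gets the dataset as distributed.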

cyfra commented 5 years ago

@sklan - what is the status of this dataset? (In your last update you mentioned that you were about to send the PR.)

sklan commented 5 years ago

I think I finished it. I forgot to open a PR.

On Fri, Aug 23, 2019 at 12:01 AM Marcin Michalski notifications@github.com wrote:

> @sklan - what is the status of this dataset ? (in last update you mentioned that you're about to send the PR).

sklan commented 5 years ago

It'll take me a few days to respond to all the PRs.

exnx commented 5 years ago

Hi everyone, thanks for all the great work!

I had a few questions:

  1. Is the VFR_real_u data available as BCF as well? If not, when training the stacked convolutional autoencoder, would you shuffle the synthetic and real data together, or train on one dataset type at a time?
  2. Does anyone know if anyone has worked on a PyTorch implementation of DeepFont? I'd love to see someone's TF implementation as well.
  3. I had some questions about the original paper's preprocessing of the data. For the "squeezing" operation, it looks like all images have a fixed height of 105 and a width of 2.5 * height. Is that right? It seems like it would introduce a good deal of distortion.
  4. After the squeezing operation, is the idea to sample a 105x105 patch randomly from the normalized image?

Thanks guys!
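The preprocessing described in questions 3 and 4 can be sketched as follows, taking the question's reading at face value: squeeze to a height of 105 and a width of 2.5 * height, then crop a random 105x105 patch. This is an assumption about the paper's procedure, not a confirmed implementation; the function names are hypothetical, and truncating 262.5 to 262 is an arbitrary choice made here.

```python
import random

PATCH = 105   # patch height and width, as stated in the question
RATIO = 2.5   # width-to-height ratio after squeezing, as stated in the question


def squeezed_size(height=PATCH, ratio=RATIO):
    """Return (width, height) after squeezing; the width is truncated."""
    return int(height * ratio), height


def random_patch_box(width, height, patch=PATCH, rng=random):
    """Return a (left, top, right, bottom) box for one random square patch."""
    left = rng.randint(0, width - patch)  # randint includes both endpoints
    top = rng.randint(0, height - patch)
    return left, top, left + patch, top + patch


width, height = squeezed_size()
box = random_patch_box(width, height, rng=random.Random(0))
```

The box uses the (left, top, right, bottom) convention, so it can be handed to an image library's crop routine once the image has been resized to the squeezed size.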

Bobbyphtr commented 4 years ago

Hi everyone, I want to use the dataset for an experiment with TensorFlow. I still get an ImportError, so I wonder what the status of this dataset is now.

Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow_datasets as tfds
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bobbyphtr/Downloads/datasets-vfr/tensorflow_datasets/__init__.py", line 52, in <module>
    from tensorflow_datasets import image
  File "/Users/bobbyphtr/Downloads/datasets-vfr/tensorflow_datasets/image/__init__.py", line 19, in <module>
    from tensorflow_datasets.image.adobe_vfr import AdobeVfr
ImportError: cannot import name 'AdobeVfr'

can anyone resolve this problem? @sklan

Thanks guys!

iamtekeste commented 4 years ago

@sklan I would love to know if you need help creating that PR 💙

FBEMPSS commented 3 years ago

When I convert train.bcf to images, I get an error: IOError: cannot identify image file <StringIO.StringIO instance at 0x7f3a11798870>. The error occurs when the 1687th font is converted.

maryamag85 commented 3 years ago

Is there a VFR dataset somewhere that I can use? I am unable to open the Dropbox link, and there is nothing in TensorFlow Datasets.

Just-Another-AI-Guy commented 3 years ago

The dropbox link for the VFR dataset https://www.dropbox.com/sh/o320sowg790cxpe/AADDmdwQ08GbciWnaC20oAmna?dl=0 shows Error 404. The one available on https://utexas.app.box.com/s/l2uz8pguls37akk1gzsqly59wtoj8jcq is incomplete, it only contains VFR_real_test data with 4385 images whereas VFR_syn_train, VFR_syn_eval and VFR_real_u are missing.

Is the entire dataset available anywhere? If not, could @sklan or @SiavasFiroozbakht upload and share a copy, if you downloaded one before the Dropbox link went down? It is very urgent, as it is a one-of-a-kind dataset for VFR problems.

Prosquid1 commented 2 years ago

Hello, any update on this issue, @SiavasFiroozbakht? I'm working with your project.

abdul-omneky commented 12 months ago

@FBEMPSS Can you help me with the code snippet that converts train.bcf to images?

Tao-Cute commented 11 months ago

> @FBEMPSS Can you help me with the code snippet that converts the train.bcf to images

Did you solve this problem? I want to convert the .bcf format to images too.
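For readers looking for a starting point, here is a minimal sketch of a BCF reader. It assumes the layout used by community readers for this dataset: an 8-byte unsigned file count, then one 8-byte unsigned size per file, then the file bytes concatenated in order. That layout and the little-endian byte order are assumptions that are not confirmed in this thread, so check them against the dataset's readme. The write_bcf helper exists only to build a tiny in-memory file for the demonstration.

```python
import io
import struct


def write_bcf(stream, blobs):
    """Write blobs in the assumed layout: count, per-file sizes, then data."""
    stream.write(struct.pack("<Q", len(blobs)))
    for blob in blobs:
        stream.write(struct.pack("<Q", len(blob)))
    for blob in blobs:
        stream.write(blob)


def read_bcf(stream):
    """Read every blob from a stream that follows the assumed layout."""
    (count,) = struct.unpack("<Q", stream.read(8))
    sizes = struct.unpack("<%dQ" % count, stream.read(8 * count))
    # Python integers do not overflow, so large files are not a problem here.
    return [stream.read(size) for size in sizes]


# Round trip on a tiny in-memory example with placeholder payloads.
buffer = io.BytesIO()
write_bcf(buffer, [b"first-image-bytes", b"second"])
buffer.seek(0)
blobs = read_bcf(buffer)
```

If the layout assumption holds, each returned blob is one encoded image. It can be written straight to disk, or decoded with an image library, for example Pillow's Image.open(io.BytesIO(blob)).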