tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Fooling OCR Systems with Adversarial Text Images [Published Paper] #1700

Open ghost opened 6 years ago

ghost commented 6 years ago

Recently, I read a research paper called "Fooling OCR Systems with Adversarial Text Images". Basically, it states that making minor changes to an image can hinder and fool the OCR engine. They used Tesseract 4 as an example.

My question: is there a way to defend against such a thing? @theraysmith @amitdo @egorpugin @Shreeshrii @stweil

stweil commented 6 years ago

From the paper: "The adversarial examples in this paper were developed for the latest version of Tesseract, a popular open-source OCR system based on deep learning. They do not transfer to the legacy version of Tesseract, which employs character-based recognition".

So using both OCR engines might help against such adversarial text images.
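As a rough illustration of that idea, one could run the same image through both engines and flag images where the two transcriptions diverge. The sketch below only covers the comparison step and assumes the two strings were already produced (for example with `tesseract --oem 1` for the LSTM engine and `tesseract --oem 0` for the legacy engine, which needs traineddata that still contains the legacy model). The function name `engines_disagree` and the 0.8 threshold are illustrative assumptions, not part of Tesseract.

```python
from difflib import SequenceMatcher


def engines_disagree(lstm_text: str, legacy_text: str, min_ratio: float = 0.8) -> bool:
    """Return True when two OCR transcriptions of the same image differ noticeably.

    min_ratio is a hypothetical threshold; it would need tuning on real data.
    """
    # Normalise whitespace and case so layout differences are not counted.
    a = " ".join(lstm_text.split()).lower()
    b = " ".join(legacy_text.split()).lower()
    # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge.
    return SequenceMatcher(None, a, b).ratio() < min_ratio


if __name__ == "__main__":
    print(engines_disagree("approve the payment", "approve the  payment"))
    print(engines_disagree("approve the payment", "decline the payment"))
```

A flagged image is not proof of an attack, since the two engines also disagree on ordinary hard inputs. It is only a signal that the result deserves a closer look.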

I was not able to reproduce the results from the article. It would be good to get the original images, software version and traineddata which were used.

amitdo commented 6 years ago

> My question: is there a way to defend against such a thing?

My guess is that you can solve it by making sure that a certain percentage of the images in the training dataset have this filter applied.
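A minimal sketch of that suggestion, under stated assumptions: images are plain lists of rows of 8-bit grey values, and `perturb` is a hypothetical stand-in that adds small bounded random noise. It is not the optimisation-based perturbation from the paper, which would have to be plugged in instead. The names `perturb` and `augment_fraction` and the `max_delta` value are made up for illustration.

```python
import random


def perturb(image, rng, max_delta=8):
    """Hypothetical stand-in filter: add small bounded noise to each pixel."""
    return [
        [min(255, max(0, px + rng.randint(-max_delta, max_delta))) for px in row]
        for row in image
    ]


def augment_fraction(images, fraction, seed=0):
    """Apply perturb() to a randomly chosen fraction of the training images.

    Returns the new list of images and the set of indices that were perturbed.
    """
    rng = random.Random(seed)
    count = round(len(images) * fraction)
    chosen = set(rng.sample(range(len(images)), count))
    augmented = [
        perturb(img, rng) if i in chosen else img for i, img in enumerate(images)
    ]
    return augmented, chosen


if __name__ == "__main__":
    dataset = [[[128] * 4 for _ in range(3)] for _ in range(10)]
    augmented, chosen = augment_fraction(dataset, fraction=0.3)
    print(sorted(chosen))
```

Which percentage works best is an open question here and would have to be found by experiment.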

amitdo commented 6 years ago

> I was not able to reproduce the results from the article. It would be good to get the original images, software version and traineddata which were used.

CC: @csong27 (Main author of the said paper)

ghost commented 6 years ago

Other than exposures "-3 -2 -1 0 1 2 3", what other commands add degradation to the generated images?

amitdo commented 6 years ago

The 'exposures' are meant to be used to train the legacy engine.

For LSTM training, see #1052.

ghost commented 6 years ago

@amitdo

stweil commented 6 years ago

text2image should already do that:

$ text2image --help|grep -i degrade
  --degrade_image  Degrade rendered image with speckle noise, dilation/erosion and rotation  (type:bool default:true)

amitdo commented 6 years ago

@stweil,

With --degrade_image this function will be called:

// Degrade the pix as if by a print/copy/scan cycle with exposure > 0
// corresponding to darkening on the copier and <0 lighter and 0 not copied.
// If rotation is not nullptr, the clockwise rotation in radians is saved there.
// The input pix must be 8 bit grey. (Binary with values 0 and 255 is OK.)
// The input image is destroyed and a different image returned.
struct Pix* DegradeImage(struct Pix* input, int exposure, TRand* randomizer,
                         float* rotation);

But together with the new LSTM code, this new function appeared:

// Creates and returns a Pix distorted by various means according to the bool
// flags. If boxes is not nullptr, the boxes are resized/positioned according to
// any spatial distortion and also by the integer reduction factor box_scale
// so they will match what the network will output.
// Returns nullptr on error. The returned Pix must be pixDestroyed.
Pix* PrepareDistortedPix(const Pix* pix, bool perspective, bool invert,
                         bool white_noise, bool smooth_noise, bool blur,
                         int box_reduction, TRand* randomizer,
                         GenericVector<TBOX>* boxes);

Ray said about PrepareDistortedPix() (newer method):

> It is used internally at Google. Text2image could be modified to use it too.

I think PrepareDistortedPix() is similar to the degradation methods ocropy uses.
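To make the flag names in that signature more concrete, here is a toy sketch of three of the distortions (invert, white noise, blur) on an 8-bit grey image stored as a list of rows. This is not Tesseract's implementation: the behaviour of each step, the order in which they are applied, the 3x3 box blur and the `sigma` value are all assumptions chosen for illustration, and the perspective, smooth noise and box handling parts are left out.

```python
import random


def invert_pixels(image):
    """Swap dark and light: 0 becomes 255 and vice versa."""
    return [[255 - px for px in row] for row in image]


def add_white_noise(image, rng, sigma=10.0):
    """Add independent Gaussian noise to every pixel, clamped to 0..255."""
    return [
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in image
    ]


def box_blur(image):
    """Replace each pixel by the mean of its 3x3 neighbourhood (clipped at edges)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(round(sum(window) / len(window)))
        out.append(row)
    return out


def distort(image, rng, invert=False, white_noise=False, blur=False):
    """Apply the selected distortions; the order is an arbitrary choice."""
    out = [row[:] for row in image]
    if invert:
        out = invert_pixels(out)
    if white_noise:
        out = add_white_noise(out, rng)
    if blur:
        out = box_blur(out)
    return out


if __name__ == "__main__":
    glyph = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
    print(distort(glyph, random.Random(1), white_noise=True, blur=True))
```

For training, the flags would typically be drawn at random per sample so that the network sees a mix of clean and distorted lines.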

amitdo commented 6 years ago

> Do you mean, when training Tesseract 4 lstm there is no use of using exposures "-x+1 ..."?

You can use it, but it seems that the newer method is more suitable for the LSTM model.

Currently, there is no code that actually calls the newer method.

ghost commented 6 years ago

> Currently, there is no code that actually calls the newer method.

:sunglasses: @theraysmith I invoke your presence

amitdo commented 6 years ago

He's on the beach right now... :sunglasses:

jbreiden commented 6 years ago

There is exactly one call to PrepareDistortedPix() internally at Google.

amitdo commented 6 years ago

@jbreiden, what's the value of box_reduction?

ghost commented 6 years ago

Any updates regarding Data Augmentation?

ghost commented 6 years ago

NVIDIA Labs has some interesting implementations of data degradation:
https://github.com/tmbdev/das2018-tutorial/blob/master/40-augmentation.ipynb
https://github.com/NVlabs/ocrodeg
https://github.com/NVlabs/ocropus3

Shreeshrii commented 6 years ago

Info about Calamari OCR at https://github.com/tesseract-ocr/tesseract/issues/1782#issuecomment-411018986

Thanks @christophered for the info.

ghost commented 6 years ago

An interesting research paper called Noise2Noise has been released by NVIDIA. It shows a new method of cleaning and denoising images using a model trained only on noisy images, not clean ones. It's amazing! This means that such a model could actually understand what noise is. @theraysmith @stweil @egorpugin @amitdo Do you see it ever being implemented in Tesseract?
https://www.youtube.com/watch?v=P0fMwA3X5KI
https://arxiv.org/pdf/1803.04189.pdf
https://news.developer.nvidia.com/ai-can-now-fix-your-grainy-photos-by-only-looking-at-grainy-photos/
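For intuition on why training against noisy targets can work at all, here is a tiny numeric sketch of the statistical argument as I understand it: the value that minimises squared error against many noisy copies of a pixel is their mean, and with zero-mean noise that mean approaches the clean value. This is a one-pixel toy, not the network from the paper, and the noise level, sample count and function name `l2_minimizer` are arbitrary assumptions.

```python
import random


def l2_minimizer(noisy_targets):
    """The constant that minimises the sum of squared errors is the mean."""
    return sum(noisy_targets) / len(noisy_targets)


# Toy setup: one clean grey value observed many times with zero-mean noise.
rng = random.Random(0)
clean = 100.0
noisy = [clean + rng.gauss(0.0, 20.0) for _ in range(2000)]

# No clean sample is ever used, yet the estimate lands close to the clean value.
estimate = l2_minimizer(noisy)

if __name__ == "__main__":
    print(f"clean={clean}, estimate from noisy targets only={estimate:.2f}")
```

The argument relies on the noise averaging out to zero, so it says nothing about structured degradations or deliberately crafted adversarial perturbations.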

ghost commented 6 years ago

If implemented in Tesseract, this would mean that we won't need to add noise or degradation to our data while training, because Tesseract would already have its own model to recognize noise and degradation. Tesseract would understand the concept of noise.