We should gather a set of sample images that are representative of what is commonly posted on Reddit and include them in the repo as test data.
Our goal is to near-perfectly transcribe screenshots of Twitter posts. Other images can be included as stretch goals.
I've implemented an initial pre-processing function for images. I have one result to share: a Twitter image that went through the pre-processing, so we can see how Tesseract performed on it. This is obviously a small sample size, but we may still be able to draw some takeaways from it.
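For reference, a minimal sketch of how this comparison might be run with pytesseract. The preprocess() function and the file name here are placeholders, not the actual implementation from the repo:

```python
# Minimal sketch of the comparison, assuming Pillow and pytesseract are installed.
# preprocess() stands in for the pre-processing function described above; it is
# not the real implementation.
from PIL import Image
import pytesseract

def preprocess(image):
    # Placeholder: return the image unchanged. The real function would clean
    # up the image before OCR (see the techniques discussed later in this thread).
    return image

original = Image.open("tweet_screenshot.png")  # hypothetical sample image
processed = preprocess(original)

print("Without pre-processing:")
print(pytesseract.image_to_string(original))

print("With pre-processing:")
print(pytesseract.image_to_string(processed))
```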
Here is the original image -
Here is the image after being processed -
Here is the text extracted without pre-processing -
And finally here is the text extracted after the pre-processing -
Some takeaways -
After the processing, some of the extraneous characters that might cause us difficulty in transcribing are dropped. However, the author of the tweet and the date/time of the tweet are completely lost.
It does not appear that the actual 'meat' of the tweet is changed much at all. This does not surprise me, because the text is easily readable in the original tweet. Where this processing would come in handy is when the text is difficult to make out; in that case it will probably improve readability.
As of now I am not certain whether the processing technique I used is a net positive; we will have to discuss this. Perhaps a different processing technique would produce better results.
Here is a different method of pre-processing the image. In the above, I converted the image to greyscale and then processed the pixels one by one, assigning each to either black or white depending on which it was closest to. In the following snip, I increased the size of the original image a little but did not change the colors of the pixels -
In my opinion this seems a little better, even though it captured some garbage at the bottom corresponding to the 'hearts', 'retweets', etc. It did correctly identify the '!' at the end of the tweet, which the other technique incorrectly identified as an 'l'.
Interestingly, combining the two techniques (increasing the size and converting to greyscale) produced a worse result than when I tried them both separately.
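For concreteness, here is a rough sketch of the two pre-processing approaches described above, assuming Pillow and pytesseract. The threshold value, resize factor, and file name are guesses for illustration, not the values actually used:

```python
# Sketch of the two pre-processing techniques discussed in this thread:
# 1) greyscale + per-pixel black/white assignment, 2) enlarging the image.
from PIL import Image
import pytesseract

def binarize(image, threshold=128):
    # Convert to greyscale, then push each pixel to pure black or pure white
    # depending on which it is closer to.
    grey = image.convert("L")
    return grey.point(lambda p: 255 if p >= threshold else 0)

def upscale(image, factor=2):
    # Enlarge the image without otherwise changing pixel colors.
    return image.resize((image.width * factor, image.height * factor),
                        Image.LANCZOS)

img = Image.open("tweet_screenshot.png")            # hypothetical sample image
print(pytesseract.image_to_string(binarize(img)))   # technique 1
print(pytesseract.image_to_string(upscale(img)))    # technique 2
```

If a fixed cutoff turns out to be fragile across images, an adaptive threshold (e.g. Otsu's method) could be swapped in for the hard-coded value.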
The second processing method does have pretty good results. However, I wonder what effect the pre-processing has on images that were very hard for Tesseract to process in the first place. Twitter posts are relatively easy; maybe we could try it on some handwriting or non-screenshots?
I just tested it on some handwriting. It appears that the size-increase method works best for transcription, as it did with the Twitter post; however, Tesseract still has a lot of difficulty with handwriting. Here is the original image -
Transcribing with no processing at all -
Transcribing after increasing size -
Describe the bug
In some cases the bot is not able to discern the text within an image. Either it can see only some of the text within an image or none at all.
To Reproduce
Steps to reproduce the behavior:
textifyBot.py
Expected behavior
The bot is expected to transcribe the text within an image accurately.
Additional context
There are some potential fixes out there. Processing the image to make the text easier to see against the background is an option. Blurring the background or converting the image to black and white have produced good results in other projects.
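As a hedged sketch of those fixes (not the bot's current code), something along these lines could be tried with OpenCV and pytesseract; the kernel size and file name are illustrative assumptions:

```python
# Sketch: blur away background noise, then convert to black and white before OCR.
import cv2
import pytesseract

img = cv2.imread("post_screenshot.png")          # hypothetical input image
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)     # drop color information
blurred = cv2.GaussianBlur(grey, (5, 5), 0)      # smooth out background texture
# Otsu's method picks the black/white cutoff automatically.
_, bw = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

print(pytesseract.image_to_string(bw))
```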