mvoelk / ssd_detectors

SSD-based object and text detection with Keras, SSD, DSOD, TextBoxes, SegLink, TextBoxes++, CRNN
MIT License
302 stars 85 forks

SL_end2end_predict.ipynb fails on converting to .py with necessary modifications. #52

Closed anindya7 closed 4 years ago

anindya7 commented 4 years ago

Hi, I have ported the SL_end2end_predict.ipynb to a .py file that loads only user images and gets predictions from them.

I am getting an output tensor of shape (1, 5461, 18) for a single image. These are the summarized values:

```
[[ 0.9778447   0.02215527  0.05647869 ...  0.39081636  0.2786811   0.7213189 ]
 [ 0.9876583   0.01234164  0.17384888 ...  0.3960406   0.4638915   0.53610843]
 [ 0.9857997   0.01420039  0.1859522  ...  0.35515028  0.5405665   0.4594335 ]
 ...
 [ 0.99148715  0.00851282  0.8232149  ...  0.02318294  0.9777579   0.02224212]
 [ 0.99127495  0.00872501 -0.4267429  ...  0.0168435   0.98193043  0.01806954]
 [ 0.98890346  0.01109659 -0.55322266 ...  0.02168869  0.97793293  0.02206702]]

[[0.9778447  0.02215527]
 [0.9876583  0.01234164]
 [0.9857997  0.01420039]
 ...
 [0.99148715 0.00851282]
 [0.99127495 0.00872501]
 [0.98890346 0.01109659]]

[[ 0.05647869  0.25386062 -0.7133617   0.8498816   0.20088424]
 [ 0.17384888  0.65766203 -0.7351028   0.9332367   0.14610018]
 [ 0.1859522   1.1206706  -0.6635226   0.9163737   0.17239925]
 ...
 [ 0.8232149  -1.1342822  -3.839357   -1.5365617   0.0083948 ]
 [-0.4267429  -0.5418013  -3.4509156  -1.672849   -0.05473372]
 [-0.55322266 -0.27144217 -0.29643777 -0.03384027  0.6844612 ]]

[[0.9789397  0.02106032 0.9673921  ... 0.03583498 0.9666423  0.03335765]
 [0.98277885 0.01722118 0.9819172  ... 0.02573826 0.97589934 0.02410063]
 [0.97954214 0.0204578  0.97924906 ... 0.02678417 0.975162   0.02483795]
 ...
 [0.97945476 0.02054522 0.9785146  ... 0.01981203 0.97710925 0.02289074]
 [0.9828745  0.01712552 0.9792094  ... 0.01742494 0.97953975 0.02046019]
 [0.9779886  0.02201145 0.9780286  ... 0.02184003 0.9782518  0.0217482 ]]

[[0.33596185 0.6640381  0.30375123 ... 0.39081636 0.2786811  0.7213189 ]
 [0.6212505  0.37874946 0.12344692 ... 0.3960406  0.4638915  0.53610843]
 [0.53112626 0.46887374 0.17611167 ... 0.35515028 0.5405665  0.4594335 ]
 ...
 [0.9766356  0.02336443 0.9767829  ... 0.02318294 0.9777579  0.02224212]
 [0.98007303 0.019927   0.97990566 ... 0.0168435  0.98193043 0.01806954]
 [0.9779488  0.02205116 0.9778461  ... 0.02168869 0.97793293 0.02206702]]
```

The issue is that `sl_utils.py:304`,

```python
confs = segment_labels[:,1]
```

extracts

```
[0.02215527 0.01234164 0.01420039 ... 0.00851282 0.00872501 0.01109659]
```

which do not look like confidence values. Is my model output incorrect because of the input image?
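As a side note, here is a toy numpy sketch of what I understand that line to compute. This is not the repo's code: the (1, 5461, 31) output shape is taken from this thread, the assumption that channels 0 and 1 per anchor form a (background, text) score pair comes from my reading of `sl_utils.py`, and the dummy scores are made up:

```python
import numpy as np

# Toy stand-in for the detector output: (batch, anchors, channels).
# The real SegLink model emits (1, 5461, 31); the channel layout here
# (channels 0/1 = background/text score per segment) is an assumption.
preds = np.zeros((1, 5461, 31), dtype=np.float32)
preds[0, :, 0] = 0.98  # made-up background scores
preds[0, :, 1] = 0.02  # made-up text scores

segment_labels = preds[0, :, 0:2]  # (background, text) pair per anchor
confs = segment_labels[:, 1]       # text confidence, as in sl_utils.py:304

print(confs.shape)  # one confidence value per anchor
```

With scores like these, every segment would fall below any reasonable confidence threshold, which matches what I am seeing.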

My input is:

```python
for img_path in glob.glob('./examples_images/*'):
    img = cv2.imread(img_path)
    images_orig.append(np.copy(img))
    h, w = image_size
    resized_img = cv2.resize(img, (w, h), interpolation=cv2.INTER_LINEAR)
    resized_img = resized_img[:, :, (2, 1, 0)] / 255  # BGR to RGB
    images.append(resized_img)

images = np.asarray(images)

preds = det_model.predict(images, batch_size=1, verbose=1)
```

Attached my python file: sl_crnn.py.txt

Thank you once again for a very helpful repo. Would appreciate your kind help on this.

mvoelk commented 4 years ago

Your code runs fine for me. The output of the SegLink model should be of shape (1, 5461, 31). Did you change the model?... TF version?

anindya7 commented 4 years ago

The model is untouched. Could you please post your python and tf versions? I shall replicate your environment.

mvoelk commented 4 years ago

Thank you for answering my question with another question... 3.7.5, 2.3.0 ;)

anindya7 commented 4 years ago

With Python 3, the model output shape is indeed (1, 5461, 31). I am also using TensorFlow 2.3.0. All the confidences are lower than 0.1, which is unusual.

mvoelk commented 4 years ago

Can you provide the example image? It is quite usual that most of an image is non-text.

anindya7 commented 4 years ago

I tested it: it works for certain images but not for others. Of the attached images, it works for penny_drop.png but not for PANcardmasked.png.

mvoelk commented 4 years ago

Is it an option to fine-tune the detector with annotated real world data?

anindya7 commented 4 years ago

Unfortunately annotated data is not available. It seems that the difference in the text between the two images is that the latter does not have a strict edge. The colour of the text 'bleeds' or 'diffuses' into the neighbouring pixels. I shall try preprocessing: normalizing and/or thresholding. That should accentuate the characters.
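For what it's worth, a numpy-only sketch of that preprocessing idea (the 127 cutoff and the min-max stretch are placeholders; something like OpenCV's `cv2.adaptiveThreshold` would likely cope better with uneven lighting):

```python
import numpy as np

def sharpen_text(gray):
    """Contrast-stretch a 2-D grayscale image to 0..255, then apply a
    hard threshold to accentuate soft-edged characters. The threshold
    value is a rough guess and would need tuning per document."""
    g = gray.astype(np.float32)
    g = (g - g.min()) / max(float(g.max() - g.min()), 1e-6) * 255.0
    return np.where(g > 127, 255, 0).astype(np.uint8)
```

Running the detector on the binarized image (converted back to 3 channels) should show whether the diffuse edges are really the problem.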

mvoelk commented 4 years ago

I noticed a similar issue with motion blur on webcam images. Adding Gaussian blur to the data augmentation should fix it, but it requires retraining the models.
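That augmentation could be sketched roughly like this (pure numpy for illustration; in a real pipeline `cv2.GaussianBlur` inside the existing augmentation step would be the natural choice, and the sigma range is a guess):

```python
import numpy as np

def random_gaussian_blur(img, max_sigma=2.0, rng=None):
    """Blur a 2-D grayscale image with a random sigma so the detector
    also sees soft, motion-blur-like text during training."""
    rng = rng or np.random.default_rng()
    sigma = rng.uniform(0.0, max_sigma)
    if sigma < 1e-3:
        return img
    # Build a 1-D Gaussian kernel and exploit separability
    radius = int(3 * sigma) + 1
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    out = img.astype(np.float32)
    # Convolve rows, then columns
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, out)
    return out.astype(img.dtype)
```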

You could also try to pad the images. Most of the text instances in the SynthText dataset are smaller.
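The padding idea could look like this (the factor 2 and the white fill are my guesses; the point is just to make the text smaller relative to the input, closer to the text sizes in SynthText):

```python
import numpy as np

def pad_to_shrink_text(img, scale=2, fill=255):
    """Center the image on a larger canvas so text occupies a smaller
    fraction of the detector input."""
    h, w = img.shape[:2]
    canvas = np.full((h * scale, w * scale) + img.shape[2:], fill, dtype=img.dtype)
    y0, x0 = (h * (scale - 1)) // 2, (w * (scale - 1)) // 2
    canvas[y0:y0 + h, x0:x0 + w] = img
    return canvas
```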