tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

'±' is recognised as '+' #4286

Open · DominicMukilan opened 3 months ago

DominicMukilan commented 3 months ago

Current Behavior

No response

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

tesseract v5.4.0.20240606 leptonica-1.84.1 libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2 Found AVX2 Found AVX Found FMA Found SSE4.1 Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6

Operating System

Windows 11

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

Tesseract recognises '±' as '+'. In some places it fails to recognise the character at all.

Python 3.12

stweil commented 3 months ago

Which model / language did you use?

DominicMukilan commented 3 months ago

Python 3.12, IDE: PyCharm

reprex:

```python
import json
import cv2
import pytesseract
from PIL import Image
import pandas as pd

# Load the JSON file
json_path = "new_pred.json"
with open(json_path, "r") as file:
    annotations = json.load(file)

# Extract all coordinates without filtering
coordinates = [annotation["box"] for annotation in annotations]

# Load the image
image_path = "new_pred.jpg"
image = cv2.imread(image_path)

# Load the image with PIL to get its dimensions
image_pil = Image.open(image_path)
image_width, image_height = image_pil.size

# Crop image regions based on coordinates and perform OCR with boundary check
def crop_and_ocr_with_boundary_check(image, coordinates, image_width, image_height):
    ocr_results = []
    skipped_coordinates = []
    for i, (x1, y1, x2, y2) in enumerate(coordinates):
        # Adjust the coordinates to be within image boundaries
        original_coords = (x1, y1, x2, y2)
        x1 = max(0, min(x1, image_width - 1))
        y1 = max(0, min(y1, image_height - 1))
        x2 = max(0, min(x2, image_width))
        y2 = max(0, min(y2, image_height))

        # Check if the box is too small
        if x2 - x1 < 5 or y2 - y1 < 5:
            skipped_coordinates.append((i, original_coords, "Too small"))
            continue

        # Crop the region from the image
        cropped_img = image[y1:y2, x1:x2]

        # Perform OCR on the cropped image
        text = pytesseract.image_to_string(cropped_img)

        # Append the OCR result
        ocr_results.append({
            "coordinates": (x1, y1, x2, y2),
            "text": text.strip()  # Remove leading/trailing whitespace
        })

    return ocr_results, skipped_coordinates

# Perform OCR on the annotated regions with boundary check
ocr_results, skipped_coordinates = crop_and_ocr_with_boundary_check(
    image, coordinates, image_width, image_height)

# Convert OCR results to a DataFrame
ocr_df = pd.DataFrame(ocr_results)

# Print debugging information
print(f"Total annotations in JSON: {len(annotations)}")
print(f"Total OCR results: {len(ocr_results)}")
print(f"Skipped coordinates: {len(skipped_coordinates)}")
for skip in skipped_coordinates:
    print(f"  Index: {skip[0]}, Coordinates: {skip[1]}, Reason: {skip[2]}")

# Display the DataFrame
print(ocr_df)

# Optionally, save the results to a CSV file
ocr_df.to_csv("ocr_results.csv", index=False)
print("Results saved to ocr_results.csv")

# Print image dimensions
print(f"Image dimensions: {image_width}x{image_height}")
```
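The boundary check inside the loop above can be pulled out into a small, independently testable helper; a minimal sketch (the function name `clamp_box` and the `min_size` parameter are my own, not from the original script):

```python
def clamp_box(box, image_width, image_height, min_size=5):
    """Clamp an (x1, y1, x2, y2) box to the image bounds.

    Returns the clamped box, or None if the clamped box is
    smaller than min_size pixels in either dimension.
    """
    x1, y1, x2, y2 = box
    x1 = max(0, min(x1, image_width - 1))
    y1 = max(0, min(y1, image_height - 1))
    x2 = max(0, min(x2, image_width))
    y2 = max(0, min(y2, image_height))
    if x2 - x1 < min_size or y2 - y1 < min_size:
        return None
    return (x1, y1, x2, y2)

# A box hanging over the right edge of a 10200x6600 image is clamped:
print(clamp_box((10100, 100, 10400, 200), 10200, 6600))  # (10100, 100, 10200, 200)
# A degenerate box is rejected:
print(clamp_box((50, 50, 52, 52), 10200, 6600))          # None
```

Factoring the clamp out this way makes it easy to assert on edge cases without loading any image.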

stweil commented 3 months ago

Please also add your image (or its URL if it is online) to this issue report.

DominicMukilan commented 3 months ago

Python output:

```
Total annotations in JSON: 50
Total OCR results: 50
Skipped coordinates: 0
                 coordinates              text
0   (1763, 5732, 2293, 5861)
1   (1785, 5974, 2332, 6064)  | 1.314.01 Le
2   (1848, 6119, 2648, 6215)
3   (2901, 4062, 3223, 4164)  03 X 45°
4   (1029, 577, 1510, 665)
5   (8511, 2174, 8895, 2267)  188
6   (6735, 306, 7311, 411)    —e| [a—( 188 )
7   (1732, 3857, 2147, 3941)  — w=! 64 ba
8   (3571, 508, 4259, 604)    | |e ——_ 188+.003
9   (1069, 1827, 1666, 1940)  @D .615+.002\n\nLa
10  (2349, 5867, 2629, 5952)
11  (2409, 3895, 3120, 3987)  —e| -— .382+.003
12  (4672, 2200, 5422, 2320)  a 2.487+.002 ——=
13  (7402, 3622, 7733, 3817)  30°
14  (8679, 2312, 9175, 2417)
15  (9044, 4051, 9409, 4597)
16  (786, 771, 1275, 853)
17  (1721, 1328, 1869, 1528)
18  (3437, 2321, 3790, 2432)  -| 860
19  (2097, 4084, 2295, 4270)
20  (1159, 4032, 1699, 4153)  3 3/4-10 UNS-2A
21  (3918, 4779, 4131, 4931)  2.973
22  (8506, 531, 8895, 626)    1.595+.002
23  (8901, 997, 9284, 1168)
24  (5060, 1791, 5344, 1954)
25  (7401, 2650, 7823, 2751)  =| 420
26  (1850, 2369, 2072, 2481)  R.125
27  (3355, 651, 3604, 761)    ‘a
28  (8916, 3820, 9281, 4032)
29  (1778, 4305, 1924, 4589)
30  (5060, 1715, 5384, 1950)  ngle: 0.46
31  (984, 1257, 1145, 1370)   a\nA\
32  (7791, 4801, 8101, 4951)
33  (8217, 2315, 9202, 2415)
34  (2343, 5511, 2888, 5609)
35  (8267, 1997, 8656, 2096)
36  (1462, 1665, 1715, 1764)
37  (433, 1303, 546, 1757)
38  (8384, 1476, 8512, 1830)
39  (1517, 5565, 2035, 5665)  332.01-— |
40  (6247, 3077, 6399, 4105)
41  (4327, 2035, 4901, 2181)  (.078 = —
42  (9207, 1376, 9383, 1751)
43  (4671, 4768, 4947, 4940)  |\n03.875
44  (4986, 896, 5622, 1195)   ay\n>
45  (4886, 1063, 5328, 1186)
46  (4751, 1802, 4996, 1954)
47  (3044, 4667, 3238, 4771)  -R.03
48  (8680, 3010, 8963, 3225)
49  (5117, 890, 5620, 1258)
Results saved to ocr_results.csv
Image dimensions: 10200x6600
```

Attachments: new_pred.json, new_pred, requirements.txt
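Until the recognition itself improves, one workaround (a heuristic of mine, not a Tesseract fix) is to post-process the OCR text: on a toleranced engineering drawing, a dimension such as `.615±.002` is misread as `.615+.002` (see rows 8, 9, 11, 12 and 22 above), so a '+' sandwiched between two decimal numbers can be rewritten back to '±':

```python
import re

# Heuristic: in "NUMBER+NUMBER" tolerance notation on a drawing,
# the '+' is very likely a misread '±'. This is an assumption about
# this kind of data, not a general rule.
TOLERANCE = re.compile(r"(\d*\.\d+|\d+)\+(\.\d+)")

def restore_plus_minus(text):
    """Rewrite 'dim+tol' as 'dim±tol' in OCR output."""
    return TOLERANCE.sub(r"\1±\2", text)

print(restore_plus_minus("188+.003"))         # 188±.003
print(restore_plus_minus(".615+.002"))        # .615±.002
print(restore_plus_minus("2.487+.002"))       # 2.487±.002
print(restore_plus_minus("3 3/4-10 UNS-2A"))  # unchanged
```

This obviously cannot recover the cells where the character was dropped entirely, but it cleans up the cases where '±' degraded to '+'.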