tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

'±' is recognised as '+' #4286

Open · DominicMukilan opened 3 months ago

DominicMukilan commented 3 months ago

Current Behavior

No response

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

tesseract v5.4.0.20240606 leptonica-1.84.1 libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2 Found AVX2 Found AVX Found FMA Found SSE4.1 Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6

Operating System

Windows 11

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

Tesseract recognises '±' as '+'. In some places it fails to recognise the character at all.

Python 3.12

stweil commented 3 months ago

Which model / language did you use?

DominicMukilan commented 3 months ago

Python 3.12, IDE: PyCharm

reprex:

```python
import json
import cv2
import pytesseract
from PIL import Image
import pandas as pd

# Load the JSON file
json_path = "new_pred.json"
with open(json_path, "r") as file:
    annotations = json.load(file)

# Extract all coordinates without filtering
coordinates = [annotation["box"] for annotation in annotations]

# Load the image
image_path = "new_pred.jpg"
image = cv2.imread(image_path)

# Load the image with PIL to get its dimensions
image_pil = Image.open(image_path)
image_width, image_height = image_pil.size

# Crop image regions based on coordinates and perform OCR with boundary check
def crop_and_ocr_with_boundary_check(image, coordinates, image_width, image_height):
    ocr_results = []
    skipped_coordinates = []
    for i, (x1, y1, x2, y2) in enumerate(coordinates):
        # Adjust the coordinates to be within image boundaries
        original_coords = (x1, y1, x2, y2)
        x1 = max(0, min(x1, image_width - 1))
        y1 = max(0, min(y1, image_height - 1))
        x2 = max(0, min(x2, image_width))
        y2 = max(0, min(y2, image_height))

        # Check if the box is too small
        if x2 - x1 < 5 or y2 - y1 < 5:
            skipped_coordinates.append((i, original_coords, "Too small"))
            continue

        # Crop the region from the image
        cropped_img = image[y1:y2, x1:x2]

        # Perform OCR on the cropped image
        text = pytesseract.image_to_string(cropped_img)

        # Append the OCR result
        ocr_results.append({
            "coordinates": (x1, y1, x2, y2),
            "text": text.strip()  # Remove leading/trailing whitespace
        })

    return ocr_results, skipped_coordinates

# Perform OCR on the annotated regions with boundary check
ocr_results, skipped_coordinates = crop_and_ocr_with_boundary_check(
    image, coordinates, image_width, image_height)

# Convert OCR results to a DataFrame
ocr_df = pd.DataFrame(ocr_results)

# Print debugging information
print(f"Total annotations in JSON: {len(annotations)}")
print(f"Total OCR results: {len(ocr_results)}")
print(f"Skipped coordinates: {len(skipped_coordinates)}")
for skip in skipped_coordinates:
    print(f"  Index: {skip[0]}, Coordinates: {skip[1]}, Reason: {skip[2]}")

# Display the DataFrame
print(ocr_df)

# Optionally, save the results to a CSV file
ocr_df.to_csv("ocr_results.csv", index=False)
print("Results saved to ocr_results.csv")

# Print image dimensions
print(f"Image dimensions: {image_width}x{image_height}")
```
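The boundary check inside the loop above can be pulled out into a small, independently testable helper; a minimal sketch (the function name `clamp_box` and the `min_size` parameter are my own, not from the original script):

```python
def clamp_box(box, image_width, image_height, min_size=5):
    """Clamp an (x1, y1, x2, y2) box to the image bounds.

    Returns the clamped box, or None if the clamped box is
    smaller than min_size pixels in either dimension.
    """
    x1, y1, x2, y2 = box
    x1 = max(0, min(x1, image_width - 1))
    y1 = max(0, min(y1, image_height - 1))
    x2 = max(0, min(x2, image_width))
    y2 = max(0, min(y2, image_height))
    if x2 - x1 < min_size or y2 - y1 < min_size:
        return None
    return (x1, y1, x2, y2)

# A box hanging over the right edge of a 10200x6600 image is clamped:
print(clamp_box((10100, 100, 10400, 200), 10200, 6600))  # (10100, 100, 10200, 200)
# A degenerate box is rejected:
print(clamp_box((50, 50, 52, 52), 10200, 6600))          # None
```

Factoring the clamp out this way makes it easy to assert on edge cases without loading any image.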

stweil commented 3 months ago

Please also add your image (or its URL if it is online) to this issue report.

DominicMukilan commented 3 months ago

Python output:

```
Total annotations in JSON: 50
Total OCR results: 50
Skipped coordinates: 0
                 coordinates              text
0   (1763, 5732, 2293, 5861)
1   (1785, 5974, 2332, 6064)  | 1.314.01 Le
2   (1848, 6119, 2648, 6215)
3   (2901, 4062, 3223, 4164)  03 X 45°
4   (1029, 577, 1510, 665)
5   (8511, 2174, 8895, 2267)  188
6   (6735, 306, 7311, 411)    —e| [a—( 188 )
7   (1732, 3857, 2147, 3941)  — w=! 64 ba
8   (3571, 508, 4259, 604)    | |e ——_ 188+.003
9   (1069, 1827, 1666, 1940)  @D .615+.002\n\nLa
10  (2349, 5867, 2629, 5952)
11  (2409, 3895, 3120, 3987)  —e| -— .382+.003
12  (4672, 2200, 5422, 2320)  a 2.487+.002 ——=
13  (7402, 3622, 7733, 3817)  30°
14  (8679, 2312, 9175, 2417)
15  (9044, 4051, 9409, 4597)
16  (786, 771, 1275, 853)
17  (1721, 1328, 1869, 1528)
18  (3437, 2321, 3790, 2432)  -| 860
19  (2097, 4084, 2295, 4270)
20  (1159, 4032, 1699, 4153)  3 3/4-10 UNS-2A
21  (3918, 4779, 4131, 4931)  2.973
22  (8506, 531, 8895, 626)    1.595+.002
23  (8901, 997, 9284, 1168)
24  (5060, 1791, 5344, 1954)
25  (7401, 2650, 7823, 2751)  =| 420
26  (1850, 2369, 2072, 2481)  R.125
27  (3355, 651, 3604, 761)    ‘a
28  (8916, 3820, 9281, 4032)
29  (1778, 4305, 1924, 4589)
30  (5060, 1715, 5384, 1950)  ngle: 0.46
31  (984, 1257, 1145, 1370)   a\nA\
32  (7791, 4801, 8101, 4951)
33  (8217, 2315, 9202, 2415)
34  (2343, 5511, 2888, 5609)
35  (8267, 1997, 8656, 2096)
36  (1462, 1665, 1715, 1764)
37  (433, 1303, 546, 1757)
38  (8384, 1476, 8512, 1830)
39  (1517, 5565, 2035, 5665)  332.01-— |
40  (6247, 3077, 6399, 4105)
41  (4327, 2035, 4901, 2181)  (.078 = —
42  (9207, 1376, 9383, 1751)
43  (4671, 4768, 4947, 4940)  |\n03.875
44  (4986, 896, 5622, 1195)   ay\n>
45  (4886, 1063, 5328, 1186)
46  (4751, 1802, 4996, 1954)
47  (3044, 4667, 3238, 4771)  -R.03
48  (8680, 3010, 8963, 3225)
49  (5117, 890, 5620, 1258)
Results saved to ocr_results.csv
Image dimensions: 10200x6600
```

Attachments: new_pred.json, new_pred, requirements.txt
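Until the recognition itself improves, one workaround (a heuristic of mine, not a Tesseract fix) is to post-process the OCR text: on a toleranced engineering drawing, a dimension such as `.615±.002` is misread as `.615+.002` (see rows 8, 9, 11, 12 and 22 above), so a '+' sandwiched between two decimal numbers can be rewritten back to '±':

```python
import re

# Heuristic: in "NUMBER+NUMBER" tolerance notation on a drawing,
# the '+' is very likely a misread '±'. This is an assumption about
# this kind of data, not a general rule.
TOLERANCE = re.compile(r"(\d*\.\d+|\d+)\+(\.\d+)")

def restore_plus_minus(text):
    """Rewrite 'dim+tol' as 'dim±tol' in OCR output."""
    return TOLERANCE.sub(r"\1±\2", text)

print(restore_plus_minus("188+.003"))         # 188±.003
print(restore_plus_minus(".615+.002"))        # .615±.002
print(restore_plus_minus("2.487+.002"))       # 2.487±.002
print(restore_plus_minus("3 3/4-10 UNS-2A"))  # unchanged
```

This obviously cannot recover the cells where the character was dropped entirely, but it cleans up the cases where '±' degraded to '+'.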