[Bug report] Smallest Text Detection elements for Chinese are lines, not characters

craigomac commented 2 years ago

Describe the bug: According to the documentation at https://developers.google.com/ml-kit/vision/text-recognition/v2:

an Element is a contiguous set of alphanumeric characters ("word") on the same axis in most Latin languages, or a character in others

In my testing, a TextRecognizer created using ChineseTextRecognizerOptions yields Elements that are whole lines, and not characters.

To Reproduce: GoogleOCRDemo.zip

The attached sample app performs recognition on Chinese text and lists each element found, prefixed by a number.

Open the app and tap "Recognize".
After a moment, the recognized elements are listed in a scrolling text view. Observe that each element contains multiple characters, not one as the documentation indicates.
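The observation in the steps above can be checked mechanically. The following is a minimal sketch, not code from the attached sample app: it assumes the strings have already been collected from each recognized element's `text` property, and the helper name `multiCharacterElements(in:)` is hypothetical.

```swift
/// Hypothetical helper: returns every element string that holds more than one
/// character, together with its 1-based position in the list.
/// Assumption: `elementTexts` was gathered from the recognizer result beforehand.
func multiCharacterElements(in elementTexts: [String]) -> [(index: Int, text: String, count: Int)] {
    var offenders: [(index: Int, text: String, count: Int)] = []
    for (offset, text) in elementTexts.enumerated() where text.count > 1 {
        // `count` counts user-perceived characters, so one Chinese character is 1.
        offenders.append((index: offset + 1, text: text, count: text.count))
    }
    return offenders
}
```

If elements were single characters as the documentation describes, the returned array would be empty for Chinese input. With illustrative strings (made up for this sketch, not taken from the app) such as `["你好世界", "字"]`, only the first entry is reported.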

Expected behavior: I expect each element to represent a single Chinese character. This is very useful for applications where it's desirable to enable text selection atop recognised text. It also matches the behaviour of the Tesseract API, and of Apple's OCR frameworks.
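Until elements are reported per character, a text-selection overlay could approximate character boxes by subdividing each element's frame. This is a rough sketch under stated assumptions, not SDK behavior: it assumes horizontal, unrotated text and equal-width glyphs, and the function name `approximateCharacterFrames(text:frame:)` is hypothetical. Mixed Latin and Chinese text or punctuation would make the boxes inaccurate.

```swift
import Foundation

/// Hypothetical workaround: splits one element's frame into equal-width boxes,
/// one per character. Assumes horizontal text and uniform glyph width.
func approximateCharacterFrames(text: String, frame: CGRect) -> [(character: Character, frame: CGRect)] {
    let characters = Array(text)
    guard !characters.isEmpty else { return [] }

    // Each character receives an equal share of the element's width.
    let width = frame.size.width / CGFloat(characters.count)
    var result: [(character: Character, frame: CGRect)] = []
    for (index, character) in characters.enumerated() {
        let box = CGRect(x: frame.origin.x + CGFloat(index) * width,
                         y: frame.origin.y,
                         width: width,
                         height: frame.size.height)
        result.append((character: character, frame: box))
    }
    return result
}
```

For example, a four-character element with a frame 80 points wide yields four boxes of 20 points each, laid out left to right from the frame's origin.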

SDK Info: pod 'GoogleMLKit/TextRecognitionChinese', '2.6.0'

Smartphone: iPhone 12

Development Environment:

Xcode 13.4.1
macOS 12.4

craigomac commented 2 years ago

Hello, is this the right place to raise issues like this? If not, I'm happy to file a duplicate elsewhere.

miworking3 commented 2 years ago

This is a known issue and it is fixable, but we haven't planned a release for it yet.

miworking3 commented 1 year ago

I'm afraid that, with the shift in our team's priorities, there is no plan to address this in the near future.

googlesamples / mlkit

[Bug report] Smallest Text Detection elements for Chinese are lines, not characters #535