mpcabd / python-arabic-reshaper

Reconstruct Arabic sentences to be used in applications that don't support Arabic
MIT License

returning char form if needed. #78

Open tahirrafiqueasad opened 2 years ago

tahirrafiqueasad commented 2 years ago

Your work is very impressive and works best for the Urdu language.

I am working on a project in which I need to know the form of each character. Your code already determines the character form, but there is no dedicated function to get it. I added one more argument to the reshape function; this argument allows the caller to get the character forms if needed.
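To make "character form" concrete, here is a minimal, self-contained sketch of how the positional form (isolated, initial, medial, final) of each letter could be derived. It is not the library's code or API: `letter_forms`, `RIGHT_JOINING`, and `NON_JOINING` are hypothetical names, the letter sets are a small illustrative subset, and diacritics, ligatures, and ZWJ/ZWNJ are ignored.

```python
# Hypothetical helper for illustration only; not part of arabic_reshaper.

# Letters that join only to the preceding letter (illustrative subset).
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629"  # alef variants, waw with hamza, teh marbuta
    "\u062f\u0630\u0631\u0632\u0648"        # dal, thal, reh, zain, waw
    "\u0688\u0691\u0698\u06d2"              # Urdu: ddal, rreh, jeh, yeh barree
)
# Letters that never join to a neighbour.
NON_JOINING = {"\u0621"}  # hamza


def _is_arabic_letter(ch):
    # Rough check: an alphabetic character in the basic Arabic block.
    return "\u0600" <= ch <= "\u06ff" and ch.isalpha()


def _joins_backward(ch):
    # True if the letter can connect to the letter before it.
    return _is_arabic_letter(ch) and ch not in NON_JOINING


def _joins_forward(ch):
    # True if the letter can connect to the letter after it.
    return _joins_backward(ch) and ch not in RIGHT_JOINING


def letter_forms(word):
    """Return one of 'isolated', 'initial', 'medial', 'final' per character."""
    forms = []
    for i, ch in enumerate(word):
        prev_ch = word[i - 1] if i > 0 else ""
        next_ch = word[i + 1] if i + 1 < len(word) else ""
        joins_prev = _joins_backward(ch) and _joins_forward(prev_ch)
        joins_next = _joins_forward(ch) and _joins_backward(next_ch)
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaf, teh, alef, beh (the word "kitab")
print(letter_forms("\u06a9\u062a\u0627\u0628"))
# ['initial', 'medial', 'final', 'isolated']
```

Because this is a separate function, it returns the forms without touching what a reshaping call returns.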

mpcabd commented 2 years ago

Thanks Tahir.

I don't really see the use case, nor do I agree with the implementation (i.e. the extra argument that would change the return type). Please explain the use case in more detail, and let's think about a better interface implementation to solve the case.

tahirrafiqueasad commented 2 years ago

It has a very important use in training a word detector (a machine learning model that detects the words in an image), which is the first module of an OCR (Optical Character Recognition) pipeline https://towardsdatascience.com/a-gentle-introduction-to-ocr-ee1469a201aa. Most detector models are trained on English, because English words have separate characters. In Urdu and Arabic words, the characters are not separate. In order to train a model (especially CRAFT https://github.com/clovaai/CRAFT-pytorch) for Urdu and Arabic, we would need character-level annotations, which are not possible. Instead of character-level annotations, we will use word-part-level annotations. Examples are given below for better understanding:

Character Annotation: [image: Screenshot from 2022-03-09 22-16-53.png]

Part Annotation: [image: Screenshot from 2022-03-09 22-18-48.png]

Word Annotation: [image: Screenshot from 2022-03-09 22-19-48.png]

The work on word-level annotation has already been done by Adavoudi https://github.com/adavoudi/SynthText. But word-level annotation is not suitable for training the model. Character-level annotation is good, but sometimes a character's annotation overlaps other characters because the characters are connected. So the best strategy is to use part-level annotation. We can produce this type of annotation if we know the start and end character of each part. This is where your library comes into play: if it provides this extra information, we will be able to produce part-level annotations.
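As a rough illustration of the part-level idea, the sketch below splits a word into its connected parts and reports the start and end index of each. It is only a sketch under assumptions: `part_spans`, `RIGHT_JOINING`, and `NON_JOINING` are hypothetical names (not the library's API), the letter sets are a small subset, and the input is assumed to be a single word without spaces or diacritics.

```python
# Hypothetical helper for illustration only; not part of arabic_reshaper.

# Letters that never connect to the following letter, so a connected
# part always ends after them (illustrative subset).
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629"  # alef variants, waw with hamza, teh marbuta
    "\u062f\u0630\u0631\u0632\u0648"        # dal, thal, reh, zain, waw
    "\u0688\u0691\u0698\u06d2"              # Urdu: ddal, rreh, jeh, yeh barree
)
# Letters that connect to neither neighbour; each forms a part by itself.
NON_JOINING = {"\u0621"}  # hamza


def part_spans(word):
    """Return (start, end) index pairs, end exclusive, one per connected part."""
    spans = []
    start = 0
    for i, ch in enumerate(word):
        if ch in NON_JOINING:
            # Close the part that was open before this letter, if any,
            # then emit the letter as a part of its own.
            if start < i:
                spans.append((start, i))
            spans.append((i, i + 1))
            start = i + 1
        elif ch in RIGHT_JOINING:
            # This letter ends the current part.
            spans.append((start, i + 1))
            start = i + 1
    if start < len(word):
        # Trailing letters that were never closed form the last part.
        spans.append((start, len(word)))
    return spans


word = "\u06a9\u062a\u0627\u0628"  # kaf, teh, alef, beh
spans = part_spans(word)
print(spans)  # [(0, 3), (3, 4)]
parts = [word[s:e] for s, e in spans]  # kaf+teh+alef, then beh
```

Each span gives the first and last character of one part, which is the information a part-level bounding box would need.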

-- Regards, Muhammad Tahir Rafique