translate-tools / linguist

Translate web pages, highlighted text, Netflix subtitles, private messages, speak the translated text, and save important translations to your personal dictionary to learn words even offline
https://linguister.io
BSD 3-Clause "New" or "Revised" License
670 stars 21 forks

Translate images #310

Open vitonsky opened 1 year ago

vitonsky commented 1 year ago

We need to detect text in images and then translate it.

Extension examples:

vitonsky commented 1 year ago

We can do it locally with https://github.com/robertknight/tesseract-wasm

Demo page: https://robertknight.github.io/tesseract-wasm/

vitonsky commented 1 year ago

Another option is https://tesseract.projectnaptha.com/ as an alternative approach.

vitonsky commented 1 year ago

We also need to research where to find models for other languages.

bbb651 commented 2 months ago

A really compelling use case for this is translating manga (and comics in general; Japanese -> English is just the most common pair). A bunch of browser extensions and tools have appeared lately (Ichigo Reader, OneTranslate, Pikrex, ScanTranslator, Ismanga — just some that I found, in no particular order; I haven't tried any of them and some look a bit sketchy), but as far as I know there aren't any open-source ones. I've been thinking of making my own extension, but if we can work together to implement this in Linguist, it would reduce effort, and a single extension is a better experience from a user's perspective. I'm interested in what you think. It might still be beneficial to have a separate dedicated extension because of some differences/challenges that I'm not sure Linguist can or wants to deal with:

I have looked into possible libraries/models that could be used. From my experience, Tesseract is really not well suited for this (or for anything beyond scanning printed documents): it cannot deal with rotated text that isn't 90/180/270 degrees, and it gets characters wrong very often even in perfect conditions. PaddleOCR looks very promising, probably through ONNX using Paddle2ONNX, which can run on the web, even with WebGPU. (I originally tried to use ocr-browser, which wraps the ONNX model and has a demo, but its modules and TypeScript setup seem broken in multiple ways and I couldn't even get it to build; we should just use ONNX directly anyway.) It can be paired with StyleText to preserve text style.

There is a list of models in English and in Chinese (the Chinese list contains some models missing from the English version, notably the v4 models). I have found ch_PP-OCRv4_det_infer.onnx (detection, for any language) and japan_PP-OCRv3_rec_infer.onnx (recognition, for Japanese) to be the latest models that Paddle2ONNX can convert.

Sorry for the info dump (I still have some more resources I need to document; somehow all these amazing and really well documented tools and resources for PaddleOCR are in 0-star GitHub repos that I found by chance). Please let me know what you think :)

vitonsky commented 2 months ago

@bbb651 hi, I like that you've mentioned potential solutions for implementing the feature you propose. That is an important skill for a strong researcher.

I like your idea; manga is one of the expected use cases, so if you are really interested in working on image translation, you are welcome to contribute to Linguist.

We could work together on image translation. I would help you integrate any solution into Linguist, and I'm ready to help you understand any details of Linguist's implementation. You could start your work with Linguist, and if you later decide that you need another platform to implement some features, you are free to create another extension.

Modular design is one of the key features of Linguist, so we can integrate any modules, and we could develop image translation as a standalone module that can be reused anywhere outside of Linguist.
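To make the module boundary concrete, here is one possible shape for such a standalone image-translation module (all names here are hypothetical, not Linguist's actual API):

```typescript
// Hypothetical contract for a reusable image-translation module.
// None of these names come from Linguist's code base.
interface TextBox {
  text: string;
  // Bounding box in the image's natural pixel coordinates
  x: number;
  y: number;
  width: number;
  height: number;
}

interface ImageTranslator {
  // Detect text regions and recognize their contents
  recognize(image: ImageBitmap): Promise<TextBox[]>;
  // Translate recognized boxes, preserving their positions
  translate(boxes: TextBox[], from: string, to: string): Promise<TextBox[]>;
}

// A trivial stub showing the shape of the contract
const identityTranslator: ImageTranslator = {
  async recognize() {
    return [];
  },
  async translate(boxes) {
    return boxes;
  },
};
```

An interface like this would let the OCR backend (tesseract-wasm, PaddleOCR via ONNX, or anything else) be swapped without touching the extension code that draws overlays.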

Regarding the points you mentioned above:

To be fair with you, I want to mention some specifics of Linguist development.

If you are ready to accept these principles, then you are welcome as a contributor. In that case, you can write me an email to coordinate our plans and to get to know the code base.

bbb651 commented 2 months ago

Great! I’ll probably only be able to make substantial contributions in about a month, I might tackle some smaller additions like a dark mode for the extension pages in the meantime to get familiar with the codebase.

Regarding DOM changes: since we already know the exact bounds of text in the image (and, by extension, the bounds of translated text), it would be really nice to layer invisible spans over the text to make the text in the image selectable and copyable, like "Live Text" on iOS/macOS. This seems hard to do without risking breaking pages, since most images are img tags, which are void elements. On the other hand, if we're already doing the OCR work, Linguist would be the place to do it (behind an option, of course).
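The coordinate math behind such an overlay is simple: each OCR box just needs scaling from the image's natural pixel size to its rendered size before being used to absolutely position a span (a sketch; the `Box` shape is hypothetical):

```typescript
// Map an OCR bounding box from natural image coordinates to the
// rendered size of the <img>, for absolutely-positioned overlay spans.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

function scaleBox(
  box: Box,
  naturalWidth: number,
  naturalHeight: number,
  renderedWidth: number,
  renderedHeight: number,
): Box {
  const sx = renderedWidth / naturalWidth;
  const sy = renderedHeight / naturalHeight;
  return {
    x: box.x * sx,
    y: box.y * sy,
    width: box.width * sx,
    height: box.height * sy,
  };
}
```

Since img is a void element, the spans would have to live in a separate overlay layer positioned above the image rather than inside it, which is where the risk of breaking page layout comes in.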

For browser support: unfortunately, WebGPU is currently available only in Chrome on Windows/macOS (not Linux) without enabling developer flags/editions, so we definitely need a fallback if we use the WebGPU backend. I still think it's worth doing, as it's reasonable to expect support from all major browsers in the future, and the performance gain is massive compared to CPU (there's also a WebGL2 backend that is worth exploring).
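The fallback chain could be a small feature-detection helper along these lines (backend names match onnxruntime-web's execution providers; `navigator.gpu` is the standard WebGPU detection point — the navigator-like object is passed in so the logic stays testable outside a browser):

```typescript
// Pick an inference backend based on what the browser exposes.
type Backend = "webgpu" | "webgl" | "wasm";

function pickBackend(nav: { gpu?: unknown }, hasWebGL2: boolean): Backend {
  if (nav.gpu) return "webgpu"; // WebGPU available (Chrome on Windows/macOS today)
  if (hasWebGL2) return "webgl"; // middle ground worth exploring
  return "wasm"; // CPU fallback, works everywhere
}
```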