translate-tools / linguist

Translate web pages, highlighted text, Netflix subtitles, private messages, speak the translated text, and save important translations to your personal dictionary to learn words even offline
https://linguister.io
BSD 3-Clause "New" or "Revised" License
670 stars 21 forks

Translate images #310

Open vitonsky opened 1 year ago

vitonsky commented 1 year ago

We need to detect text in images and then translate it.

Extension examples:

vitonsky commented 1 year ago

We can do it locally with https://github.com/robertknight/tesseract-wasm

Demo page: https://robertknight.github.io/tesseract-wasm/

vitonsky commented 1 year ago

Another option is https://tesseract.projectnaptha.com/ as an alternative approach.

vitonsky commented 1 year ago

We also need to research where to find models for other languages.

bbb651 commented 2 months ago

A really compelling use case for this is translating manga (and comics in general; Japanese -> English is just the most common pair). A bunch of browser extensions and tools have appeared lately (Ichigo Reader, OneTranslate, Pikrex, ScanTranslator, Ismanga — just some that I found, in no particular order; I haven't tried any of them and some look a bit sketchy), but as far as I know there aren't any open-source ones. I've been thinking of making my own extension, but if we can work together to implement this in Linguist, it would reduce effort, and a single extension is a better experience from a user's perspective. I'm interested in what you think. It might still be beneficial to have a separate dedicated extension because of some differences/challenges that I'm not sure Linguist can or wants to deal with:

I have looked into possible libraries/models that could be used. From my experience, Tesseract is really not well suited for this (or for anything beyond scanning printed documents): it cannot deal with rotated text that isn't 90/180/270 degrees, and it gets characters wrong very often even in perfect conditions. PaddleOCR looks very promising, probably through ONNX using Paddle2ONNX, which can run on the web, even with WebGPU. (I originally tried to use ocr-browser, which wraps the ONNX model and has a demo, but its modules and TypeScript setup seem broken in multiple ways and I couldn't even get it to build; we should just use ONNX directly anyway.) It can be paired with StyleText to preserve text style.

There is a list of models in English and in Chinese (the Chinese list contains some models missing from the English version, notably the v4 models). I have found ch_PP-OCRv4_det_infer.onnx (detection, for any language) and japan_PP-OCRv3_rec_infer.onnx (recognition, for Japanese) to be the latest models that Paddle2ONNX can convert.

Sorry for the info dump (I still have some more resources I need to document; somehow all these amazing and really well documented tools and resources for PaddleOCR are in 0-star GitHub repos that I found by chance). Please let me know what you think :)

vitonsky commented 2 months ago

@bbb651 hi, I like that you've mentioned potential solutions for implementing the feature you propose. That is an important skill for a strong researcher.

I like your idea; manga is one of the expected use cases, so if you are really interested in working on image translation, you are welcome to contribute to Linguist.

We could work together on image translation. I would help you integrate any solution into Linguist, and I'm ready to help you understand any details of Linguist's implementation. You could start your work with Linguist, and if you later decide that you need another platform to implement some features, you are free to create another extension.

Modular design is one of the key features of Linguist, so we can integrate any modules, and we could develop image translation as a standalone module that can be reused anywhere outside of Linguist.
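To make the module boundary concrete, here is one possible shape for such a standalone image-translation module (all names here are hypothetical, not Linguist's actual API):

```typescript
// Hypothetical contract for a reusable image-translation module.
// None of these names come from Linguist's code base.
interface TextBox {
  text: string;
  // Bounding box in the image's natural pixel coordinates
  x: number;
  y: number;
  width: number;
  height: number;
}

interface ImageTranslator {
  // Detect text regions and recognize their contents
  recognize(image: ImageBitmap): Promise<TextBox[]>;
  // Translate recognized boxes, preserving their positions
  translate(boxes: TextBox[], from: string, to: string): Promise<TextBox[]>;
}

// A trivial stub showing the shape of the contract
const identityTranslator: ImageTranslator = {
  async recognize() {
    return [];
  },
  async translate(boxes) {
    return boxes;
  },
};
```

An interface like this would let the OCR backend (tesseract-wasm, PaddleOCR via ONNX, or anything else) be swapped without touching the extension code that draws overlays.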

Regarding the points you mentioned above:

To be fair with you, I want to mention some specifics of Linguist development.

If you are ready to accept these principles, then you are welcome as a contributor. In that case, you can write me an email to coordinate our plans and to get to know the code base.

bbb651 commented 2 months ago

Great! I’ll probably only be able to make substantial contributions in about a month, I might tackle some smaller additions like a dark mode for the extension pages in the meantime to get familiar with the codebase.

Regarding DOM changes: since we already know the exact bounds of text in the image (and, by extension, the bounds of translated text), it would be really nice to layer invisible spans over the text to make the text in the image selectable and copyable, like "Live Text" on iOS/macOS. This seems hard to do without risking breaking pages, since most images are img tags, which are void elements. On the other hand, if we're already doing the OCR work, Linguist would be the place to do it (behind an option, of course).
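The coordinate math behind such an overlay is simple: each OCR box just needs scaling from the image's natural pixel size to its rendered size before being used to absolutely position a span (a sketch; the `Box` shape is hypothetical):

```typescript
// Map an OCR bounding box from natural image coordinates to the
// rendered size of the <img>, for absolutely-positioned overlay spans.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

function scaleBox(
  box: Box,
  naturalWidth: number,
  naturalHeight: number,
  renderedWidth: number,
  renderedHeight: number,
): Box {
  const sx = renderedWidth / naturalWidth;
  const sy = renderedHeight / naturalHeight;
  return {
    x: box.x * sx,
    y: box.y * sy,
    width: box.width * sx,
    height: box.height * sy,
  };
}
```

Since img is a void element, the spans would have to live in a separate overlay layer positioned above the image rather than inside it, which is where the risk of breaking page layout comes in.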

For browser support: unfortunately, WebGPU is currently available only in Chrome on Windows/macOS (not Linux) without enabling developer flags/editions, so we definitely need a fallback if we use the WebGPU backend. I still think it's worth doing, as it's reasonable to expect support from all major browsers in the future, and the performance gain is massive compared to CPU (there's also a WebGL2 backend that is worth exploring).
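The fallback chain could be a small feature-detection helper along these lines (backend names match onnxruntime-web's execution providers; `navigator.gpu` is the standard WebGPU detection point — the navigator-like object is passed in so the logic stays testable outside a browser):

```typescript
// Pick an inference backend based on what the browser exposes.
type Backend = "webgpu" | "webgl" | "wasm";

function pickBackend(nav: { gpu?: unknown }, hasWebGL2: boolean): Backend {
  if (nav.gpu) return "webgpu"; // WebGPU available (Chrome on Windows/macOS today)
  if (hasWebGL2) return "webgl"; // middle ground worth exploring
  return "wasm"; // CPU fallback, works everywhere
}
```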