tisfeng / Easydict

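A concise and elegant dictionary and translation app for macOS, for looking up words and translating text. It works out of the box, supports offline OCR, and supports Youdao Dictionary, 🍎 Apple System Dictionary, 🍎 Apple System Translation, OpenAI, Gemini, DeepL, Google, Bing, Tencent, Baidu, Alibaba, NiuTrans, Caiyun, and Volcano Translation.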
GNU General Public License v3.0
7.48k stars 378 forks

🚀 Feature request: source text preprocessing before translation #86

Closed tshu-w closed 7 months ago

tshu-w commented 1 year ago

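Please confirm the following first

Feature description

Preprocess the source text before translation. Text selected for translation is often broken across several lines, or, in PDFs, hyphenated with "-" at line ends. I suggest adding a source-text preprocessing option like the one in Bob.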

Screenshot 2023-05-05 at 18 56 53

Use case (optional)

No response

Implementation (optional)

No response

github-actions[bot] commented 1 year ago

Hello tshu-w, Thank you for your first issue contribution 🎉

tisfeng commented 1 year ago

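Preprocessing the source text before translation sounds like a feature worth having.

What exactly is the scenario? Could you give a few concrete examples?

Text selected for translation is often broken across several lines, or, in PDFs, hyphenated with "-" at line ends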

tshu-w commented 1 year ago

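Preprocessing the source text before translation sounds like a feature worth having.

What exactly is the scenario? Could you give a few concrete examples?

Text selected for translation is often broken across several lines, or, in PDFs, hyphenated with "-" at line ends

Examples:

  1. Plain-text email or other hard-wrapped layouts (line breaks need to be converted to spaces) Screenshot 2023-05-05 at 21 12 04

    Text after copying: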

This paper evaluates the viability of using fixed language models for
training text classification networks on low-end hardware. We combine language
models with a CNN architecture and put together a comprehensive benchmark with
8 datasets covering single-label and multi-label classification of topic,
sentiment, and genre. Our observations are distilled into a list of trade-offs,
concluding that there are scenarios, where not fine-tuning a language model
yields competitive effectiveness at faster training, requiring only a quarter
of the memory compared to fine-tuning.
  2. PDF files (the "- " sequence, a hyphen followed by a space, needs to be removed) Screenshot 2023-05-05 at 21 14 24

    Text after copying:

Methods of machine learning belong to the standard reper- toire of any data analytics endeavour nowadays. However many machine learning algorithms rely on input in the form of dense numerical vectors, which is in stark contrast to the conventional representation of knowledge graphs. To make KGs usable for machine learning tasks Knowledge Graph Embedding approaches are used to encode KG entities (and sometimes relationships) into a lower-dimensional space.
While there are different paradigms of algorithms most embedding approaches score the plausibility of a given tri- ple (h, r, t), i.e. how likely is this statement to be true. The goal of the algorithm is then to compute the embeddings in such a way that positive examples (triples contained in the
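
The two cases above can be sketched in a few lines of Python. This is only an illustration of the requested behaviour, not how Easydict or Bob implement it; the function names `newlines_to_spaces` and `remove_pdf_hyphenation` are made up for this example.

```python
import re

def newlines_to_spaces(text: str) -> str:
    """Replace each run of line breaks (plus adjacent spaces/tabs) with one space."""
    return re.sub(r"[ \t]*(?:\r\n|\r|\n)+[ \t]*", " ", text).strip()

def remove_pdf_hyphenation(text: str) -> str:
    """Rejoin words split like 'reper- toire'.

    Only a hyphen that sits between two lowercase letters and is followed by
    whitespace is removed, so 'fine-tuning' is left alone.
    """
    return re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", text)

# Fragments taken from the two examples above.
wrapped = "not fine-tuning a language model\nyields competitive effectiveness"
pdf = "belong to the standard reper- toire of any data analytics endeavour"

print(newlines_to_spaces(wrapped))
print(remove_pdf_hyphenation(pdf))
```

One known limitation of this heuristic: a deliberate suspended hyphen such as "pre- and post-processing" would also be joined, so a real implementation would probably want this to be optional.
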
tisfeng commented 1 year ago

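The second one, removing the "- " (hyphen plus space) in PDFs, I understand.

For the first one, were you capturing the text with OCR, and it did not handle the line breaks properly? Or was the text copied directly from the email, with extra line breaks that need to be handled?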

tshu-w commented 1 year ago

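For the first one, were you capturing the text with OCR, and it did not handle the line breaks properly? Or was the text copied directly from the email, with extra line breaks that need to be handled?

The example above is text copied directly from an email. Also, in Markdown or LaTeX a single line break does not start a new paragraph; only a blank line does. I also tried OCR on the second screenshot above, and it has the same problem: every line of text ends with a line break.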
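
The Markdown/LaTeX convention described here (a single line break is just whitespace, a blank line starts a new paragraph) could be handled roughly as follows. This is a hedged sketch; `unwrap_paragraphs` is a hypothetical name, not an Easydict function.

```python
import re

def unwrap_paragraphs(text: str) -> str:
    """Join the lines inside each paragraph with a space, keeping blank lines
    as paragraph separators."""
    normalized = text.replace("\r\n", "\n").strip()
    # A paragraph break is a newline followed by one or more blank
    # (empty or whitespace-only) lines.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", normalized)
    # str.split() with no argument splits on any whitespace run,
    # including single line breaks, so joining with " " unwraps the lines.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

sample = "This paper evaluates\nthe viability.\n\nOur observations\nare distilled."
print(unwrap_paragraphs(sample))
```

Compared with replacing every line break, this keeps paragraph boundaries intact, which matters when several paragraphs are selected at once.
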

tisfeng commented 1 year ago

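I have not used LaTeX, so I do not fully understand this. In the case below, how would you like it to be handled? Convert the line breaks to spaces?

In Markdown or LaTeX a single line break does not start a new paragraph; only a blank line does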

tisfeng commented 1 year ago

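OCR line-break handling in version 1.3.0 is sometimes wrong; I will keep improving the algorithm step by step.

The latest code already handles this case, and I will publish a new release shortly.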

image
tshu-w commented 1 year ago

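I have not used LaTeX, so I do not fully understand this. In the case below, how would you like it to be handled? Convert the line breaks to spaces?

In Markdown or LaTeX a single line break does not start a new paragraph; only a blank line does

My current thinking is to do what Bob does and convert line breaks to spaces. (A more advanced option would be to let users configure the replacement, but that feels a bit too complex.)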
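
For reference, the "user-configurable replacement" idea does not have to be complicated: it can be an ordered list of (pattern, replacement) pairs. The sketch below is an assumption about what such an option might look like; `DEFAULT_RULES` and `preprocess` are hypothetical names and do not come from Easydict or Bob.

```python
import re
from typing import Sequence, Tuple

Rule = Tuple[str, str]  # (regular expression, replacement)

# Hypothetical defaults, applied in order.
DEFAULT_RULES: Sequence[Rule] = (
    (r"(?<=[a-z])-\s+(?=[a-z])", ""),  # rejoin words hyphenated at a line end
    (r"\s*\n\s*", " "),                # line break -> single space
)

def preprocess(text: str, rules: Sequence[Rule] = DEFAULT_RULES) -> str:
    """Apply each replacement rule in turn, then trim the result."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()

print(preprocess("the standard reper-\ntoire of any\ndata analytics endeavour"))
```

A user who wants different behaviour passes their own rules, for example `preprocess(text, [(r"\n", "")])` to delete line breaks outright instead of replacing them with spaces.
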

tisfeng commented 1 year ago

Understood. I will consider this later on.

tisfeng commented 8 months ago

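Yesterday I ran into a use case for replacing line breaks with spaces: https://www.mail-archive.com/xz-devel@tukaani.org/msg00566.html

The quick action menu happens to be finished now, so this feature can be scheduled.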

Progress will not happen until there is new maintainer. XZ for C has sparse 
commit log too. Dennis you are better off waiting until new maintainer happens 
or fork yourself. Submitting patches here has no purpose these days. The 
current maintainer lost interest or doesn't care to maintain anymore. It is sad 
to see for a repo like this.
image
tisfeng commented 7 months ago

This feature is implemented in version 2.7.0.

tshu-w commented 7 months ago

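@tisfeng Hi, I could not find where this is configured (the button in the image above is also gone in the latest version). Is it enabled by default?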

tisfeng commented 7 months ago

As far as I remember, the code enables it by default. Take a look at this option on the settings page.

image
tshu-w commented 7 months ago

Thanks, it showed up after I closed and reopened it.