ilius / pyglossary

A tool for converting dictionary files aka glossaries. Mainly to help use our offline glossaries in any Open Source dictionary we like on any modern operating system / device.
GNU General Public License v3.0

Writing ABBYY Lingvo DSL files (.dsl)? #340

Closed by michaelbeijer 2 years ago

michaelbeijer commented 2 years ago

Hi there,

It would be amazing to be able to write .dsl files with PyGlossary.

Background: I have a vast collection of small to large glossaries (see: https://beijer.uk/Wordbook.html), and am looking for the best program to manage and search them all in. I currently use LogiTerm Pro. Even though it is extremely powerful (and very expensive), I don't really like it that much. So I am currently looking at GoldenDict (incidentally, I prefer the ‘GoldenDict++OCR’ version), and would like to settle on a single format that:

(1) GoldenDict can read, (2) PyGlossary can write, and (3) is text based (rather than binary).

This seems to leave me with: (1) ABBYY Lingvo dictionaries (.dsl, .dsl.dz, .lsd) (2) Slob dictionaries (.slob files) (3) a few others

I quite like .dsl, but was wondering how hard it would be for you to implement writing them in PyGlossary.

Michael

ilius commented 2 years ago

StarDict is also kind of text-based. The .dict file is a UTF-8 text file without newlines or any delimiter between entries, so it's hard to search in it. It depends on your search tool (if you want to find entries, your search tool needs to read the .idx file as well). But it's probably possible to add a newline between entries in the .dict file (while updating the .idx file) without breaking the glossary.
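To illustrate why a search tool needs the .idx file, here is a minimal Python sketch of a lookup. It assumes the default 32-bit offsets, an uncompressed .dict (not .dict.dz), and a glossary whose entries are plain text with a single type (so there are no per-field type markers). The function names `parse_idx` and `lookup` are made up for this example and are not part of PyGlossary.

```python
import struct


def parse_idx(idx_bytes):
    """Yield (word, offset, size) from StarDict .idx data with 32-bit offsets."""
    pos = 0
    while pos < len(idx_bytes):
        end = idx_bytes.index(b"\x00", pos)  # each headword is NUL-terminated
        word = idx_bytes[pos:end].decode("utf-8")
        # two big-endian unsigned 32-bit integers follow: offset and size
        offset, size = struct.unpack(">II", idx_bytes[end + 1:end + 9])
        yield word, offset, size
        pos = end + 9


def lookup(word, idx_bytes, dict_bytes):
    """Return the definition of `word`, or None if it is not in the index."""
    for w, offset, size in parse_idx(idx_bytes):
        if w == word:
            return dict_bytes[offset:offset + size].decode("utf-8")
    return None


# Tiny in-memory glossary: definitions are stored back to back, no delimiter
dict_bytes = "a fruitan animal".encode("utf-8")
idx_bytes = (
    b"apple\x00" + struct.pack(">II", 0, 7)
    + b"cat\x00" + struct.pack(">II", 7, 9)
)
print(lookup("cat", idx_bytes, dict_bytes))
```

Because the boundaries live only in the .idx file, a plain text search over .dict can find a match but cannot tell which entry it belongs to.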

Almost all rich-text glossary formats that are supported by dictionary applications use HTML. So we have to convert HTML markup to DSL markup which is the real challenge, and requires a whole new library (that we can use in PyGlossary) I think.

ilius commented 2 years ago

This has been requested before on #262.

michaelbeijer commented 2 years ago

Hmm, OK.

By the way, I was also wondering about:

Sdictionary Binary 🔢 .dct ✔
Sdictionary Source 📝 .sdct ✔ (src: https://github.com/ilius/pyglossary)

Apparently, GoldenDict can read:

"SDict dictionaries (.dct)" (i.e. Sdictionary Binary; called "compiled" on http://swaj.net/sdict/create-dicts.html)

I played around with the Sdictionary Source a while back, and actually liked it too. Any chance you could support Writing Sdictionary Source?

ilius commented 2 years ago

We support writing Sdictionary Source. You mean support the binary dct?

Look at this: http://swaj.net/sdict/devel/index.html

The latest version of PTkSdict is 2.0.0rc5, which was created in 2013 (looking at the timestamps of the files), and the timestamp of Sdict-3.0 says it was last modified in 2007!!

Maybe modification timestamps are wrong! But it's a legacy project and format anyway. It's not worth the effort really. We only support reading it for old legacy glossaries.

michaelbeijer commented 2 years ago

Yeah, I meant support writing the binary .dct. However, as you pointed out, it looks pretty dead so not worth your time.

hadingtid commented 2 years ago

For what it's worth @michaelbeijer, I find converting between stardict and txt/tab format using pyglossary to be a pretty acceptable method for amending and expanding glossaries. This can also be quite a quick and easy way to create a glossary from scratch. Of course, if you only need to edit dictionary metadata, this is even easier as the .ifo file can be read and modified as is with a text editor. I find the latter to be the most common scenario, but your needs may well differ.

Concerning Goldendict++ (as opposed to the original Goldendict), there is some pretty credible speculation detailed in this thread that it is spyware and violating the GPL terms of the original. It may also be modifying your glossaries, according to said discussion.

Edit: clarification on my Goldendict++ comment. At the end of the day the decision to install software lies with the individual user, but that decision should ideally be an informed one.

ilius commented 2 years ago

Any Open Source project could be abused this way. As long as you download it from the official repository or website, you have nothing to worry about. But thanks for letting us know.

Telegram's Android app has also been modified, rebranded and used as spyware where I live. We have to be careful about these rebrandings.

ilius commented 2 years ago

This might help: https://github.com/mortalis13/BGL-To-DSL-Converter

dohliam commented 2 years ago

@michaelbeijer DSL is a good format for the reasons you mentioned (e.g., it is one of the few that is actually human-readable and writable as plain text). The basic DSL format is also extremely simple (headword + newline + whitespace + definition = an entry), though it also supports additional features and formatting. I use these dsl-tools to convert a number of different formats into DSL -- see if any of these might be useful to you.
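The "headword + newline + whitespace + definition" rule is simple enough to parse in a few lines. The following is only a sketch of that basic rule: it skips `#`-style header lines and ignores all DSL markup, comments and escapes. The name `parse_basic_dsl` is hypothetical.

```python
def parse_basic_dsl(text):
    """Return a list of (headwords, body_lines) from minimal DSL text."""
    entries = []
    headwords, body = [], []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and #NAME-style header directives
        if line[0] in " \t":
            body.append(line.strip())  # indented line: part of the definition
        else:
            if body:  # a headword after a body starts a new entry
                entries.append((headwords, body))
                headwords, body = [], []
            headwords.append(line.strip())
    if headwords:
        entries.append((headwords, body))
    return entries


sample = (
    '#NAME "Demo"\n'
    "\n"
    "aandeel\n"
    "    share\n"
    "    stake\n"
    "\n"
    "aankondiging\n"
    "    notice\n"
)
for headwords, body in parse_basic_dsl(sample):
    print(headwords, body)
```

Consecutive unindented lines are collected as alternative headwords for the same entry, which is how this sketch treats several headwords sharing one definition.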

jiang-qian commented 2 years ago

I would also love to see DSL output! I wonder whether this program, which includes a basic tool, stardict2dsl, to convert StarDict to DSL with some formatting conversion, would help the developer ilius write a DSL output filter: https://github.com/proteusx/Lectus. As the program plans to use DSL as the main format for a cross-platform, offline dictionary rendered in HTML and displayed in a browser, a DSL output filter would be particularly useful!

ilius commented 2 years ago

I will not implement this, because it's a lossy conversion and will raise a lot of issues.

It's better to convert to Tabfile, Dictfile, LDF, etc. without losing formatting information.

You can still write new entries in DSL, convert them to one of these text formats and combine the text files.
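As a rough illustration of the "combine the text files" step, here is a sketch that turns (headword, definition) pairs into Tabfile-style lines and concatenates several glossaries. The escaping shown (backslash, tab and newline written as `\\`, `\t`, `\n`) is an assumption about the tab-separated format, and the helper names are invented for this example; check the output against what PyGlossary actually produces before relying on it.

```python
def to_tabfile_line(headword, definition):
    """One Tabfile-style line: headword, a tab, then a single-line definition."""
    escaped = (
        definition.replace("\\", "\\\\")  # escape backslashes first
        .replace("\t", "\\t")
        .replace("\n", "\\n")
    )
    return f"{headword}\t{escaped}"


def combine(*glossaries):
    """Concatenate several lists of (headword, definition) into one text."""
    lines = []
    for entries in glossaries:
        lines.extend(to_tabfile_line(w, d) for w, d in entries)
    return "\n".join(lines) + "\n"


old_entries = [("aandeel", "share\nstake")]
new_entries = [("aankondiging", "notice")]
print(combine(old_entries, new_entries), end="")
```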

jiang-qian commented 2 years ago

OK, I understand your reasoning. I will seek help writing to the format elsewhere. Thanks for making pyglossary.

michaelbeijer commented 1 year ago

Ha ha, so I asked ChatGPT to create a little script for me to convert between CSV and DSL, and it did it, and it works. Mind you, this is without any fancy ABBYY Lingvo formatting, but I don't need any of that, just the basic data. Here's the script: https://drive.google.com/file/d/12xOnrcKz9pZcnz1jVuGov1KxFLfoF4ol/view?usp=sharing


jiang-qian commented 1 year ago

Wow, that's impressive! I will try it out! It would be great if it could handle tags and formatting, since the formatting conveys so much information that it would be a big waste to discard it.

I'm not familiar with using ChatGPT for coding. What prompt did you use to generate the code above? Would it be possible to modify the prompt to convert some tags? Usually the tags (like bold or italic) are simply HTML tags, and their correspondence with DSL tags is pretty much one-to-one, so it doesn't look like an impossible task!

Thanks again for doing this!
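For the simple one-to-one cases mentioned above, a sketch might look like the following. It maps a handful of inline HTML tags to the DSL tags `[b]`, `[i]`, `[u]`, `[sub]` and `[sup]`, silently drops every other tag (which is exactly where such a conversion becomes lossy), and escapes only square brackets and backslashes in the text. The tag table and the function names are assumptions for illustration, not a complete converter.

```python
import html
import re

# HTML tag name -> DSL tag name (only simple inline formatting)
TAG_MAP = {
    "b": "b", "strong": "b",
    "i": "i", "em": "i",
    "u": "u", "sub": "sub", "sup": "sup",
}

_TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def _escape_text(text):
    """Decode HTML entities, then backslash-escape [, ] and \\ for DSL."""
    text = html.unescape(text)
    return re.sub(r"([\[\]\\])", r"\\\1", text)


def html_to_dsl(fragment):
    """Convert simple inline HTML to DSL markup; unknown tags are dropped."""
    out = []
    pos = 0
    for m in _TAG_RE.finditer(fragment):
        out.append(_escape_text(fragment[pos:m.start()]))
        closing, name = m.group(1), m.group(2).lower()
        if name in TAG_MAP:
            out.append(f"[{closing}{TAG_MAP[name]}]")
        pos = m.end()
    out.append(_escape_text(fragment[pos:]))
    return "".join(out)


print(html_to_dsl("<b>share</b> (<i>noun</i>)"))
```

Anything beyond these tags (tables, links, CSS classes, images) has no handling here and would need its own rules.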

michaelbeijer commented 1 year ago

@jiang-qian

So here is my latest version of the CSV/TSV to DSL converter (still without any formatting).

https://drive.google.com/file/d/1zXsQkOsWDEXPHIB2PYYk-SeMPd1Y9Vdi/view?usp=share_link

To use ChatGPT for coding, you need to get a subscription, select GPT-4 (the new engine), and then be very specific with what you ask. Sadly, I already deleted the relevant chat, but I asked it something like:

"Create a Python script for me to convert from CSV/TSV to (ABBYY Lingvo) DSL, with the following features: a GUI, import of CSV or TSV, etc."

The latest version of my script takes an input (CSV or TSV) file like this:

Dutch	English	French	German	Spanish	Italian
aan werk bestede tijd, diensttijd	time spent at work, length of service	temps de présence	Anwesenheitszeit	tiempo de presencia	orario di presenza
aandeel	share	quote-part	Anteil, Rate	cuota, parte porporcional	quota parte, quota
aandeel	stake	participation	Teilnahme, Beteiligung	participación	partecipazione
aangesloten lid	affiliated member	affilié (nm)	Mitglieds-	afiliado	iscritto
aankondiging	notice	préavis	Benachrichtigung	preaviso	preavviso
aansprakelijkheden	liabilities	passif	Verbindlichkeiten, Schulden	pasivo	passivo

and produces an output file like this:

#NAME "ADP Human Resources Glossary (nl-en-fr-de-es-it)(nl-en).dsl"
#INDEX_LANGUAGE "Dutch"
#CONTENTS_LANGUAGE "English"

aan werk bestede tijd, diensttijd
    English: time spent at work, length of service
    French: temps de présence
    German: Anwesenheitszeit
    Spanish: tiempo de presencia
    Italian: orario di presenza

aandeel
    English: share
    French: quote-part
    German: Anteil, Rate
    Spanish: cuota, parte porporcional
    Italian: quota parte, quota

aandeel
    English: stake
    French: participation
    German: Teilnahme, Beteiligung
    Spanish: participación
    Italian: partecipazione

aangesloten lid
    English: affiliated member
    French: affilié (nm)
    German: Mitglieds-
    Spanish: afiliado
    Italian: iscritto

aankondiging
    English: notice
    French: préavis
    German: Benachrichtigung
    Spanish: preaviso
    Italian: preavviso

aansprakelijkheden
    English: liabilities
    French: passif
    German: Verbindlichkeiten, Schulden
    Spanish: pasivo
    Italian: passivo

To achieve this I asked ChatGPT something like: Take the name of each column of the CSV/TSV and format the DSL file as follows: [then showed it what I want it to look like]

It is very smart, and can easily add/change features, debug errors, etc. It's crazy really. I'm currently using it to teach myself Python, which is my first proper attempt at learning to code, apart from dabbling in AutoHotkey scripts for years.
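For readers who cannot open the linked files, the conversion described above can be sketched roughly as follows. This is not the script from the links, just a minimal reconstruction of the described behaviour without the GUI: the first column becomes the headword, and every other non-empty column becomes an indented "Column name: value" line. The function name `tsv_to_dsl` is made up for this example.

```python
import csv
import io


def tsv_to_dsl(tsv_text, name):
    """Build basic DSL text from TSV text whose first row names the columns."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = next(reader)
    lines = [
        f'#NAME "{name}"',
        f'#INDEX_LANGUAGE "{header[0]}"',
        f'#CONTENTS_LANGUAGE "{header[1]}"',
        "",
    ]
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip empty rows and rows without a headword
        lines.append(row[0].strip())
        for column, value in zip(header[1:], row[1:]):
            if value.strip():
                lines.append(f"    {column}: {value.strip()}")
        lines.append("")  # blank line between entries
    return "\n".join(lines)


tsv = (
    "Dutch\tEnglish\tFrench\n"
    "aandeel\tshare\tquote-part\n"
    "aankondiging\tnotice\tpréavis\n"
)
print(tsv_to_dsl(tsv, "Demo"))
```

When saving the result, note that DSL files are commonly stored as UTF-16, so something like `open(path, "w", encoding="utf-16")` may be needed; check what your dictionary application expects. The sketch also does no escaping of characters that have special meaning in DSL.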

michaelbeijer commented 1 year ago

PS: I think it should be pretty easy to modify my script to include basic formatting tags. I'll give it a try if I have some free time. You can basically just copy/paste code right into the chat window (as long as your code is not too long, so it doesn't work great for large programs) and ask it to add/change features for you.

jiang-qian commented 1 year ago

Thanks a lot! I don't have the GPT premium account yet, but I'll try to look at the python script to see whether I can understand it! Thank you again for creating this!

michaelbeijer commented 1 year ago

Btw, I recently stumbled across this VERY interesting collection of dictionaries and little tools:

https://cloud.freemdict.com/index.php/s/pgKcDcbSDTCzXCs?path=%2F

see e.g. ‘TXT2DSL’ @ https://cloud.freemdict.com/index.php/s/pgKcDcbSDTCzXCs?path=%2FZ%20(Tools%20related%20to%20Electronic%20Dictionaries%20-%20Scripts%20%26%20Software)

jiang-qian commented 1 year ago

Oh that's fascinating! I'll look through them! Thank you so much for pointing them out to me!