Closed michaelbeijer closed 2 years ago
StarDict is also kinda text-based. The .dict file is a UTF-8 text file without newlines or any delimiter between entries, so it's hard to search in it. It depends on your search tool (if you want to find entries, your search tool needs to read .idx as well).
But it's probably possible to add a newline between entries in the .dict file (while updating the .idx file) without breaking the glossary.
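To illustrate why a search tool needs the .idx file, here is a minimal sketch of how entries are located. It assumes an uncompressed .dict (not .dict.dz), the default 32-bit offsets (no idxoffsetbits=64 in the .ifo), and plain-text definitions. The function names and the sample data are made up for illustration; this is not PyGlossary's reader.

```python
import struct


def parse_idx(idx_bytes: bytes) -> list[tuple[str, int, int]]:
    """Parse a StarDict .idx blob into (headword, offset, size) triples.

    Assumes 32-bit offsets, i.e. no idxoffsetbits=64 in the .ifo file.
    """
    entries = []
    pos = 0
    while pos < len(idx_bytes):
        end = idx_bytes.index(b"\x00", pos)  # headword is NUL-terminated
        word = idx_bytes[pos:end].decode("utf-8")
        # Two big-endian unsigned 32-bit integers follow the NUL byte:
        # the entry's offset into .dict and its length in bytes.
        offset, size = struct.unpack(">II", idx_bytes[end + 1 : end + 9])
        entries.append((word, offset, size))
        pos = end + 9
    return entries


def lookup(word: str, idx_bytes: bytes, dict_bytes: bytes):
    """Return the definition for `word`, or None if it is not indexed."""
    for headword, offset, size in parse_idx(idx_bytes):
        if headword == word:
            return dict_bytes[offset : offset + size].decode("utf-8")
    return None


# Tiny in-memory glossary (hypothetical data). Note that the two
# definitions sit back to back in .dict with no delimiter between them.
dict_bytes = "a fruita domestic animal".encode("utf-8")
idx_bytes = (
    b"apple\x00" + struct.pack(">II", 0, 7)
    + b"cat\x00" + struct.pack(">II", 7, 17)
)
print(lookup("cat", idx_bytes, dict_bytes))
```

Inserting a newline between entries would mean shifting every later offset in .idx by one byte per inserted newline, which is why both files have to be rewritten together.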
Almost all rich-text glossary formats that are supported by dictionary applications use HTML. So we would have to convert HTML markup to DSL markup, which is the real challenge and, I think, requires a whole new library (that we can use in PyGlossary).
This has been requested before on #262.
Hmm, OK.
By the way, I was also wondering about:
Sdictionary Binary 🔢 .dct ✔
Sdictionary Source 📝 .sdct ✔
(src: https://github.com/ilius/pyglossary)
Apparently, GoldenDict can read:
"SDict dictionaries (.dct)" (i.e. Sdictionary Binary; called "compiled" on http://swaj.net/sdict/create-dicts.html)
I played around with the Sdictionary Source a while back, and actually liked it too. Any chance you could support Writing Sdictionary Source?
We support writing Sdictionary Source. You mean support the binary dct?
Look at this: http://swaj.net/sdict/devel/index.html
The latest version of PTkSdict is 2.0.0rc5, which was created in 2013 (looking at the timestamps of the files).
And the timestamp of Sdict-3.0 says it was last modified in 2007!!
Maybe modification timestamps are wrong! But it's a legacy project and format anyway. It's not worth the effort really. We only support reading it for old legacy glossaries.
Yeah, I meant support writing the binary .dct. However, as you pointed out, it looks pretty dead so not worth your time.
For what it's worth @michaelbeijer, I find converting between stardict and txt/tab format using pyglossary to be a pretty acceptable method for amending and expanding glossaries. This can also be quite a quick and easy way to create a glossary from scratch. Of course, if you only need to edit dictionary metadata, this is even easier as the .ifo file can be read and modified as is with a text editor. I find the latter to be the most common scenario, but your needs may well differ.
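As a rough illustration of the metadata case: an .ifo file is a header line followed by key=value lines, so a small helper can do what a text editor would. This is only a sketch; `set_ifo_field` and the sample values are made up for the example.

```python
def set_ifo_field(ifo_text: str, key: str, value: str) -> str:
    """Return .ifo text with `key` set to `value`, adding the line if missing."""
    lines = ifo_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(key + "="):
            lines[i] = f"{key}={value}"
            break
    else:
        # No existing line for this key: append one at the end.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical .ifo contents for demonstration.
sample_ifo = (
    "StarDict's dict ifo file\n"
    "version=2.4.2\n"
    "bookname=Old Name\n"
    "wordcount=2\n"
    "idxfilesize=26\n"
)
updated = set_ifo_field(sample_ifo, "bookname", "HR Glossary (nl-en)")
print(updated)
```

Fields such as wordcount and idxfilesize describe the other files, so they should only be changed when the glossary itself is regenerated.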
Concerning Goldendict++ (as opposed to the original Goldendict), there is some pretty credible speculation detailed in this thread that it is spyware and violating the GPL terms of the original. It may also be modifying your glossaries, according to said discussion.
Edit: clarification on my Goldendict++ comment. At the end of the day the decision to install software lies with the individual user, but that decision should ideally be an informed one.
Any Open Source project could be abused this way. As long as you download it from the official repository or website, you have nothing to worry about. But thanks for letting us know.
Telegram's Android app has also been modified, rebranded and used as spyware where I live. We have to be careful about these rebrandings.
This might help: https://github.com/mortalis13/BGL-To-DSL-Converter
@michaelbeijer DSL is a good format for the reasons you mentioned (e.g., it is one of the few that is actually human-readable and writable as plain text). The basic DSL format is also extremely simple (headword + newline + whitespace + definition = an entry), though it also supports additional features and formatting. I use these dsl-tools to convert a number of different formats into DSL; see if any of these might be useful to you.
I would also love to see DSL output! I wonder whether this program, which includes a basic stardict2dsl tool that converts StarDict to DSL with some formatting conversion, would help the developer ilius write a DSL output filter: https://github.com/proteusx/Lectus As that program plans to use DSL as the main format for a cross-platform, offline dictionary rendered in HTML and displayed in the browser, a DSL output filter would be particularly useful!
I will not implement this, because it's a lossy conversion and will raise a lot of issues.
It's better to convert to Tabfile, Dictfile, LDF, etc. without losing formatting information.
You can still write new entries in DSL, convert them to one of these text formats and combine the text files.
OK, I understand your reasoning. I will seek help writing to the format elsewhere. Thanks for making PyGlossary.
Ha ha, so I asked ChatGPT to create a little script for me to convert between CSV and DSL, and it did it, and it works. Mind you, this is without any fancy ABBYY Lingvo formatting, but I don't need any of that, just the basic data. Here's the script: https://drive.google.com/file/d/12xOnrcKz9pZcnz1jVuGov1KxFLfoF4ol/view?usp=sharing
Wow, that's impressive! I will try it out! It would be great if it could handle tags and formatting, since the formatting conveys so much information that it would be a big waste to discard it.
I'm not familiar with using ChatGPT for coding. What prompt did you use to generate the code above? Would it be possible to modify the prompt to convert some tags? Usually the tags (like bold or italic) are simply HTML tags, and their correspondence with DSL tags is pretty much one-to-one, so it doesn't look like an impossible task!
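To make the one-to-one idea concrete, here is a rough sketch of such a tag mapping. The table below is my own small selection of pairs, not a complete list. The backslash-escaping of literal square brackets is my understanding of DSL and should be checked against the reader you use. Anything without a mapping is simply dropped, which is exactly the lossy part, and HTML entities such as `&amp;` are not handled at all.

```python
import re

# Assumed HTML -> DSL tag pairs for illustration only; DSL has more tags.
HTML_TO_DSL = {
    "b": "b", "strong": "b",
    "i": "i", "em": "i",
    "u": "u", "sup": "sup", "sub": "sub",
}

# Matches an opening or closing HTML tag, with or without attributes.
_TAG_RE = re.compile(r"<(/?)(\w+)[^>]*>")


def html_to_dsl(text: str) -> str:
    """Convert a few simple HTML tags to DSL tags; drop all other tags."""
    # Escape literal brackets first so they are not read as DSL tags.
    text = text.replace("[", "\\[").replace("]", "\\]")

    def swap(match: re.Match) -> str:
        closing, name = match.group(1), match.group(2).lower()
        dsl = HTML_TO_DSL.get(name)
        if dsl is None:
            return ""  # no known DSL equivalent: the tag is lost
        return f"[{closing}{dsl}]"

    return _TAG_RE.sub(swap, text)


print(html_to_dsl("<b>aandeel</b> <i>n.</i> share <span class='x'>[fin.]</span>"))
```

Nested structures, CSS classes, links and images are where the real difficulty lies, so this only covers the easy cases.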
Thanks again for doing this!
So here is my latest version of the CSV/TSV to DSL converter (still without any formatting).
https://drive.google.com/file/d/1zXsQkOsWDEXPHIB2PYYk-SeMPd1Y9Vdi/view?usp=share_link
To use ChatGPT for coding, you need to get a subscription, select GPT-4 (the new engine), and then be very specific with what you ask. Sadly, I already deleted the relevant chat, but I asked it something like:
"Create a Python script for me to convert from CSV/TSV to (ABBYY Lingvo) DSL, with the following features: a GUI, import of CSV or TSV, etc."
The latest version of my script takes an input (CSV or TSV) file like this (shown here tab-separated):
Dutch	English	French	German	Spanish	Italian
aan werk bestede tijd, diensttijd	time spent at work, length of service	temps de présence	Anwesenheitszeit	tiempo de presencia	orario di presenza
aandeel	share	quote-part	Anteil, Rate	cuota, parte porporcional	quota parte, quota
aandeel	stake	participation	Teilnahme, Beteiligung	participación	partecipazione
aangesloten lid	affiliated member	affilié (nm)	Mitglieds-	afiliado	iscritto
aankondiging	notice	préavis	Benachrichtigung	preaviso	preavviso
aansprakelijkheden	liabilities	passif	Verbindlichkeiten, Schulden	pasivo	passivo
and produces an output file like this:
#NAME "ADP Human Resources Glossary (nl-en-fr-de-es-it)(nl-en).dsl"
#INDEX_LANGUAGE "Dutch"
#CONTENTS_LANGUAGE "English"
aan werk bestede tijd, diensttijd
    English: time spent at work, length of service
    French: temps de presence
    German: Anwesenheitszeit
    Spanish: tiempo de presencia
    Italian: orario di presenza
aandeel
    English: share
    French: quote-part
    German: Anteil, Rate
    Spanish: cuota, parte porporcional
    Italian: quota parte, quota
aandeel
    English: stake
    French: participation
    German: Teilnahme, Beteiligung
    Spanish: participacion
    Italian: partecipazione
aangesloten lid
    English: affiliated member
    French: affilie (nm)
    German: Mitglieds-
    Spanish: afiliado
    Italian: iscritto
aankondiging
    English: notice
    French: preavis
    German: Benachrichtigung
    Spanish: preaviso
    Italian: preavviso
aansprakelijkheden
    English: liabilities
    French: passif
    German: Verbindlichkeiten, Schulden
    Spanish: pasivo
    Italian: passive
To achieve this I asked ChatGPT something like: Take the name of each column of the CSV/TSV and format the DSL file as follows: [then showed it what I want it to look like]
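For anyone who cannot open the Drive link, the core of the idea fits in a few lines. This is a minimal sketch written for this thread, not the linked script: there is no GUI, it handles TSV only, and it does not escape special DSL characters in headwords. The function name `tsv_to_dsl` and its parameters are made up for the example, and the file encoding a given dictionary reader expects is not covered here.

```python
import csv
import io


def tsv_to_dsl(tsv_text: str, name: str, index_language: str,
               contents_language: str) -> str:
    """Turn TSV text (first column = headword) into basic DSL text."""
    # QUOTE_NONE keeps any double quotes in cells as literal characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)  # column names become the labels in each entry
    lines = [
        f'#NAME "{name}"',
        f'#INDEX_LANGUAGE "{index_language}"',
        f'#CONTENTS_LANGUAGE "{contents_language}"',
        "",
    ]
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank rows and rows without a headword
        lines.append(row[0].strip())
        for column_name, cell in zip(header[1:], row[1:]):
            if cell.strip():
                # Body lines must start with whitespace in DSL.
                lines.append(f"\t{column_name}: {cell.strip()}")
    return "\n".join(lines) + "\n"


# Small sample taken from the first three columns of the table above.
sample = (
    "Dutch\tEnglish\tFrench\n"
    "aandeel\tshare\tquote-part\n"
    "aankondiging\tnotice\tpréavis\n"
)
print(tsv_to_dsl(sample, "Demo (nl-en)", "Dutch", "English"))
```

Empty cells are skipped, so an entry only lists the languages that actually have a translation.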
It is very smart, and can easily add/change features, debug errors, etc. It's crazy really. I'm currently using it to teach myself Python, which is my first proper attempt at learning to code, apart from dabbling in AutoHotkey scripts for years.
PS: I think it should be pretty easy to modify my script to include basic formatting tags. I'll try if I have some free time. You can basically just copy/paste code right into the chat window (as long as your code is not too long, so it doesn't work great for large programs) and ask it to add/change features for you.
Thanks a lot! I don't have the GPT premium account yet, but I'll try to look at the python script to see whether I can understand it! Thank you again for creating this!
Btw, I recently stumbled across this VERY interesting collection of dictionaries and little tools:
https://cloud.freemdict.com/index.php/s/pgKcDcbSDTCzXCs?path=%2F
see e.g. ‘TXT2DSL’ @ https://cloud.freemdict.com/index.php/s/pgKcDcbSDTCzXCs?path=%2FZ%20(Tools%20related%20to%20Electronic%20Dictionaries%20-%20Scripts%20%26%20Software)
Oh that's fascinating! I'll look through them! Thank you so much for pointing them out to me!
Hi there,
It would be amazing to be able to write .dsl files with PyGlossary.
Background: I have a vast collection of small to large glossaries (see: https://beijer.uk/Wordbook.html), and am looking for the best program to manage/search them all in. I currently use LogiTerm Pro. Even though it is extremely powerful (and very expensive), I don't really like it that much. Anyway, I am currently looking at GoldenDict (incidentally, I prefer the ‘GoldenDict++OCR’ version), and so would like to settle on a single format that:
(1) GoldenDict can read, (2) PyGlossary can write, and (3) is text based (rather than binary).
This seems to leave me with: (1) ABBYY Lingvo dictionaries (.dsl, .dsl.dz, .lsd) (2) Slob dictionaries (.slob files) (3) a few others
I quite like .dsl, but was wondering how hard it would be for you to implement writing them in PyGlossary.
Michael