soshial / xdxf_makedict

XDXF — an open and free dictionary format, that stores word articles in a structural and semantic way. The most convertible format

How to create a bilingual dictionary entry #44

Open boydkelly opened 2 years ago

boydkelly commented 2 years ago

Hi, thanks for an amazing project. I am interested in this, but I can't see how to make a bilingual entry. In the Rev34.xml file there are 'to' and 'from' language elements in the meta_info, and they indicate translation between the languages en and lv. However, in the ar entries I don't see anything identified as lv. There does seem to be a translation of 'Home', but it appears to be in Russian, with no language specified. Can you point to any other example? Thanks!

soshial commented 1 year ago

Here's an example I created specially for you. I hope it's not too late.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE xdxf SYSTEM "xdxf_strict.dtd">
<xdxf revision="034">
    <meta_info>
        <languages>
            <from xml:lang="en"/>
            <to xml:lang="eo"/>
            <to xml:lang="lv-LV"/>
            <to xml:lang="es-ES"/>
            <to xml:lang="zh-cmn-Hant-TW"/>
        </languages>
        <title>Multilingual dictionary</title>
        <full_title>Example of a multilingual dictionary</full_title>
        <description>
            This dictionary shows how to compile a dictionary with one-to-many language translation.
            The "k" tag is in English and is translated into Esperanto, Spanish (Spain), Spanish (Argentina) and Mandarin Chinese (Traditional, Taiwan).
        </description>
        <file_ver>v1.1b</file_ver>
        <creation_date>15-01-2023</creation_date>
        <last_edited_date>15-01-2023</last_edited_date>
    </meta_info>
    <lexicon>
        <ar>
            <k xml:lang="en">cell phone</k>
            <def>
                <def xml:lang="es-ES">
                    <deftext>móvil</deftext>
                </def>
                <def xml:lang="es-AR">
                    <deftext>celular</deftext>
                </def>
                <def xml:lang="eo">
                    <deftext>poŝtelefono</deftext>
                </def>
                <def xml:lang="zh-cmn-Hant-TW">
                    <deftext>手機</deftext>
                </def>
            </def>
        </ar>
    </lexicon>
</xdxf>

soshial commented 1 year ago

I noticed that you have figured out how to create such entries in your dictionary. Let me know if there are cases that my scheme doesn't handle or describe well.

boydkelly commented 1 year ago

Yes, thank you very much! It is generally working very well. You noticed I added a couple of extra tags to the DTD. This was really for my use case. I needed to uniquely identify definitions and examples for use with Anki flash cards, and it has also been useful for importing into the neo4j graph database. So I have a uuid field for each definition. For example (from French to English): verres: glasses; lunettes: glasses. OK, this is maybe a dumb example; it's what my brain came up with right now. But there are a lot of situations in the language I am working with where this many-to-one relationship occurs. In hindsight, maybe I could have just used the co tag. But the uuid has worked well.

Same for examples where the same example phrase may be used with several word definitions contained therein. This enabled me to create a question for each time the example phrase occurs in the dictionary but giving a different 'hint' each time for the phrase meaning.

You may also have noticed that I am working with a language from West Africa. This is Jula, which is not very well documented, and there are many spelling variations using either phonetics or French phonemes. In addition to the spelling variations, this is a tonal language, so it has been challenging to keep the 'headword' in the local language unique. (I know I can use a comment for that too.) But I was also tempted to add a uuid to the ar or k tag.

I have made extensive use of kref/spv, but now I am dealing with many situations where multiple words share the same kref/spv. This is entirely an issue with my use case, but I have awk scripts that search existing docs and add HTTP links to the definitions of words. They also search the spv, and may then link back to the wrong definition. This is not an XDXF issue at all; I just want to let you know about my challenges here.
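
One way to guard against linking to the wrong definition, sketched under assumptions: before linking, list every spelling that resolves to more than one headword, so the linker can skip or flag it. The two-column tab-separated input (spelling, then headword id) and the `ambiguous_spellings` name are hypothetical; they are not the format used by the scripts mentioned above.

```shell
#!/bin/sh
# Hypothetical input on stdin: one "spelling<TAB>headword-id" pair per line.
# Output: each spelling that maps to two or more distinct headword ids,
# followed by those ids, tab-separated, in first-seen order.
ambiguous_spellings() {
  awk -F '\t' '
    !seen[$1, $2]++ {              # ignore exact duplicate pairs
      if ($1 in ids) {
        ids[$1] = ids[$1] "\t" $2  # another headword for a known spelling
      } else {
        order[++n] = $1            # remember first-seen order of spellings
        ids[$1] = $2
      }
      count[$1]++
    }
    END {
      for (i = 1; i <= n; i++)
        if (count[order[i]] > 1) print order[i] "\t" ids[order[i]]
    }
  '
}
```

Spellings that appear in this list could be left unlinked, or linked to a disambiguation list, rather than to whichever definition is found first.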

One item that I might have found useful is source and author tags for the ar (as you have for examples). Again, that would not be so useful for well-documented languages. In my case, I have noted separately where I 'heard' a certain word.

And finally, since XML is difficult to work with, especially for non-technical people, I have been working with YAML and converting to XML (and sometimes back). But the conversions (especially back to YAML) are not perfect.

I'd love to get a conversion going directly from XDXF to neo4j. It's possible, but the easy route for me right now is to do an XSLT transform to CSV and then import that.

I will post back what I may come up with.

Thanks!

soshial commented 1 year ago

> You noticed I added a couple of extra tags to the DTD. This was really for my use case. I needed to uniquely identify definitions and examples for use with Anki flash cards, and it has also been useful for importing into the neo4j graph database. So I have a uuid field for each definition. For example (from French to English): verres: glasses; lunettes: glasses. OK, this is maybe a dumb example; it's what my brain came up with right now. But there are a lot of situations in the language I am working with where this many-to-one relationship occurs. In hindsight, maybe I could have just used the co tag. But the uuid has worked well.

According to the DTD, it is possible to assign IDs to both <k> and <def> via the id attribute. I wonder why you needed to create your own def-id attribute. In the case of "verres: glasses vs lunettes: glasses", both glasses should have a kref tag with an idref attribute. I am trying to understand which exact use case led you to use your own def-id instead of the id attribute.

> Same for examples where the same example phrase may be used with several word definitions contained therein. This enabled me to create a question for each time the example phrase occurs in the dictionary but giving a different 'hint' each time for the phrase meaning.

In the DTD it's currently not possible to add IDs to examples. Would you be so kind as to point to specific use cases in GitLab (like this one: https://gitlab.com/ci-dict/dyu-xdxf/-/blob/main/mandenkan/dict.xdxf#L29687-29692) where it's needed?

boydkelly commented 1 year ago

> According to the DTD, it is possible to assign IDs to both <k> and <def> via the id attribute. I wonder why you needed to create your own def-id attribute. In the case of "verres: glasses vs lunettes: glasses", both glasses should have a kref tag with an idref attribute. I am trying to understand which exact use case led you to use your own def-id instead of the id attribute.

It's been a while... I remember now that I had tried to use that id attribute, but I believe it wouldn't accept a uuid as a valid value (I was already using uuids for my Anki cards). I'd actually like to use that ID, though. I'll get back to you on the exact error.
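
One plausible cause, offered as an assumption since the exact error has not been posted: if the id attribute is declared with type ID in the DTD, its value must be an XML Name, and a Name cannot begin with a digit. A UUID that starts with a digit would then be rejected by a validating parser. The sketch below is a simplified, ASCII-only check (the real Name rule also allows colons and many non-ASCII characters), plus a hypothetical `to_xml_id` helper that prefixes an underscore; neither is part of XDXF.

```shell
#!/bin/sh
# Simplified, ASCII-only approximation of the XML Name rule for ID values:
# the first character is a letter or underscore; later characters may also
# be digits, '.' or '-'.
is_xml_id() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z_][A-Za-z0-9._-]*$'
}

# Hypothetical helper: always prefix an underscore, so every UUID gets the
# same shape whether or not it happens to start with a letter.
to_xml_id() {
  printf '_%s\n' "$1"
}

to_xml_id "9db7c703-6df2-4bc3-ac29-fb5601417eeb"
```

Stripping the leading underscore recovers the original UUID, so the Anki card identifiers would not need to change.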

boydkelly commented 1 year ago

> In the DTD it's currently not possible to add IDs to examples. Would you be so kind as to point to specific use cases in GitLab (like this one: https://gitlab.com/ci-dict/dyu-xdxf/-/blob/main/mandenkan/dict.xdxf#L29687-29692) where it's needed?

Yes, there are lots of examples. But I don't think this is a limitation of the spec. It's really just how I am using the data for question-and-answer flash cards.

I have an example phrase, "It's not yours!", used in two different definitions: the definition of the word 'not', and also the word 'yours'.

I need to keep them unique so the phrase is presented to the user twice: on one card with a hint providing the meaning of the word 'yours' and on another card with a hint for the word 'not'.

The uuid produced the required results.

9db7c703-6df2-4bc3-ac29-fb5601417eeb > tá = mien, sien, nôtre, vôtre > i ta tɛ! > Ce n'est pas le tien! > -
73707939-e5ca-4153-b16c-6f97ed2f7010 > tɛ́ = (négation) > i ta tɛ+! > Ce n'est pas le tien+! >

See: https://coastsystems.net/docs/fr/slides/5words/

soshial commented 1 year ago

On an unrelated note, some comments on your XDXF file:

  1. I believe you might have missed <?xml version="1.0" encoding="UTF-8" ?> and <!DOCTYPE xdxf SYSTEM "xdxf_strict.dtd"> at the beginning of the file.
  2. You might have confused <creation_date> and <last_edited_date>.

boydkelly commented 1 year ago

Thanks! You are really keeping me on the ball. Actually, since I maintain the dictionary in YAML and convert to XML on every change, I had temporarily commented out those lines while doing some rearranging. I have put them back, but I was validating against the DTD via script anyway.

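# Convert the YAML source to XML
yq -x < "$yml" > "$dict"

#tidy -q -m -xml -indent $dict-
# Prepend the DOCTYPE first, then the XML declaration, so the declaration ends up on line 1
sed -i '1 i <!DOCTYPE xdxf SYSTEM "xdxf_strict.dtd">' "$dict"
sed -i '1 i <?xml version="1.0" encoding="UTF-8" ?>' "$dict"
# Validate the result against the DTD
xmllint --noout --dtdvalid "$project/xdxf_strict.dtd" "./$dict"

ln -f "$dict" "./$project/$xml"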

For the dates, yes, I was not actually paying too much attention. I will have to automate inserting the current date every time I save or convert my file.
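
A minimal sketch of that automation, assuming the dates stay in the DD-MM-YYYY form used in the example earlier in this thread and that `<last_edited_date>` sits on a single line. `stamp_last_edited` is a hypothetical helper name, not part of any tool mentioned here.

```shell
#!/bin/sh
# Hypothetical helper: set <last_edited_date> in an XDXF file.
# Usage: stamp_last_edited FILE [DATE]   (DATE defaults to today, DD-MM-YYYY)
stamp_last_edited() {
  file=$1
  today=${2:-$(date +%d-%m-%Y)}
  tmp=$(mktemp) || return 1
  # Rewrite only the text between the tags; all other lines pass through.
  # Writing back with cat keeps the same inode, so hard links to the file
  # keep pointing at the updated content.
  sed "s|<last_edited_date>[^<]*</last_edited_date>|<last_edited_date>${today}</last_edited_date>|" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Calling `stamp_last_edited "$dict"` right after the conversion step, and before validation, would keep the date current on every run while leaving `<creation_date>` untouched.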

You were also asking about the change I made to categ in the DTD. I am using the categ element for 'tags'. I noted that categ relates to Wikipedia or something similar, and I thought it could be repurposed for my use. The change was so that I could use it as a list element to tag or 'categorize' definitions (People, Calendar, Work, etc.). Again, this was not so much for the dictionary as for the Anki cards I produce from the file.

Just as an FYI, I maintain all these scripts in a CI pipeline on GitLab, so when I make a change to the YAML file, it converts to XDXF and updates the dictionary, quiz, and Anki cards in one shot!

(My neo4j project is temporarily offline. I have to get back to making some adjustments there.)

https://coastsystems.net/docs/fr/lexique-dyu/