Several issues with pronunciation

IvanUkhov commented 8 years ago

Hello,

The following screenshot demonstrates a number of issues with pronunciation:

the sj sound gets replaced by the dollar sign,
the underscore gets replaced by the plus sign, and
the superscript in the second alternative is not properly formatted.

Regarding the first problem, try to look up words with the tj sound like tjugo; the sound will be erroneously represented by the letter c.

Regards, Ivan

hashier commented 8 years ago

What do you mean by "the superscript in the second alternative is not properly formatted"?

IvanUkhov commented 8 years ago

@hashier, there are two possible pronunciations, but accent 2 (grave) is denoted correctly only in the first one. So, it should be like this (just as on Folkets lexikon):

[²sj'o:r_ta el. ²sj'or_t:a]

Note the second “²”. While we’re at it, why did you decide to denote stress by capitalizing letters instead of using the traditional notation? Thanks!

IvanUkhov commented 8 years ago

@hashier, sorry for picking on details. I just think that pronunciation is the most important part of the language, and it’s also the one that is the most difficult to master. It’s of great help to be able to clearly see how to pronounce words. I wish Dictionary had sound.

hashier commented 8 years ago

Ah, I see what you mean with the 2.

I didn't pick anything; I just used what was in the dataset that I got from Folkets lexikon. Since I always find it hard to read anyway, I never realised that it is completely wrong (:

I checked what the "data" is for the pronunciation of skjorta:

<word class="nn" lang="sv" value="skjorta">
   <translation value="shirt" />
   <phonetic soundFile="skjorta.swf" value="²$O:r+ta el. 2$Or+t:a" />
[...]

Seems like it's already broken in that file, so I guess there is nothing we can do to fix it.

IvanUkhov commented 8 years ago

Hmm, the interesting thing is that their web interface pulls data from the same database, and this “broken” representation is exactly what it gets to work with. For instance, here is the server’s response for “skjorta”:

//OK[6,0,0,1,5,4,2,3,0,0,1,2,2,0,0,1,["se.algoritmica.folkets.client.LookUpResult/1089098233","[I/2970817851","[Ljava.lang.String;/2600011424","<word class=\"nn\" date=\"2011-03-03\" id=\"158400\" lang=\"sv\" lexinid=\"15841\" origin=\"lexin\" value=\"skjorta\"><translation date=\"2011-03-03\" id=\"15559\" value=\"shirt\"></translation><phonetic date=\"2011-03-03\" soundFile=\"skjorta.swf\" value=\"²$O:r+ta el. 2$Or+t:a\"></phonetic><paradigm date=\"2011-03-03\" id=\"13806\" origin=\"lexin\"><inflection value=\"skjortan\"></inflection><inflection value=\"skjortor\"></inflection></paradigm><see date=\"2011-03-03\" origin=\"saldo\" type=\"saldo\" value=\"skjorta||skjorta..1||skjorta..nn.1\"></see><compound date=\"2011-03-03\" id=\"5537\" value=\"bomullsskjorta\"><translation value=\"cotton shirt\"></translation></compound><compound date=\"2011-03-03\" id=\"5538\" inflection=\"skjort|kragen\" value=\"skjort|krage\"><translation value=\"shirt collar\"></translation></compound><idiom date=\"2011-03-03\" id=\"1358\" value=\"kosta skjortan (&amp;quot;kosta väldigt mycket&amp;quot;)\"><translation value=\"cost a packet (&amp;quot;cost very much&amp;quot;)\"></translation></idiom><definition date=\"2011-03-03\" id=\"15341\" value=\"ett tunnare klädesplagg med krage, ärmar och knäppning fram\"></definition><url date=\"2011-03-03\" origin=\"lexin\" type=\"any\" value=\"8/herr.swf\"></url></word>","<word value=\"shirt\" lang=\"en\" class=\"nn\" id=\"379721\" origin=\"lexin\" date=\"2009-02-24\"><translation id=\"379721-1\" value=\"skjorta\" origin=\"lexin\" date=\"2009-02-24\"></translation><example id=\"379721-2\" value=\"Tom put on a clean white shirt and a tie.\" origin=\"lexin\" date=\"2009-02-24\"><translation value=\"Tom satte på sig en ren vit skjorta och en slips.\" origin=\"lexin\" date=\"2009-02-24\"></translation></example><explanation value=\"A piece of clothing with collar, sleeves and buttons down the front.\" origin=\"lexin\" date=\"2009-02-24\"></explanation></word>","skjorta"],0,7]

If you scroll to the right, you’ll see exactly what you wrote above. So, I guess, there’s some post-processing on the client side that makes it look pretty.
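For anyone who wants to pull the raw phonetic string out of such a response, here is a minimal sketch. It assumes that everything after the "//OK" prefix parses as JSON, which seems to be the case for the response quoted above but may not hold in general. The name extractPhonetic and the shortened sample string are made up for illustration.

```javascript
// Hypothetical helper: returns the value attribute of the first <phonetic>
// element found in a lookup response, or null if there is none.
// Assumption: the text after the "//OK" prefix is valid JSON.
function extractPhonetic(response) {
  var prefix = '//OK';
  if (response.slice(0, prefix.length) !== prefix) {
    throw new Error('unexpected response prefix');
  }
  var payload = JSON.parse(response.slice(prefix.length));
  // The XML fragments sit in the nested array of strings.
  for (var i = 0; i < payload.length; i++) {
    if (!Array.isArray(payload[i])) continue;
    for (var j = 0; j < payload[i].length; j++) {
      var match = /<phonetic\b[^>]*\bvalue="([^"]*)"/.exec(payload[i][j]);
      if (match) return match[1];
    }
  }
  return null;
}

// Shortened stand-in for the real response; only the phonetic element is kept.
var sample = '//OK[1,["<word value=\\"skjorta\\"><phonetic soundFile=\\"skjorta.swf\\" value=\\"²$O:r+ta el. 2$Or+t:a\\"></phonetic></word>","skjorta"],0,7]';
console.log(extractPhonetic(sample)); // → ²$O:r+ta el. 2$Or+t:a
```

The regular expression is a shortcut that works for this flat attribute layout; a real XML parser would be the safer choice for anything more involved.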

hashier commented 8 years ago

How did you make the request?

I don't think they do any post-processing. I assume their DB -> XML is "broken" and their homepage is not using DB -> XML but something else. Of course, if they use the same interface that you used, then I have no idea how they fix it.

IvanUkhov commented 8 years ago

If you have Chrome,

open the skjorta page,
open the Developer tools,
go to the Network tab,
reload the page,
select “lookupword” in the list on the left-hand side, and
go to the Response tab.

I’m not claiming that that’s how they do it; I really have no idea. Maybe there’s some other mechanism, which is intentionally hidden.

hashier commented 8 years ago

Nah, that's just us talking to their web server; that's not how they talk internally to their DB.

But it's interesting that it shows the correct stuff on the homepage in the end... maybe reading the JavaScript that handles the response would solve this problem, but who's got time to do that (:

IvanUkhov commented 8 years ago

I’ve found the code that seems to be doing the translation. Unfortunately, it’s heavily obfuscated and pretty much useless:

function Rbb(b) {
    var i, j;
    Arb[i = ++Brb] = Rbb;
    Crb[i] = KXb + sMb, Fbb();
    var c, d, e, f, g;
    f = new(Crb[i] = KXb + ZRb, pZ)(utb);
    g = OY((Crb[i] = KXb + '547', b));
    for (Crb[i] = KXb + SEb, d = 0, e = g.length;
        (Crb[i] = KXb + SEb, d) < e; Crb[i] = KXb + SEb, ++d) {
        c = g[d];
        switch (Crb[i] = KXb + TEb, c) {
            case 50:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + tCb, f).a).a += '\xB2';
                break;
            case 43:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + WEb, f).a).a += '_';
                break;
            case 64:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + wCb, f).a).a += 'ng';
                break;
            case 99:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + yCb, f).a).a += 'tj';
                break;
            case 36:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + XEb, f).a).a += 'sj';
                break;
            case 65:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + jZb, f).a).a += "'a";
                break;
            case 69:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + zCb, f).a).a += "'e";
                break;
            case 73:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + aFb, f).a).a += "'i";
                break;
            case 79:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + aGb, f).a).a += "'o";
                break;
            case 85:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + xOb, f).a).a += "'u";
                break;
            case 89:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + NEb, f).a).a += "'y";
                break;
            case 197:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + _Rb, f).a).a += "'\xE5";
                break;
            case 196:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + dMb, f).a).a += "'\xE4";
                break;
            case 214:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + PRb, f).a).a += "'\xF6";
                break;
            default:
                Crb[i] = dvb + ptb, (Crb[i] = tTb + Lwb, (Crb[i] = KXb + fGb, f).a).a += (Crb[i] = xub + kBb, (Crb[i] = xub + kBb, String).fromCharCode((Crb[i] = KXb + fGb, c)));
        }
    }
    j = (Crb[i] = dvb + Gtb, (Crb[i] = tTb + _vb, (Crb[i] = KXb + ePb, f).a).a);
    Brb = i - 1;
    return j
}
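A note on reading the function above: most of the noise comes from the comma operator. Each "Crb[i] = ..." assignment looks like bookkeeping (possibly call tracing added by the compiler, which is an assumption), and every parenthesised group simply evaluates to its last operand. The sketch below uses made-up names, trace for Crb and builder for f, to show that such a statement boils down to a plain string append.

```javascript
// Illustrative names only: `trace` stands in for Crb, `builder` for f.
var trace = [];
var builder = { a: { a: '' } };

// The comma operator evaluates both operands and yields the last one, so
// the inner group reduces to `builder` and the outer one to `builder.a`.
trace[0] = 'step 1', (trace[0] = 'step 2', (trace[0] = 'step 3', builder).a).a += '\xB2';

// The same kind of append with the bookkeeping removed.
builder.a.a += '_';

console.log(builder.a.a); // → ²_
```

Read this way, only the case labels (character codes) and the appended strings carry meaning.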

IvanUkhov commented 8 years ago

That code indeed works. Here is a more human-friendly version:

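// Maps the char code of each raw symbol to its display form.
var mapping = {
  50: '\xB2',     // '2' -> superscript two (accent 2)
  43: '_',        // '+'
  64: 'ng',       // '@'
  99: 'tj',       // 'c'
  36: 'sj',       // '$'
  65: "'a",       // 'A' (capital vowels mark stress)
  69: "'e",       // 'E'
  73: "'i",       // 'I'
  79: "'o",       // 'O'
  85: "'u",       // 'U'
  89: "'y",       // 'Y'
  197: "'\xE5",   // 'Å'
  196: "'\xE4",   // 'Ä'
  214: "'\xF6"    // 'Ö'
};

// Replaces every mapped character in text and keeps all others unchanged.
function translate(text) {
  var buffer = "";
  for (var i = 0, length = text.length, next; i < length; i++) {
    next = mapping[text.charCodeAt(i)];
    if (next === undefined) {
      next = text[i];
    }
    buffer += next;
  }
  return buffer;
}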
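As a quick check, the same substitutions can be sketched with String.prototype.replace, keyed by the raw characters instead of their char codes ('$' is code 36, '+' is code 43, and so on). The names replacements and translateWithReplace are made up for this example; the input is the raw value from the skjorta entry quoted earlier in the thread.

```javascript
// Same substitutions as the table above, keyed by the raw characters.
var replacements = {
  '2': '\xB2', '+': '_', '@': 'ng', 'c': 'tj', '$': 'sj',
  'A': "'a", 'E': "'e", 'I': "'i", 'O': "'o", 'U': "'u", 'Y': "'y",
  '\xC5': "'\xE5", '\xC4': "'\xE4", '\xD6': "'\xF6"
};

// Replaces each mapped character in one pass; everything else is untouched.
function translateWithReplace(text) {
  return text.replace(/[2+@c$AEIOUY\xC5\xC4\xD6]/g, function (ch) {
    return replacements[ch];
  });
}

console.log(translateWithReplace('²$O:r+ta el. 2$Or+t:a'));
// → ²sj'o:r_ta el. ²sj'or_t:a
```

The output matches the notation shown for skjorta on Folkets lexikon, including the second “²”.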

hashier commented 8 years ago

Wow! Batshit crazy! It's really something to get from the obfuscated code to something this simple! <3 love it

hashier / MacFolket

Several issues with pronunciation #7