Parchment doesn't convert user input to ZSCII properly

GoogleCodeExporter commented 9 years ago

See
http://groups.google.com/group/rec.arts.int-fiction/msg/36129d682a6bbbc0?dmode=s
ource

Original issue reported on code.google.com by var...@gmail.com on 30 May 2008 at 3:44

GoogleCodeExporter commented 9 years ago

Here's a test case for this bug: 

Start Drakmagi, and type:
IN
UNDERSZK TRWLQDAN

and you currently get the reply:

Professionella trollkarlar använder sådana lådor för att förvara magiska 
föremål. 
Kraftfulla besvärjelser har lästs över den, och enbart dess ägare kan komma 
åt dess 
innehåll.

This works because Z is interpreted as Ö, since Ö holds the slot in the ZSCII 
table 
which Z normally occupies, and the same goes for W/Ä and Q/Å.

If this bug is fixed, you should be able to type either of these sets of 
command and 
get the same reply as above (while the commands above should no longer work):

Set 1:
IN
UNDERSÖK TRÄLÅDAN

Set 2:
in
undersök trälådan

Set 3:
TRANSKRIPTION
IN
UNDERSOEK TRAELAADAN

Set 4:
transkription
in
undersoek traelaadan

Original comment by fredrik....@gmail.com on 26 Jun 2008 at 10:12

GoogleCodeExporter commented 9 years ago

Awesome, thanks for the test case, this will make fixing the bug a lot easier.

Original comment by var...@gmail.com on 26 Jun 2008 at 4:25

GoogleCodeExporter commented 9 years ago

I've found the problem in gnusto-engine.js (line 3441-3464 in r114 of gnusto-
engine.js). As expected, it's about encoding player input for the purpose of 
matching it against dictionary words. Here it is:

if (ch>=65 && ch<=90) { // A to Z 
// These are NOT mapped to A1. ZSD3.7 
// explicitly forbids use of upper case 
// during encoding. 
  emit(ch-59); 
} else if (ch>=97 && ch<=122) { // a to z 
  emit(ch-91); 
} else { 
  var z2 = this.m_zalphabet[2].indexOf(String.fromCharCode(ch));  
  if (z2!=-1) { 
    if (this.getByte(0)>2) { 
      emit(5); // shift to weird stuff 
    } else { emit(3);} //use a shift as 5 is shift_lock in z1-2  
    emit(z2+6); 
  } else { 
    if (this.getByte(0)>2) { 
      emit(5); 
    } else { emit(3);} //use a shift as 5 is shift_lock in z1-2 
  emit(6); 
  emit(ch >> 5); 
  emit(ch & 0x1F); 
} 

The code above actually *assumes* that row 0 
consists of a-z and row 1 of A-Z. This is a big mistake, which
many old interpreters have done, since the assumption holds in
Infocom's games. As a side effect, downcasing (which the terp is
obliged to do before encoding player input) will not take place
for characters that are not in the alphabet table at all.

This change should make the code work for all Latin-1 characters and
custom Z-alphabets. More work needs to be done to make the downcasing
work properly for the other Latin-n charsets.

if((ch>=65 && ch<=90) || (ch>=192 && ch<=222 && ch != 215)) {
  ch += 32; // Character ch is in upper case and needs downcasing first
}
var z2=this.m_zalphabet[0].indexOf(String.fromCharCode(ch));
if(z2!=-1) { // ch was found in alphabet A0: Just output position + 6
  emit(z2+6);
} else { 
  z2=this.m_zalphabet[1].indexOf(String.fromCharCode(ch));
  if(z2!=-1) { // ch was found in alphabet A1. Output a shift character and ch + 6
    if (this.getByte(0)>2) { 
      emit(4); // shift to A1 
    } else { emit(2);} // shift is positioned differently in z1-2  
    emit(z2+6);
  } else {
    z2=this.m_zalphabet[2].indexOf(String.fromCharCode(ch));
    if(z2!=-1) { // ch was found in alphabet A2. Output a shift character and ch + 6
      if (this.getByte(0)>2) { 
        emit(5); // shift to A2
      } else { emit(3);} // shift is positioned differently in in z1-2  
      emit(z2+6);
    } else {
      if (this.getByte(0)>2) { 
        emit(5); 
      } else { emit(3);} //shift is positioned differently in z1-2 
      emit(6); 
      emit(ch >> 5); 
      emit(ch & 0x1F);
    }
  }
} 

Additionally, it looks like line 3438 should really read:

while (cursor<str.length && result.length<dictionary_entry_length) {

instead of:

while (cursor<str.length && result.length<6) {

, but I'm not absolutely certain.

I haven't tested any of this, but I'm rather confident that it works. I hope 
someone who knows the project and the code well can try this out. See the test 
case 
I described above.

/Fredrik

Original comment by fredrik....@gmail.com on 28 Jan 2009 at 7:15

GoogleCodeExporter commented 9 years ago

Oh excellent work! I had been looking at this problem but got nowhere. I will 
test
and update.

Original comment by curiousdannii on 28 Jan 2009 at 11:33

GoogleCodeExporter commented 9 years ago

Oops, sorry. I made a major mistake there. I was describing downcasing Latin-1, 
while I guess the codes that come in here are in fact ZSCII. In that case, 
downcasing takes a little more code. Instead of this:

if((ch>=65 && ch<=90) || (ch>=192 && ch<=222 && ch != 215)) {
  ch += 32; // Character ch is in upper case and needs downcasing first
}

, you need this:

// Downcase any uppercase characters 
if(ch>=65 && ch<=90) {
  ch += 32; 
} else if((ch>=158 && ch<=160) || (ch>=167 && ch<=168) || (ch>=208 && ch<=210)) 
{
  ch -= 3;
} else if(ch>=175 && ch<=180) {
  ch -= 6;
} else if((ch>=186 && ch<=190) || (ch>=196 && ch<=200)) {
  ch -= 5;
} else if(ch==217 || ch==218) {
  ch -= 2;
} else if(ch==202 || ch==204 || ch==212 || ch==214 || ch==221) {
  ch -= 1;
}

That covers the default ZSCII character set. If the game replaces one or more 
of the 
ZSCII characters 155-255 with other Unicode characters, this won't work. Please 
note 
though, that this should not be a problem for interpreting regular user input, 
since 
that should be lowercased as it's read from the keyboard, and I believe the 
terp 
does that with the Javascript built-in string method toLowerCase (in line 2674 
of 
gnusto-engine.js, r114). toLowerCase *should* do the job of downcasing any and 
all 
characters properly. 

If a game programmer chooses to put uppercase characters from a character set 
that's 
not in the default ZSCII table, and then call parse, the terp should downcase 
the 
characters before encoding them into words to look for in the dictionary, but 
this 
will fail for such characters. Still, the code is a huge improvement over 
what's 
currently in the terp.

Ideally, I think the part of the downcasing code above which deals with 
characters 
155-255 should be ignored if word 3 of the header extension table is present 
and has 
a non-zero value (See 3.8.5.2 of the Z-machine standard v 1.0), since this 
means one 
or more characters have been replaced in the table, and we may be creating 
mayhem by 
trying to downcase it using our normal rules. I'm not quite sure how to check 
word 3 
of the header extension table in the Gnusto code, so I leave it to someone else 
to 
add that.

Original comment by fredrik....@gmail.com on 29 Jan 2009 at 11:24

GoogleCodeExporter commented 9 years ago

Okay, I've committed most of your changes in revision 116! Thanks so much!

UNDERSÖK TRÄLÅDAN now works, however UNDERSOEK TRAELAADAN does not. I'm not 
sure
related to this bug though, as it has no fancy letters?

In regards to lowercasing, I just added an extra toLowerCase() in, and did no 
manual
lowercasing. Checking word three is easy if we ever decide to add the code 
back, with
support for all tables.

Original comment by curiousdannii on 30 Jan 2009 at 11:05

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Ok, I see. UNDERSÖK TRÄLÅDAN is lowercased by the toLowerCase method, so the 
parse 
command doesn't have any lowercasing to do. UNDERSOEK TRAELAADAN should be 
changed 
to all lowercase by toLowerCase in the aread command, and then there's a 
procedure 
in the Swedish Inform library which alters it to undersök trälådan. If you 
change 
TRANSKRIPTION to TRANSKRIPTION VISA in the example, the library should print a 
message saying how it was altered. This functionality is there because some 
players 
don't have Swedish keyboards and will find it easier to type ae instead of ä.

To follow the Z-machine spec, I think you need to do the manual lower-casing as 
well - a game author should be able to throw an uppercase word into an array 
and ask 
the Z-machine to look it up in the dictionary. That's why there was in fact 
code 
(however rudimentary) to lowercase the characters before looking them up in the 
alphabet table.

To do proper, general lower casing at that point, maybe it's easiest to convert 
the 
character from ZSCII to Unicode, apply the toLowerCase method, and then convert 
it 
back to ZSCII again?

Anyway, the change you've done so far should be a great improvement.

BTW: Out of the four suggested sets of commands in the initial post, which ones 
work 
now?

/Fredrik

Original comment by fredrik....@gmail.com on 30 Jan 2009 at 11:24

GoogleCodeExporter commented 9 years ago

I'll give this a further look at sometime (hopefully soon.)

Original comment by curiousdannii on 31 Jan 2009 at 2:08

Changed state: Started
Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Thanks fredrik and dannii!  Hmm, we should probably get a regression test going 
for
this so it doesn't come back to haunt us...

Original comment by var...@gmail.com on 1 Feb 2009 at 7:09

GoogleCodeExporter commented 9 years ago

Firebug has some testing system now... we should use it if we can.

Original comment by curiousdannii on 1 Feb 2009 at 10:57

GoogleCodeExporter commented 9 years ago

Yeah, I agree, Fireunit looks slick:

  http://ejohn.org/blog/fireunit/

Original comment by var...@gmail.com on 2 Feb 2009 at 1:03

GoogleCodeExporter commented 9 years ago

You removed the downcasing completely, which means there may even be games in 
English which used to work that won't anymore. That seems dangerous.

For now, how about adding downcasing code for ZSCII codes up to 126 and the 
regular 
unicode translation table at least? That should cover games in English, German, 
French, Italian, Swedish, Dutch and Lojban. I would expect games in Spanish, 
Russian 
and Slovenian to use alternate tables.

The way it seems to me, this is what the code would look like:

// Downcase any uppercase characters 
if(ch>=65 && ch<=90) {
  ch += 32;
} else if(ch>154 && this.m_unicode_start == 0) {
  // It's an extended character AND the game uses the regular 
  // unicode translation table, so we know how to downcase. 
  if((ch>=158 && ch<=160) || (ch>=167 && ch<=168) || (ch>=208 && ch<=210)) {
    ch -= 3;
  } else if(ch>=175 && ch<=180) {
    ch -= 6;
  } else if((ch>=186 && ch<=190) || (ch>=196 && ch<=200)) {
    ch -= 5;
  } else if(ch==217 || ch==218) {
    ch -= 2;
  } else if(ch==202 || ch==204 || ch==212 || ch==214 || ch==221) {
    ch -= 1;
  }
}

/Fredrik

Original comment by fredrik....@gmail.com on 3 Feb 2009 at 7:40

GoogleCodeExporter commented 9 years ago

I've figured out why command sets 3 and 4 in Comment 1 above don't work. What 
these 
cases do, is that they trigger code in the game that look for 'oe' in the input 
buffer and replace it with the ZSCII code for 'ö' (and 'ae' => 'ä' and 'aa' 
=> 'å'). 
The game behaves correctly, and it works in Frotz, Windows Frotz, Malyon and 
others.

Gnusto actually stores Unicode codes instead of ZSCII codes in the input buffer 
when 
aread is called. In the standard case, tokenisation still works because they're 
converted to ZSCII before being encoded to dictionary words to be compared to 
the 
ones in the dictionary. However, if a game chooses to alter any of the 
characters 
and call tokenise again, this approach fails. The game must store ZSCII 
character 
codes, but the engine only expects Unicode character codes to be in the buffer.

Line 2509 in gnusto-engine.js (r117) proves this. It converts a string which is 
has 
just read from Z-machine memory from Unicode to ZSCII. That line should be 
dropped 
completely, and the keyboard read routine should be altered to convert input to 
ZSCII before storing it in Z-machine memory instead.

Original comment by fredrik....@gmail.com on 5 Feb 2009 at 10:10

GoogleCodeExporter commented 9 years ago

Added your fix from comment 12 in r118.

Will have a look at what you suggested in comment 13 another day. It's getting 
late.

Original comment by curiousdannii on 5 Feb 2009 at 2:11

GoogleCodeExporter commented 9 years ago

I was wrong about 2509 in gnusto-engine.js. It calls a routine named 
_into_zscii, 
which I thought converted Unicode into ZSCII. In fact, its job is to turn a 
word (in 
unicode or zsciii - it's treats it as both) into a compressed word that can be 
compared to dictionary entries. A better name would perhaps be 
_make_dict_lookup_word. Apart from the last paragraph of text, comment 13 is 
still 
valid.

I've implemented these changes and will somehow commit them very soon. All in 
all, 
issue 26 and 56 should go away completely.

Original comment by fredrik....@gmail.com on 6 Feb 2009 at 1:33

GoogleCodeExporter commented 9 years ago

This has been fixed in r119. AFAIK, Unicode now works fully.

Issue marked as Fixed.

Original comment by fredrik....@gmail.com on 7 Feb 2009 at 9:54

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Also tested the test case mentioned in comment 2 for issue 62: typing "см" 
(synonym 
for look in the Russian library) in AllRoadsR.z5, and it does indeed display 
the 
room description again. So, it looks like Unicode input works even when using a 
custom extended character set for ZSCII. If someone can test one or both 
available 
Russian games more thoroughly, it would be even better!

Original comment by fredrik....@gmail.com on 8 Feb 2009 at 12:07

jjxiao / parchment

Parchment doesn't convert user input to ZSCII properly #26