Closed GoogleCodeExporter closed 9 years ago
Here's a test case for this bug:
Start Drakmagi, and type:
IN
UNDERSZK TRWLQDAN
and you currently get the reply:
Professionella trollkarlar använder sådana lådor för att förvara magiska
föremål.
Kraftfulla besvärjelser har lästs över den, och enbart dess ägare kan komma
åt dess
innehåll.
This works because Z is interpreted as Ö, since Ö holds the slot in the ZSCII
table
which Z normally occupies, and the same goes for W/Ä and Q/Å.
If this bug is fixed, you should be able to type either of these sets of
command and
get the same reply as above (while the commands above should no longer work):
Set 1:
IN
UNDERSÖK TRÄLÅDAN
Set 2:
in
undersök trälådan
Set 3:
TRANSKRIPTION
IN
UNDERSOEK TRAELAADAN
Set 4:
transkription
in
undersoek traelaadan
Original comment by fredrik....@gmail.com
on 26 Jun 2008 at 10:12
Awesome, thanks for the test case, this will make fixing the bug a lot easier.
Original comment by var...@gmail.com
on 26 Jun 2008 at 4:25
I've found the problem in gnusto-engine.js (line 3441-3464 in r114 of gnusto-
engine.js). As expected, it's about encoding player input for the purpose of
matching it against dictionary words. Here it is:
if (ch>=65 && ch<=90) { // A to Z
// These are NOT mapped to A1. ZSD3.7
// explicitly forbids use of upper case
// during encoding.
emit(ch-59);
} else if (ch>=97 && ch<=122) { // a to z
emit(ch-91);
} else {
var z2 = this.m_zalphabet[2].indexOf(String.fromCharCode(ch));
if (z2!=-1) {
if (this.getByte(0)>2) {
emit(5); // shift to weird stuff
} else { emit(3);} //use a shift as 5 is shift_lock in z1-2
emit(z2+6);
} else {
if (this.getByte(0)>2) {
emit(5);
} else { emit(3);} //use a shift as 5 is shift_lock in z1-2
emit(6);
emit(ch >> 5);
emit(ch & 0x1F);
}
The code above actually *assumes* that row 0
consists of a-z and row 1 of A-Z. This is a big mistake, which
many old interpreters have done, since the assumption holds in
Infocom's games. As a side effect, downcasing (which the terp is
obliged to do before encoding player input) will not take place
for characters that are not in the alphabet table at all.
This change should make the code work for all Latin-1 characters and
custom Z-alphabets. More work needs to be done to make the downcasing
work properly for the other Latin-n charsets.
if((ch>=65 && ch<=90) || (ch>=192 && ch<=222 && ch != 215)) {
ch += 32; // Character ch is in upper case and needs downcasing first
}
var z2=this.m_zalphabet[0].indexOf(String.fromCharCode(ch));
if(z2!=-1) { // ch was found in alphabet A0: Just output position + 6
emit(z2+6);
} else {
z2=this.m_zalphabet[1].indexOf(String.fromCharCode(ch));
if(z2!=-1) { // ch was found in alphabet A1. Output a shift character and ch + 6
if (this.getByte(0)>2) {
emit(4); // shift to A1
} else { emit(2);} // shift is positioned differently in z1-2
emit(z2+6);
} else {
z2=this.m_zalphabet[2].indexOf(String.fromCharCode(ch));
if(z2!=-1) { // ch was found in alphabet A2. Output a shift character and ch + 6
if (this.getByte(0)>2) {
emit(5); // shift to A2
} else { emit(3);} // shift is positioned differently in in z1-2
emit(z2+6);
} else {
if (this.getByte(0)>2) {
emit(5);
} else { emit(3);} //shift is positioned differently in z1-2
emit(6);
emit(ch >> 5);
emit(ch & 0x1F);
}
}
}
Additionally, it looks like line 3438 should really read:
while (cursor<str.length && result.length<dictionary_entry_length) {
instead of:
while (cursor<str.length && result.length<6) {
, but I'm not absolutely certain.
I haven't tested any of this, but I'm rather confident that it works. I hope
someone who knows the project and the code well can try this out. See the test
case
I described above.
/Fredrik
Original comment by fredrik....@gmail.com
on 28 Jan 2009 at 7:15
Oh excellent work! I had been looking at this problem but got nowhere. I will
test
and update.
Original comment by curiousdannii
on 28 Jan 2009 at 11:33
Oops, sorry. I made a major mistake there. I was describing downcasing Latin-1,
while I guess the codes that come in here are in fact ZSCII. In that case,
downcasing takes a little more code. Instead of this:
if((ch>=65 && ch<=90) || (ch>=192 && ch<=222 && ch != 215)) {
ch += 32; // Character ch is in upper case and needs downcasing first
}
, you need this:
// Downcase any uppercase characters
if(ch>=65 && ch<=90) {
ch += 32;
} else if((ch>=158 && ch<=160) || (ch>=167 && ch<=168) || (ch>=208 && ch<=210))
{
ch -= 3;
} else if(ch>=175 && ch<=180) {
ch -= 6;
} else if((ch>=186 && ch<=190) || (ch>=196 && ch<=200)) {
ch -= 5;
} else if(ch==217 || ch==218) {
ch -= 2;
} else if(ch==202 || ch==204 || ch==212 || ch==214 || ch==221) {
ch -= 1;
}
That covers the default ZSCII character set. If the game replaces one or more
of the
ZSCII characters 155-255 with other Unicode characters, this won't work. Please
note
though, that this should not be a problem for interpreting regular user input,
since
that should be lowercased as it's read from the keyboard, and I believe the
terp
does that with the Javascript built-in string method toLowerCase (in line 2674
of
gnusto-engine.js, r114). toLowerCase *should* do the job of downcasing any and
all
characters properly.
If a game programmer chooses to put uppercase characters from a character set
that's
not in the default ZSCII table, and then call parse, the terp should downcase
the
characters before encoding them into words to look for in the dictionary, but
this
will fail for such characters. Still, the code is a huge improvement over
what's
currently in the terp.
Ideally, I think the part of the downcasing code above which deals with
characters
155-255 should be ignored if word 3 of the header extension table is present
and has
a non-zero value (See 3.8.5.2 of the Z-machine standard v 1.0), since this
means one
or more characters have been replaced in the table, and we may be creating
mayhem by
trying to downcase it using our normal rules. I'm not quite sure how to check
word 3
of the header extension table in the Gnusto code, so I leave it to someone else
to
add that.
Original comment by fredrik....@gmail.com
on 29 Jan 2009 at 11:24
Okay, I've committed most of your changes in revision 116! Thanks so much!
UNDERSÖK TRÄLÅDAN now works, however UNDERSOEK TRAELAADAN does not. I'm not
sure
related to this bug though, as it has no fancy letters?
In regards to lowercasing, I just added an extra toLowerCase() in, and did no
manual
lowercasing. Checking word three is easy if we ever decide to add the code
back, with
support for all tables.
Original comment by curiousdannii
on 30 Jan 2009 at 11:05
Ok, I see. UNDERSÖK TRÄLÅDAN is lowercased by the toLowerCase method, so the
parse
command doesn't have any lowercasing to do. UNDERSOEK TRAELAADAN should be
changed
to all lowercase by toLowerCase in the aread command, and then there's a
procedure
in the Swedish Inform library which alters it to undersök trälådan. If you
change
TRANSKRIPTION to TRANSKRIPTION VISA in the example, the library should print a
message saying how it was altered. This functionality is there because some
players
don't have Swedish keyboards and will find it easier to type ae instead of ä.
To follow the Z-machine spec, I think you need to do the manual lower-casing as
well - a game author should be able to throw an uppercase word into an array
and ask
the Z-machine to look it up in the dictionary. That's why there was in fact
code
(however rudimentary) to lowercase the characters before looking them up in the
alphabet table.
To do proper, general lower casing at that point, maybe it's easiest to convert
the
character from ZSCII to Unicode, apply the toLowerCase method, and then convert
it
back to ZSCII again?
Anyway, the change you've done so far should be a great improvement.
BTW: Out of the four suggested sets of commands in the initial post, which ones
work
now?
/Fredrik
Original comment by fredrik....@gmail.com
on 30 Jan 2009 at 11:24
I'll give this a further look at sometime (hopefully soon.)
Original comment by curiousdannii
on 31 Jan 2009 at 2:08
Thanks fredrik and dannii! Hmm, we should probably get a regression test going
for
this so it doesn't come back to haunt us...
Original comment by var...@gmail.com
on 1 Feb 2009 at 7:09
Firebug has some testing system now... we should use it if we can.
Original comment by curiousdannii
on 1 Feb 2009 at 10:57
Yeah, I agree, Fireunit looks slick:
http://ejohn.org/blog/fireunit/
Original comment by var...@gmail.com
on 2 Feb 2009 at 1:03
You removed the downcasing completely, which means there may even be games in
English which used to work that won't anymore. That seems dangerous.
For now, how about adding downcasing code for ZSCII codes up to 126 and the
regular
unicode translation table at least? That should cover games in English, German,
French, Italian, Swedish, Dutch and Lojban. I would expect games in Spanish,
Russian
and Slovenian to use alternate tables.
The way it seems to me, this is what the code would look like:
// Downcase any uppercase characters
if(ch>=65 && ch<=90) {
ch += 32;
} else if(ch>154 && this.m_unicode_start == 0) {
// It's an extended character AND the game uses the regular
// unicode translation table, so we know how to downcase.
if((ch>=158 && ch<=160) || (ch>=167 && ch<=168) || (ch>=208 && ch<=210)) {
ch -= 3;
} else if(ch>=175 && ch<=180) {
ch -= 6;
} else if((ch>=186 && ch<=190) || (ch>=196 && ch<=200)) {
ch -= 5;
} else if(ch==217 || ch==218) {
ch -= 2;
} else if(ch==202 || ch==204 || ch==212 || ch==214 || ch==221) {
ch -= 1;
}
}
/Fredrik
Original comment by fredrik....@gmail.com
on 3 Feb 2009 at 7:40
I've figured out why command sets 3 and 4 in Comment 1 above don't work. What
these
cases do, is that they trigger code in the game that look for 'oe' in the input
buffer and replace it with the ZSCII code for 'ö' (and 'ae' => 'ä' and 'aa'
=> 'å').
The game behaves correctly, and it works in Frotz, Windows Frotz, Malyon and
others.
Gnusto actually stores Unicode codes instead of ZSCII codes in the input buffer
when
aread is called. In the standard case, tokenisation still works because they're
converted to ZSCII before being encoded to dictionary words to be compared to
the
ones in the dictionary. However, if a game chooses to alter any of the
characters
and call tokenise again, this approach fails. The game must store ZSCII
character
codes, but the engine only expects Unicode character codes to be in the buffer.
Line 2509 in gnusto-engine.js (r117) proves this. It converts a string which is
has
just read from Z-machine memory from Unicode to ZSCII. That line should be
dropped
completely, and the keyboard read routine should be altered to convert input to
ZSCII before storing it in Z-machine memory instead.
Original comment by fredrik....@gmail.com
on 5 Feb 2009 at 10:10
Added your fix from comment 12 in r118.
Will have a look at what you suggested in comment 13 another day. It's getting
late.
Original comment by curiousdannii
on 5 Feb 2009 at 2:11
I was wrong about 2509 in gnusto-engine.js. It calls a routine named
_into_zscii,
which I thought converted Unicode into ZSCII. In fact, its job is to turn a
word (in
unicode or zsciii - it's treats it as both) into a compressed word that can be
compared to dictionary entries. A better name would perhaps be
_make_dict_lookup_word. Apart from the last paragraph of text, comment 13 is
still
valid.
I've implemented these changes and will somehow commit them very soon. All in
all,
issue 26 and 56 should go away completely.
Original comment by fredrik....@gmail.com
on 6 Feb 2009 at 1:33
This has been fixed in r119. AFAIK, Unicode now works fully.
Issue marked as Fixed.
Original comment by fredrik....@gmail.com
on 7 Feb 2009 at 9:54
Also tested the test case mentioned in comment 2 for issue 62: typing "см"
(synonym
for look in the Russian library) in AllRoadsR.z5, and it does indeed display
the
room description again. So, it looks like Unicode input works even when using a
custom extended character set for ZSCII. If someone can test one or both
available
Russian games more thoroughly, it would be even better!
Original comment by fredrik....@gmail.com
on 8 Feb 2009 at 12:07
Original issue reported on code.google.com by
var...@gmail.com
on 30 May 2008 at 3:44