mras0 / sasm

Self-hosting Real Mode x86 Assembler
MIT License
33 stars 3 forks source link

Support for 'h' suffix. #1

Open rogueeve575 opened 4 months ago

rogueeve575 commented 4 months ago

A lot of my existing nasm code is using hex numbers in the format with the trailing 'h' (e.g. "0100h" instead of "0x100").

I don't know if this project is actively maintained, but it would be nice to have support for that alternate number format. I don't want to go through the whole forking process to be able to submit a pull request, but I did put together this little mod to add that in which seems to work fine as a potential starting place (I only bothered with SASM.COM, I'm sure doing this in the C version is fairly trivial):


diff --git a/sasm.asm b/sasm.asm
index 1e57006..9766700 100644
--- a/sasm.asm
+++ b/sasm.asm
@@ -833,11 +833,32 @@ Expect:

 ; Get number from Token to AX
 GetTokenNumber:
+               xor dl, dl                                      ; assume no "h" suffix
         cmp word [UToken], '0X'
         je .Hex
-        ; Decimal number
+
+               ; check if it has h suffix though (e.g. "0100h")
         xor ax, ax
         mov bx, UToken
+
+               mov al, [TokenLen]                      ; read length of token
+               add bx, ax                                      ; move past end of token
+               dec bx                                          ; back to point at last char
+               mov al, [bx]                            ; fetch last char of token
+
+               cmp al, 'H'                                     ; has "h" suffix?
+               jne .decimal                            ; if not, treat as decimal
+
+               ; hex number with 'h' suffix
+               inc dl                                          ; don't skip 2 chars for "0x" below
+        mov cl, [TokenLen]                     ; decrement len to process
+               dec cl                                          ; take off the 'h'
+               jmp short .Hex2
+
+.decimal:
+        ; Decimal number
+        mov bx, UToken
+               xor al, al
         xor ch, ch
 .Dec:
         mov cl, [bx]
@@ -858,11 +879,17 @@ GetTokenNumber:
 .Hex:
         mov cl, [TokenLen]
         sub cl, 2
+.Hex2:
         cmp cl, 4
         ja .Error
-        xor ax, ax
+
         mov bx, UToken
+               test dl, dl
+               jnz .noskip0x
         add bx, 2
+.noskip0x:
+
+        xor ax, ax
 .HexCvt:
         shl ax, 1
         shl ax, 1
mras0 commented 4 months ago

Proper intel syntax hex literals (0nnnnH) are a good idea to support! Never really liked the syntax, so I just always used 0xnnnn as allowed by nasm.

To be honest, I haven't touched the code in a number of years, but the code is maintained in the sense that I've so far responded to every issue :)

Your way seems fine, but just to jog my memory of how things work, I've done it slightly differently and added a test: https://github.com/mras0/sasm/tree/HexSuffix

If that works for you, I'll merge it into master

rogueeve575 commented 4 months ago

Cool, I like how you did it and it does assemble my test cases just as well as my version. If I had any nitpicky at all which I really don't, it would be around needing to push/pop si which I was trying to avoid needing to use the stack for speed, but as a result you end up fetching a lot less instructions than I did so I'm sure it's a win.

I may have been being a little overly cautious in mine; I wasn't certain if I could do the "fancy" addressing modes ("cmp byte [bx+si-1]") on an 8086/80286. I don't have an 8086 but I did test your patch on a real 286 and it's fine, plus I see there's a few other similar uses of those modes already in the program so this is absolutely probably a better way to get a pointer to the last char.

Not being fully familiar with the program I also wasn't 100% sure if it was okay to corrupt the dl register in that subroutine (that I used to signal it was the 'h' form so as not to skip two chars for the "0x"). Your version doesn't need to do that though so doesn't really matter anymore. It's great for the original author to do it because even if you don't 100% remember everything you'll probably have a better idea of stuff like that than I would just due to knowing your own coding style, at least from when I've looked at old projects of mine that's how it seemed.

I'll admit I don't fully understand the part in lines 867-873, I thought this was to deal with leading zeroes somehow, but I can't tell any difference any behavior from my version that was missing that part, maybe it's just an optimization for the number "0h" / "00h":

.HexSuffix:
        dec cl
        cmp byte [bx], '0'
        jne .HexCheckLen
        inc bx
        dec cl
        jnz .HexCheckLen
        ret

Your C version is much cleaner than the lazy hack I made for that after the OP as well. So overall looks great to me.

rogueeve575 commented 4 months ago

BTW I also noticed that sasm doesn't support the "jle" and "jge" instructions. These are just synonyms for the existing "jng" and "jnl" mnemonics so are really easy to just add to the table:

 ▶ git diff sasm.asm sasm.c
diff --git a/sasm.asm b/sasm.asm
index 0ff60e9..91a920e 100644
--- a/sasm.asm
+++ b/sasm.asm
@@ -3456,8 +3456,12 @@ DispatchList:
     dw InstJCC
     db 'JL',0,0,0,0, CC_L
     dw InstJCC
+    db 'JGE',0,0,0,  CC_NL
+    dw InstJCC
     db 'JNL',0,0,0,  CC_NL
     dw InstJCC
+    db 'JLE',0,0,0,  CC_NG
+    dw InstJCC
     db 'JNG',0,0,0,  CC_NG
     dw InstJCC
     db 'JG',0,0,0,0, CC_G
diff --git a/sasm.c b/sasm.c
index b73deb2..4d6dad6 100644
--- a/sasm.c
+++ b/sasm.c
@@ -1919,8 +1919,10 @@ static const struct {
     { "JPE"    , &InstJcc         , JPE  },
     { "JPO"    , &InstJcc         , JPO  },
     { "JL"     , &InstJcc         , JL   },
+    { "JGE"    , &InstJcc         , JNL  },
     { "JNL"    , &InstJcc         , JNL  },
     { "JNG"    , &InstJcc         , JNG  },
+    { "JLE"    , &InstJcc         , JNG  },
     { "JG"     , &InstJcc         , JG   },
     };

I was going to add support for "xlat" as well, but upon starting I noticed that sasm does already implement it as "xlatb", and documentation seems to indicate that's actually correct, so even though I've always seen it written just "xlat" in every assembler and debugger I've used before, I'm unsure whether or not it'd be appropriate to just add "xlat" as a synonym to "xlatb" as well.

From what I understand apparently Intel intended you to give some bogus argument to the "xlat" form for "documentation" that's then completely ignored and bx used instead, maybe something like "xlat [table]". Seems like a terribly dumb idea to me from a maintainability standpoint, when you could just put a comment instead, but yeah. Another source says it allows for an segment override, so I guess something like "xlat [es:bx]", which at least could be potentially useful although I'm not sure if that's true.

However, I'm guessing as to Intel's intended syntax as I can't find any actual complete examples, and I can't get any of these forms to work in nasm, only plain "xlat" with no argument as I've usually seen, which actually appears incorrect from the documentation, so, lol....

mras0 commented 4 months ago

Cool, I like how you did it and it does assemble my test cases just as well as my version. If I had any nitpicky at all which I really don't, it would be around needing to push/pop si which I was trying to avoid needing to use the stack for speed, but as a result you end up fetching a lot less instructions than I did so I'm sure it's a win.

I may have been being a little overly cautious in mine; I wasn't certain if I could do the "fancy" addressing modes ("cmp byte [bx+si-1]") on an 8086/80286. I don't have an 8086 but I did test your patch on a real 286 and it's fine, plus I see there's a few other similar uses of those modes already in the program so this is absolutely probably a better way to get a pointer to the last char.

To be honest, I didn't profile it, and your version was fine as well, just took the opportunity to reorganize the code a bit (after spending a bit of time remembering how it works..). In general I wrote the assembler incrementally with ease of implementation in mind rather than speed. The linear scan of the DispatchList is not great for speed (it should really be a hash table).

You can't use fancy addressing modes, but bx+si is one of the non-fancy ones :) (I find this page a pretty convenient reference: http://ref.x86asm.net/coder.html#modrm_byte_16)

Not being fully familiar with the program I also wasn't 100% sure if it was okay to corrupt the dl register in that subroutine (that I used to signal it was the 'h' form so as not to skip two chars for the "0x"). Your version doesn't need to do that though so doesn't really matter anymore. It's great for the original author to do it because even if you don't 100% remember everything you'll probably have a better idea of stuff like that than I would just due to knowing your own coding style, at least from when I've looked at old projects of mine that's how it seemed.

Using DX should be fine there, but I rely on the tests to keep me honest - I'd forgotten I needed to preserve SI, but luckily that was caught.

I'll admit I don't fully understand the part in lines 867-873, I thought this was to deal with leading zeroes somehow, but I can't tell any difference any behavior from my version that was missing that part, maybe it's just an optimization for the number "0h" / "00h":

.HexSuffix:
        dec cl
        cmp byte [bx], '0'
        jne .HexCheckLen
        inc bx
        dec cl
        jnz .HexCheckLen
        ret

Your C version is much cleaner than the lazy hack I made for that after the OP as well. So overall looks great to me.

It is skipping a (single) leading '0' (and handling '0H' directly). If not there the length check (at .HexCheckLen) will fail for '0abcdH' (cl would be 5).

BTW I also noticed that sasm doesn't support the "jle" and "jge" instructions. These are just synonyms for the existing "jng" and "jnl" mnemonics so are really easy to just add to the table:

 ▶ git diff sasm.asm sasm.c
diff --git a/sasm.asm b/sasm.asm
index 0ff60e9..91a920e 100644
--- a/sasm.asm
+++ b/sasm.asm
@@ -3456,8 +3456,12 @@ DispatchList:
     dw InstJCC
     db 'JL',0,0,0,0, CC_L
     dw InstJCC
+    db 'JGE',0,0,0,  CC_NL
+    dw InstJCC
     db 'JNL',0,0,0,  CC_NL
     dw InstJCC
+    db 'JLE',0,0,0,  CC_NG
+    dw InstJCC
     db 'JNG',0,0,0,  CC_NG
     dw InstJCC
     db 'JG',0,0,0,0, CC_G
diff --git a/sasm.c b/sasm.c
index b73deb2..4d6dad6 100644
--- a/sasm.c
+++ b/sasm.c
@@ -1919,8 +1919,10 @@ static const struct {
     { "JPE"    , &InstJcc         , JPE  },
     { "JPO"    , &InstJcc         , JPO  },
     { "JL"     , &InstJcc         , JL   },
+    { "JGE"    , &InstJcc         , JNL  },
     { "JNL"    , &InstJcc         , JNL  },
     { "JNG"    , &InstJcc         , JNG  },
+    { "JLE"    , &InstJcc         , JNG  },
     { "JG"     , &InstJcc         , JG   },
     };

I've added these (and xlat).

I was going to add support for "xlat" as well, but upon starting I noticed that sasm does already implement it as "xlatb", and documentation seems to indicate that's actually correct, so even though I've always seen it written just "xlat" in every assembler and debugger I've used before, I'm unsure whether or not it'd be appropriate to just add "xlat" as a synonym to "xlatb" as well.

I originally added it as xlatb since that's what nasm disassembles it to, but I agree that "xlat" is more natural.

From what I understand apparently Intel intended you to give some bogus argument to the "xlat" form for "documentation" that's then completely ignored and bx used instead, maybe something like "xlat [table]". Seems like a terribly dumb idea to me from a maintainability standpoint, when you could just put a comment instead, but yeah. Another source says it allows for an segment override, so I guess something like "xlat [es:bx]", which at least could be potentially useful although I'm not sure if that's true.

However, I'm guessing as to Intel's intended syntax as I can't find any actual complete examples, and I can't get any of these forms to work in nasm, only plain "xlat" with no argument as I've usually seen, which actually appears incorrect from the documentation, so, lol....

Yeah, I think xlat allows for segment override. The reason I like nasm syntax is that it isn't clunky, so I don't like a "dummy" argument that always has to be some for of [(seg:)bx]. The nasm way would be es xlat (https://www.nasm.us/xdoc/2.16.03rc1/html/nasmdoc3.html#section-3.1), but that isn't supported (yet). As a workaround you can use:

        db 0x26 ; ES segment override for..
        xlat
rogueeve575 commented 4 months ago

If you search for "xlat" on this page, NASM apparently just treats xlat as a synonym for xlatb with no argument expected or possible. I've NEVER seen the form "xlat [some label that presumably is also loaded into bx but ain't nobody checking]", so, maybe other people thought it was a dumb idea too and didn't implement it.

I actually kind of like "es xlat". I don't have a use for it right now, but I'll have to try that (or it's db-bodged form).

Nice reference, x86 is such a nonorthogonal mess that whenever I do 8086 assembly there always ends up being a case at some point where I'm just trying different things to see if they assemble and then going, "oh well that's unfortunate, now I have to spill......am I sure there's not some other form that's faster?". The only CPU I've worked with where every supposed "general purpose" register has a caveat of "but you must use it when doing this instruction and oh you can't with that one".