Why is it so slow? - GitHub Issues

jiayihu commented 4 years ago

I'm no expert in embedded development, but why is semihosting so slow? Characters appear on the host's stdout one at a time, each taking a few hundred milliseconds. I know that semihosting is slow in general, but I never had any issue using it from Ada, which uses the following implementation:

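with Ada.Unchecked_Conversion;
with Interfaces;          use Interfaces;
with System.Machine_Code; use System.Machine_Code;

package body System.Semihosting is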

   type SH_Word is new Interfaces.Unsigned_32;

   function To_SH_Word is new Ada.Unchecked_Conversion
     (Source => System.Address, Target => SH_Word);

   function Generic_SH_Call (R0, R1 : SH_Word) return SH_Word;
   --  Handles the low-level part of semihosting, setting the registers and
   --  executing a breakpoint instruction.

   subtype Syscall is SH_Word;

   SYS_WRITEC : constant Syscall := 16#03#;
   SYS_WRITE0 : constant Syscall := 16#04#;
   SYS_READC  : constant Syscall := 16#07#;

   --  Output buffer

   --  Because most of the time required for semihosting is not consumed for
   --  the data itself but rather in the handling of breakpoint and
   --  communication between the target and debugger, sending one byte costs
   --  almost as much time as sending a buffer of multiple bytes.
   --
   --  For this reason, we use an output buffer for the semihosting Put
   --  functions. The buffer is flushed when full or when a line feed or NUL
   --  character is transmitted.

   Buffer_Size : constant := 128;
   type Buffer_Range is range 1 .. Buffer_Size;
   Buffer : array (Buffer_Range) of Unsigned_8;
   Buffer_Index : Buffer_Range := Buffer_Range'First;

   procedure Flush;
   --  Send the content of the buffer with semihosting WRITE0 call

   ---------------------
   -- Generic_SH_Call --
   ---------------------

   function Generic_SH_Call (R0, R1 : SH_Word) return SH_Word is
      Ret : SH_Word;
   begin
      Asm ("mov r0, %1" & ASCII.LF & ASCII.HT &
           "mov r1, %2" & ASCII.LF & ASCII.HT &
           "bkpt #0xAB" & ASCII.LF & ASCII.HT &
           "mov %0, r0",
           Outputs  => (SH_Word'Asm_Output ("=r", Ret)),
           Inputs   => (SH_Word'Asm_Input ("r", R0),
                        SH_Word'Asm_Input ("r", R1)),
           Volatile => True,
           Clobber => ("r1, r0"));
      return Ret;
   end Generic_SH_Call;

   -----------
   -- Flush --
   -----------

   procedure Flush is
      Unref : SH_Word;
      pragma Unreferenced (Unref);
   begin
      if Buffer_Index /= Buffer'First then
         --  Set null-termination
         Buffer (Buffer_Index) := 0;

         --  Send the buffer with a semihosting call
         Unref := Generic_SH_Call (SYS_WRITE0, To_SH_Word (Buffer'Address));

         --  Reset buffer index
         Buffer_Index := Buffer'First;
      end if;
   end Flush;

   ---------
   -- Put --
   ---------

   procedure Put (Item : Character) is
      Unref : SH_Word;
      pragma Unreferenced (Unref);

      C : Character with Volatile;
      --  Use a volatile variable to avoid compiler's optimization

   begin
      if Item = ASCII.NUL then
         --  The WRITE0 semihosting call that we use to send the output buffer
         --  expects a null terminated C string. Therefore it is not possible
         --  to have an ASCII.NUL character in the middle of the buffer as this
         --  would truncate the buffer.
         --
         --  For this reason the ASCII.NUL character is sent separately with a
         --  WRITEC semihosting call.

         --  Flush the current buffer
         Flush;

         --  Send the ASCII.NUL with a WRITEC semihosting call
         C := Item;
         Unref := Generic_SH_Call (SYS_WRITEC, To_SH_Word (C'Address));

      else

         Buffer (Buffer_Index) := Character'Pos (Item);
         Buffer_Index := Buffer_Index + 1;

         --  Flush the buffer when it is full or if the character is a line
         --  feed.
         if Buffer_Index = Buffer'Last or else Item = ASCII.LF then
            Flush;
         end if;
      end if;
   end Put;

   ---------
   -- Put --
   ---------

   procedure Put (Item : String) is
   begin
      for Index in Item'Range loop
         Put (Item (Index));
      end loop;
   end Put;

   ---------
   -- Get --
   ---------

   procedure Get (Item : out Character) is
      Ret : SH_Word;
   begin
      Ret := Generic_SH_Call (SYS_READC, 0);
      Item := Character'Val (Ret);
   end Get;

end System.Semihosting;

I know that just pasting the Ada implementation may not be helpful, but unfortunately I'm a rookie at both languages and at embedded programming (although the code is quite natural to read). I just checked this lib's implementation and it already uses a buffer, so I don't know what the issue might be. The asm instructions also seem to be similar, although Ada does something more with the registers, but don't trust me on that.

pftbest commented 4 years ago

For your information this library does not have a buffer. If you need a 128 byte buffer like you have in Ada, it seems you have to implement it yourself on top of hio::HStdout.

P.S. If you care about speed, consider using ITM. It's harder to set up, but it's much faster than semihosting.
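A buffer of the kind suggested above could look roughly like the following. This is a minimal sketch, not code from the crate: `LineBuffered` and `RecordingSink` are hypothetical names, and the wrapper only assumes that the underlying handle (for example `hio::HStdout`) implements `core::fmt::Write`. Like the Ada version, it flushes when the 128-byte buffer is full or when a line feed is written.

```rust
use core::fmt::{self, Write};

/// Hypothetical line-buffered wrapper, not part of cortex-m-semihosting.
/// `W` stands in for a handle such as `hio::HStdout`.
pub struct LineBuffered<W: Write> {
    sink: W,
    buf: [u8; 128],
    len: usize,
}

impl<W: Write> LineBuffered<W> {
    pub fn new(sink: W) -> Self {
        Self { sink, buf: [0; 128], len: 0 }
    }

    /// Borrow the wrapped sink (handy for inspection in tests).
    pub fn get_ref(&self) -> &W {
        &self.sink
    }

    /// Hand everything buffered so far to the sink in one `write_str` call.
    pub fn flush(&mut self) -> fmt::Result {
        if self.len > 0 {
            // Only whole characters are stored, so the bytes are valid UTF-8.
            let text = core::str::from_utf8(&self.buf[..self.len]).map_err(|_| fmt::Error)?;
            self.sink.write_str(text)?;
            self.len = 0;
        }
        Ok(())
    }
}

impl<W: Write> Write for LineBuffered<W> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        for ch in s.chars() {
            let mut tmp = [0u8; 4];
            let encoded: &str = ch.encode_utf8(&mut tmp);
            let bytes = encoded.as_bytes();
            // Make room first so a character is never split across flushes.
            if self.len + bytes.len() > self.buf.len() {
                self.flush()?;
            }
            self.buf[self.len..self.len + bytes.len()].copy_from_slice(bytes);
            self.len += bytes.len();
            // A line feed ends the line, so send it to the host now.
            if ch == '\n' {
                self.flush()?;
            }
        }
        Ok(())
    }
}

/// Host-side stand-in for the semihosting handle: counts `write_str` calls.
#[derive(Default)]
pub struct RecordingSink {
    pub calls: usize,
    pub text: String,
}

impl Write for RecordingSink {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        self.calls += 1;
        self.text.push_str(s);
        Ok(())
    }
}
```

On a target, one would wrap the real stdout handle instead of `RecordingSink` and call `flush()` before halting, since text without a trailing line feed stays in the buffer until then.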

jiayihu commented 4 years ago

Okay, I was misled by these lines. What do they do, then?

Anyway semihosting has always been fast enough for my uses, so adding a buffer on top of the lib should be enough. Would you accept a PR?

therealprof commented 4 years ago

> P.S. If you care about speed, consider using ITM. It's harder to set up, but it's much faster than semihosting.

For debugging, RTT seems more useful.

> Okay, I was misled by these lines. What do they do, then?

Output buffering, so you can send a whole buffer of bytes instead of having to spoon-feed (and error-check) each character individually. I don't see much difference in what the Ada code is doing; maybe its syscall code is faster, if you see a huge difference? No idea.

> Anyway semihosting has always been fast enough for my uses, so adding a buffer on top of the lib should be enough. Would you accept a PR?

Use of semihosting is kind of discouraged, but if you can come up with a trivial optimisation we'd probably take it.

I'd encourage you to check out https://probe.rs and further enhancements like https://github.com/knurling-rs/probe-run for better ways to get debug output.

pftbest commented 4 years ago

These lines are an unfortunate consequence of the Unix-style API for semihosting. When you write a byte array, it is theoretically possible that the write syscall returns a value that is smaller than the buffer length. In that case we repeat the write until the whole array is sent. In practice this never happens, so most implementations just ignore the return value, like your Ada implementation above.

@therealprof There is no output buffering in this library.
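The retry behaviour described in this comment can be sketched as follows. This is an illustration under stated assumptions rather than the crate's code: `raw_write` is a hypothetical stand-in for the semihosting write call and follows Arm's SYS_WRITE convention of returning the number of bytes that were not written (0 means everything was sent). Unlike a loop that retries unconditionally, this sketch treats "no progress" as an error so that it cannot spin forever.

```rust
/// Keep calling `raw_write` until the whole buffer has been accepted.
///
/// `raw_write` is a hypothetical stand-in for the SYS_WRITE semihosting call.
/// It returns the number of bytes NOT written; 0 means the call succeeded.
pub fn write_all<F>(mut buffer: &[u8], mut raw_write: F) -> Result<(), ()>
where
    F: FnMut(&[u8]) -> usize,
{
    while !buffer.is_empty() {
        match raw_write(buffer) {
            // Everything was sent.
            0 => return Ok(()),
            // `n` bytes were left over: retry with the unwritten tail.
            n if n < buffer.len() => buffer = &buffer[buffer.len() - n..],
            // No progress at all, or a nonsensical value: give up.
            _ => return Err(()),
        }
    }
    Ok(())
}
```

With a host that accepts at most two bytes per call, `write_all(b"Hello", ...)` ends up issuing three calls ("He", "ll", "o"), which is exactly the case the loop exists for.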

pftbest commented 4 years ago

@therealprof Just to make it clear why there is a difference:

When you do

hprintln!("Hello {} World!", 5);

it will perform 3 write syscalls "Hello ", "5" and " World!\n".

If each syscall takes 100ms that gives you 300ms in total.

In comparison, the Ada code above will make only 1 write syscall because it buffers the line in a 128-byte buffer. So it will take only 100ms, not 300.
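The effect described above can be observed on a host machine with a small counting writer. This is a sketch: `CountingSink` and `unbuffered` are hypothetical names, and each `write_str` call stands in for one semihosting write syscall. The exact number of calls is an implementation detail of `core::fmt`, so the code only relies on there being more than one when a runtime argument is interpolated.

```rust
use core::fmt::{self, Write};

/// Stand-in for the semihosting stdout handle: every `write_str` call
/// models one (slow) write syscall.
#[derive(Default)]
pub struct CountingSink {
    pub calls: usize,
    pub text: String,
}

impl Write for CountingSink {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        self.calls += 1;
        self.text.push_str(s);
        Ok(())
    }
}

/// Format straight into the sink, with `value` as a runtime argument,
/// and return the sink so the caller can inspect `calls`.
pub fn unbuffered(value: u32) -> CountingSink {
    let mut sink = CountingSink::default();
    write!(sink, "Hello {} World!\n", value).unwrap();
    sink
}
```

Inspecting `unbuffered(5).calls` shows that the formatting machinery hands the text to the writer in several pieces, one per literal fragment and one per argument, rather than as a single string.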

therealprof commented 4 years ago

> In comparison, the Ada code above will make only 1 write syscall because it buffers the line in a 128-byte buffer. So it will take only 100ms, not 300.

Well, that's an implementation detail of the formatting code. If you use:

hprintln!("Hello 5 World!");

Then it'll be sent at once. If you do the formatting into a separate buffer, you can also avoid that.
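Formatting into a separate buffer first might look like this. It is a minimal sketch with hypothetical names (`Scratch`, `print_line`): the line is assembled in a fixed-size stack buffer and then passed to the output handle in a single `write_str` call. The sink is any `core::fmt::Write` implementor; on a target it would be the semihosting stdout handle, assuming that handle implements the trait.

```rust
use core::fmt::{self, Write};

/// Hypothetical fixed-capacity scratch buffer (no heap needed).
pub struct Scratch<const N: usize> {
    bytes: [u8; N],
    len: usize,
}

impl<const N: usize> Scratch<N> {
    pub fn new() -> Self {
        Self { bytes: [0; N], len: 0 }
    }

    pub fn as_str(&self) -> &str {
        // Only whole `&str` values are copied in, so this is valid UTF-8.
        core::str::from_utf8(&self.bytes[..self.len]).unwrap_or("")
    }
}

impl<const N: usize> Write for Scratch<N> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let end = self.len + s.len();
        if end > N {
            // Refuse to truncate silently; the caller sees an error instead.
            return Err(fmt::Error);
        }
        self.bytes[self.len..end].copy_from_slice(s.as_bytes());
        self.len = end;
        Ok(())
    }
}

/// Format the whole line locally, then hand it to `sink` in one call.
pub fn print_line<W: Write>(sink: &mut W, value: u32) -> fmt::Result {
    let mut line = Scratch::<128>::new();
    write!(line, "Hello {} World!\n", value)?;
    sink.write_str(line.as_str())
}
```

The trade-off is the one raised in this thread: the scratch buffer costs RAM (128 bytes of stack here), and a line longer than the buffer is rejected rather than sent in pieces.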

pftbest commented 4 years ago

True, but I think it would be useful to have an extra feature which gives you a buffer by default. Have you seen how many write calls a simple #[derive(Debug)] struct makes? It's very slow.

128 bytes of RAM is not zero cost either, so it has to be opt-in, of course.

jiayihu commented 4 years ago

@therealprof thanks, I've been able to migrate to probe-rs without any issue and it's much easier to use, so I don't have any complaints about semihosting anymore 😄 I'll leave closing the issue up to you guys.

therealprof commented 4 years ago

> True, but I think it would be useful to have an extra feature which gives you a buffer by default. Have you seen how many write calls a simple #[derive(Debug)] struct makes? It's very slow.

I don't. Nevertheless, semihosting has so many drawbacks that its only remaining application really is to communicate with QEMU for test purposes. Even if we could speed it up by a factor of 10, it would still be very slow and keep all its other disadvantages, so why even bother, especially at such a cost?

pftbest commented 4 years ago

Personally, I haven't used semihosting for a very long time. However, when I was using it I did make my own buffer, so I wouldn't vote against such a feature. But if nobody else requires it, this can be closed. Maybe we should put a warning in the docs that this is very old technology and add links to other crates so people can find better options easily.

jonas-schievink commented 4 years ago

Closing this, as the question has been answered.

rust-embedded / cortex-m-semihosting

Why is it so slow? #58