mortbopet / Ripes

A graphical processor simulator and assembly editor for the RISC-V ISA
https://ripes.me/
MIT License
2.49k stars 270 forks source link

Fix ecall support for read-only file operations #325

Closed raccog closed 8 months ago

raccog commented 8 months ago

When working with file operations, the ecall instruction would not recognize the O_RDONLY flag. It would also accidentally use the QIODevice::Append flag in read-only mode, which would then silently fail and return the wrong characters. This fixes the flags so that read-only files return the correct characters (fixes #316).

Thanks to @lukasrad02 for reporting!

lukasrad02 commented 8 months ago

Thank you for providing a fix! During testing, I've noticed that there are some more places where the 0-flag causes bugs.

At https://github.com/mortbopet/Ripes/blob/e837e233ae395db6c52090908feddc2ede8193e4/src/syscall/systemio.h#L196 an equality check is used to test for a flag. While this ensures support for the zero flag, it breaks when checking for multiple flags with an "any of" semantic, e. g. at https://github.com/mortbopet/Ripes/blob/e837e233ae395db6c52090908feddc2ede8193e4/src/syscall/systemio.h#L412 Therefore, either the check in writeToFile() has to check each flag individually and then OR the results, or fdInUse() has to be patched similarly to the fix in openFilestream() you've provided.

In general, it might be worth to consider changing the flags so that each flag has exactly one bit set to 1, so that simple bit operations work. This would – of course – not be backwards compatible, but if I did not miss anything in the commit history, the zero flag bug exists from the first release of the file I/O feature, hence backwards compatibility might not be necessary.

If you prefer to keep this PR small, I could also create a separate issue for the writeToFile() bug and/or the discussion on backwards compatibility.

raccog commented 8 months ago

@lukasrad02 Thank you so much for continuing to submit well-written bug reports! Your descriptions have made it much easier to work on fixing these bugs. 👍

In general, it might be worth to consider changing the flags so that each flag has exactly one bit set to 1, so that simple bit operations work. This would – of course – not be backwards compatible, but if I did not miss anything in the commit history, the zero flag bug exists from the first release of the file I/O feature, hence backwards compatibility might not be necessary.

@mortbopet Gets the final say in this, but I think that we aren't too worried about keeping backwards compatibility for most of the internal structure of Ripes (the filesystem-cached settings are one exception to this, but these flags aren't used there). So, this sounds like a good solution to me.

If you prefer to keep this PR small, I could also create a separate issue for the writeToFile() bug and/or the discussion on backwards compatibility.

No worries. Unless @mortbopet thinks that a separate discussion on backwards compatibility is necessary, we can keep this in a single issue/PR.

raccog commented 8 months ago

@lukasrad02 I ended up aligning the flags with their Linux counterparts, which can be found here: https://github.com/torvalds/linux/blob/master/include/uapi/asm-generic/fcntl.h.

O_RDONLY, O_WRONLY, and O_RDWR benefit from being distinct values (0, 1, and 2) in the first two bits of the flag because each of those flags are supposed to be mutually exclusive. If both of the first two bits are set, (defined as O_ACCMODE or 3) an error can be returned from the syscall, since this means multiple flags have been set. This makes it easy to ensure that only one of those three flags are set by ensuring that (flag & O_ACCMODE) != O_ACCMODE. Then each bit after the second bit defines an optional flag that can be ORed with the original three.

I made the following assembly program to test out this fix. If you want to keep researching these bugs, I would recommend you mess around with it and see if you can find any more bugs in the syscall implementations.

# Program should write "BBAABBAABBAABBAABBAA" to /tmp/test.txt
# and then read back the file and print the string to the console.
# It should exit with code 0

.data

strlen: .word 2
aa: .string "AA"
bb: .string "BB"

filename: .string "/tmp/test.txt"

open_failed: .string "Failed to open file"
write_failed: .string "Failed to write to file"
read_failed: .string "Failed to read from file"
seek_failed: .string "Failed to seek file to beginning"

# File open modes
O_RDONLY:  .word 0x0000
O_WRONLY:  .word 0x0001
O_RDWR:    .word 0x0002
O_ACCMODE: .word 0x0003
# Additional file flags
O_CREAT:  .word 0x0100
O_EXCL:   .word 0x0200
O_TRUNC:  .word 0x1000
O_APPEND: .word 0x2000

# syscalls
PRINT_STR: .word 4
EXIT:  .word 10
CLOSE: .word 57
LSEEK: .word 62
READ:  .word 63
WRITE: .word 64
EXIT2: .word 93
OPEN:  .word 1024

BUF_START: .word 0x2000

LOOPS: .word 10

.text

# Set a1 to null pointer to skip error message
mv a1, zero

# Error message procedure. Prints error message and exits
# a0 should be set to the error code
# a1 should be set to a pointer to an error message string
err:
beq a1, zero, 1f # Skip if string is null ptr
mv s0, a0 # Save error code
mv a0, a1 # a0: Pointer to error message
lw a7, PRINT_STR # Print syscall
ecall # Prints error message

mv a0, s0 # Get error message
lw a7, EXIT2 # Exit syscall
ecall # Exits program with error code
1:

open:
# Open file
la a0, filename # a0: Pointer to filename
lw a1, O_RDWR # Open flags
lw t0, O_TRUNC # Truncate file
or a1, a1, t0
lw t0, O_CREAT # Create file if it doesnt exist
or a1, a1, t0
lw a7, OPEN # "Open" ecall
ecall # Returns: File descriptor to a0
mv s0, a0 # Save file descriptor in s0

# Display error message and exit if open failed
li t0, -1
bne a0, t0, 1f
bne a0, zero, 1f
la a1, open_failed
j err
1:

lw s1, LOOPS # Number of times to write to
write_loop:
mv a0, s0 # Get file descriptor

write:
# Write string to file
la a1, aa # a1: Pointer to 'AA' string
andi t0, s1, 1 # Write 'AA' if even and 'BB' if odd
bne t0, zero, 1f
la a1, bb # a1: Pointer to 'BB' string
1:
lw a2, strlen # a2: Buffer size
lw a7, WRITE # "Write" ecall
ecall # Returns: Number of bytes written to a0

# Display error message and exit if write failed
lw t0, strlen
beq a0, t0, 1f
la a1, write_failed
j err
1:

addi s1, s1, -1 # Subtract from loop counter
bne s1, zero, write_loop # Write again

seek:
# Seek file to beginning
mv a0, s0 # a0: File descriptor
mv a1, zero # a1: Offset to seek
mv a2, zero # a2: Base of seek
lw a7, LSEEK # LSeek syscall
ecall # Should return 0 to a0 if successful

# Display error message and exit if seek failed
beq a0, zero, 1f
la a1, seek_failed
j err
1:

read:
# Read from file
mv a0, s0 # a0: File descriptor
lw a1, BUF_START # a1: Buffer start
lw a2, strlen # a2: Number of bytes to read
lw t0, LOOPS
mul a2, a2, t0 # Multiply by number of times written
mv s1, a2 # Save number of bytes written to check if successful
lw a7, READ # Read syscall
ecall # Should return strlen to a0 if successful

# Display error message and exit if read failed
beq a0, s1, 1f
la a1, read_failed
j err
1:

print:
# Print string read from file
lw a0, BUF_START # a0: Pointer to string read from file
lw a7, PRINT_STR # Print string syscall
ecall

# Exit successfully
lw a7, EXIT # Successful exit syscall
ecall

While working on this, I found some other bugs surrounding ecall; not sure if they are known yet. I'll create issues for them soon if not. But here's some quick descriptions:

  1. One bug occurs when you clock and then undo an ecall instruction. This causes a SIGABRT, possibly from a failed host machine syscall because this is the disassembled code that causes the SIGABRT:
   0x7ffff5cac829                  e8 e2 5a 05 00       call   0x7ffff5d02310 <getpid>
   0x7ffff5cac82e                  44 89 ea             mov    %r13d,%edx
   0x7ffff5cac831                  89 de                mov    %ebx,%esi
   0x7ffff5cac833                  89 c7                mov    %eax,%edi
   0x7ffff5cac835                  b8 ea 00 00 00       mov    $0xea,%eax
   0x7ffff5cac83a                  0f 05                syscall
-> 0x7ffff5cac83c                  89 c3                mov    %eax,%ebx
   0x7ffff5cac83e                  f7 db                neg    %ebx

This may be more difficult to solve, since no valid stack trace is given from the crash.

  1. The ecall instruction seems to be executed during the fetch stage instead of the execute stage. In the single-cycle processors, this causes the ecall instruction to be executed during the previous instruction's clock cycle instead of during the following clock cycle. On multi-cycle processors without hazard detection units, this causes the error Unknown system call in register 'a7': 0 to appear. This is just what I think is happening from the testing I've done, I have not gone into the VSRTL code to verify my assumptions, yet.
mortbopet commented 8 months ago

Awesome - I'm not too concerned about backwards compatability; but you bring a good point @raccog that there should probably be some mechanism to invalidate cached program state from prior versions.

One bug occurs when you clock and then undo an ecall instruction. This causes a SIGABRT, possibly from a failed host machine syscall because this is the disassembled code that causes the SIGABRT:

It's expected behavior that one cannot undo past an ecall and expect that whatever sideeffects that ecall had would have been undone. It is technically possible, but i think it's a bit silly since it's a massive pain for people to have to implement undo-ing of all ecalls. However, there should obviously not be a crash when doing this, so that's definitely a bug.

@raccog can you add your assembly program add your program as a test in https://github.com/mortbopet/Ripes/blob/master/test/tst_riscv.cpp?

raccog commented 8 months ago

It's expected behavior that one cannot undo past an ecall and expect that whatever sideeffects that ecall had would have been undone. It is technically possible, but i think it's a bit silly since it's a massive pain for people to have to implement undo-ing of all ecalls. However, there should obviously not be a crash when doing this, so that's definitely a bug.

I'll submit this as an issue soon.

@raccog can you add your assembly program add your program as a test in https://github.com/mortbopet/Ripes/blob/master/test/tst_riscv.cpp?

Sure!

@mortbopet What's your opinion on the effects caused from the ecall instruction seemingly executing at the wrong stage/clock cycle?

The ecall instruction seems to be executed during the fetch stage instead of the execute stage. In the single-cycle processors, this causes the ecall instruction to be executed during the previous instruction's clock cycle instead of during the following clock cycle. On multi-cycle processors without hazard detection units, this causes the error Unknown system call in register 'a7': 0 to appear. This is just what I think is happening from the testing I've done, I have not gone into the VSRTL code to verify my assumptions, yet.

I was wondering if you had any guesses as to why this might be happening. I haven't had the chance to look deeply at the VSRTL code yet to check my assumptions.

It only really seems like a problem in processors without hazard detection. I see similar bug reports in #275 and #233. In #233, you stated:

The pipelined processors without hazard units are not supposed to execute a program correctly. They are meant as intermediate design points to be able to illustrate why hazard detection and resolution is required for a pipeline. It should probably be more clear in the program/processor selection dialog that these processors aren't guaranteed to produce correct results..!

So, is the Unknown system call in register 'a7': 0 error message considered normal for processors without hazard detection? Do you think it might be beneficial (or possible) to tell the user in this message that the error was caused by the lack of hazard detection in the processor? Or maybe write as a note in the processor's description that normal ecall usage is likely to fail?

mortbopet commented 8 months ago

So, is the Unknown system call in register 'a7': 0 error message considered normal for processors without hazard detection? Do you think it might be beneficial (or possible) to tell the user in this message that the error was caused by the lack of hazard detection in the processor? Or maybe write as a note in the processor's description that normal ecall usage is likely to fail?

I think it should be made clear(er) in the processor description. I'm unsure whether it's a good idea to add it to the error message, since it'll require some fairly processor specific logic (when's it a hazard related issue vs. actual unknown system call?) which (as you know) i like to keep away from Ripes (i.e. Ripes is a generic front-end for ISAs and their processor implementations).