Misc issues - Githubissues

ghaerr / elks

Embeddable Linux Kernel Subset - Linux for 8086

Other

1.03k stars 108 forks source link

Misc issues #1090

Closed Mellvik closed 2 years ago

Mellvik commented 2 years ago

Issues encountered during the last leg of ftp/ftpd testing/development. They may need their own individual problem reports for discussions, (or may be known issues, even features), but let's start here anyway. All verifiable in QEMU.

ktcp: When an application repeatedly fails to close opened sockets, ktcp will eventually run out of memory. When it happens, that condition is not handled gracefully: A number of sys_brk errors are reported on the console, apparently not affecting the application to begin with. Eventually, the system hangs, see screenshot. It would be very beneficial if the out-of-resource condition was communicated back to the application, avoiding the hang.
alarm: The alarm() library routine exists, but does not seem to work. This may be a known issue.
[minor annoyance] A (presumably) debug output from ktcp to the console is emitting 10 or so blank lines every so often depending on the amount of traffic.
If a program is running - and recently started so some of its blocks are in the buffer cache, and the program file is overwritten, and attempt to run the (now replaced) program will fail until the original copy has exited. Example: ftp is running on ttyS0. A new copy of ftp is retrieved via the net on ttyS1, overwriting the disk copy of the running instance. An attempt to run the new ftp will fail - either just exit or hang, until the running copy has terminated. Is this a feature or a bug?
```
# sum ftp            <---- this is the running copy (in a different window)
57517     42
# sum ftp           <----- New copy transferred via ftp, overwrote the original
27059     42
# ./ftp               <----- just returns to prompt
# sync
# ./ftp              <---- just returns to prompt
# ./ftp              <---- now the other copy has terminated
ftp> pwd
Not connected.
ftp> open 10.0.2.2
Connected to 10.0.2.2.
...
```
@ghaerr, I've tried to recreate the too many open files problem I reported about a month ago, no luck. There is no console message, but the applications is responding with the correct perror() message.

ghaerr commented 2 years ago

Hello @Mellvik! Here's comments on each of your findings:

ktcp: Unfortunately, the behavior of ELKS programs running out of memory isn't great right now, and I'm not sure what to do about it. The lower level of the malloc C library routine tries to get memory using sbrk which, when it fails, the kernel reports what you're seeing (since otherwise sometimes we'd never know about the issue), and then usually tries to grab half as much memory, down to around 512 bytes. The error is reported back to the calling application, and currently all memory allocation is checked by ktcp. I would suggest that we tackle memory allocation behavior in the C library in v0.6 rather than attempting to change low level behavior now.

The issue with ktcp running out of memory is likely compounded in that there really isn't good error reporting between the kernel which allocates sockets and ktcp which does the work. In other words, ktcp may run out of memory and know it, but can't tell the kernel about it, so an application won't know. We could possibly try closing sockets when out of memory, but the new failed allocation may not be associated with the socket ktcp happens to be working with at the time. I'll make a note of this and look further into it, but it's a known issue for all applications, which is why the kernel printk has been left in.

alarm - actually there is no C library alarm, an old sleep routine used something like it but is commented out. There is not system call either. I'll add this to the list to implement, as it will be useful for handling connect or accept timeouts.
overwriting a running application's code file: "Is this a feature or a bug?" - answer: dangerous!! ELKS tries to implement shared-text (code) so what you're seeing is a dangerous mix of re-using the old (in-use) code segment while trying to read the new data segment from the new file. So you'd be running a mix of the old and new application! There is no easy "fix" for this activity other than perhaps disallowing a creat (open for write) on a running application - but that might not be what you want.
I've tried to recreate the too many open files problem I reported about a month ago, no luck. There is no console message, but the applications is responding with the correct perror() message - I would suggest closing that issue, and opening it again should you be able to reproduce. I've looked at it and can't find the reason for the problem.

Thank you!

Mellvik commented 2 years ago

Thank you @ghaerr,

thorough as usual!!

ktcp: Unfortunately, the behavior of ELKS programs running out of memory isn't great right now, and I'm not sure what to do about it. The lower level of the malloc C library routine tries to get memory using sbrk which, when it fails, the kernel reports what you're seeing (since otherwise sometimes we'd never know about the issue), and then usually tries to grab half as much memory, down to around 512 bytes. The error is reported back to the calling application, and currently all memory allocation is checked by ktcp. I would suggest that we tackle memory allocation behavior in the C library in v0.6 rather than attempting to change low level behavior now. Makes sense! The issue with ktcp running out of memory is likely compounded in that there really isn't good error reporting between the kernel which allocates sockets and ktcp which does the work. In other words, ktcp may run out of memory and know it, but can't tell the kernel about it, so an application won't know. We could possibly try closing sockets when out of memory, but the new failed allocation may not be associated with the socket ktcp happens to be working with at the time. I'll make a note of this and look further into it, but it's a known issue for all applications, which is why the kernel printk has been left in.

OK, I seem to remember we've been down part of this road before alarm - actually there is no C library alarm, an old sleep routine used something like it but is commented out. There is not system call either. I'll add this to the list to implement, as it will be useful for handling connect or accept timeouts.

Actually the stub that satisfies the reference to alarm() should be removed, don't you think? What about setitimer()? overwriting a running application's code file: "Is this a feature or a bug?" - answer: dangerous!! ELKS tries to implement shared-text (code) so what you're seeing is a dangerous mix of re-using the old (in-use) code segment while trying to read the new data segment from the new file. So you'd be running a mix of the old and new application! There is no easy "fix" for this activity other than perhaps disallowing a creat (open for write) on a running application - but that might not be what you want. Actuall, that would make sense. I think avoiding confusion must be high priority, and an -EBUSY (or something) return when the file is executing, wouldn't be unreasonable at all. I've tried to recreate the too many open files problem I reported about a month ago, no luck. There is no console message, but the applications is responding with the correct perror() message - I would suggest closing that issue, and opening it again should you be able to reproduce. I've looked at it and can't find the reason for the problem.

Agreed.

Thanks again.

—M

ghaerr commented 2 years ago

Actually the stub that satisfies the reference to alarm() should be removed, don't you think?

Yes, I'll look into that. Turns out the stub is automatically generated by the system call list syscall.dat, but then routed to an invalid system call -EINVAL return.

What about setitimer()?

Setitimer appears to allow a timeout return from select, but doesn't send a signal. We're probably better off implementing alarm if the requirement is to return from a system call like accept or connect.

I think avoiding confusion must be high priority, and an -EBUSY (or something) return when the file is executing, wouldn't be unreasonable at all.

Ok - this would cause a transfer failure in the use case you described. I'll add this to the list!

ghaerr commented 2 years ago

Setitimer appears to allow a timeout return from select, but doesn't send a signal.

Oops! My statement above is incorrect, SIGALARM is sent. setitimer is also unimplemented in ELKS - I'll look into using this or alarm for the underlying system call.

Mellvik commented 2 years ago

I think avoiding confusion must be high priority, and an -EBUSY (or something) return when the file is executing, wouldn't be unreasonable at all.

Ok - this would cause a transfer failure in the use case you described. I'll add this to the list!

That would be fine. FTP handles all kinds of destination file 'inaccessibility' fine by now.

Thank you!

-M

Mellvik commented 2 years ago

@ghaerr, issue no 3 on the original post in this thread fell through the cracks. I finally looked into this, and the problem appears only if DEBUG_TCPDATA is activated in ktcp´s config.h.

It's real annoyance because it screws up the console display all the time. And it's independent of the current state of the ^P flag.

What happens is that for every (approx) 12k bytes received, ktcp will print about 1035 blanks to the console - in what seems like a buffer flush operation. Just guessing, but a cursory look a the code has not made my any wiser.

--M

Mellvik commented 2 years ago

I think we can close this issue now. All items have been addressed (thank you @ghaerr), 2 are still open, one of them specifically placed in the 0.6 basket (memory allocation).

@ghaerr, is wait on the 0.5 or 0.6 list?

Thank you!

-M

ghaerr commented 2 years ago

is wait on the 0.5 or 0.6 list?

Hmm, not sure about which wait issue you're talking about?

I have the following items, possibly from this issue on my list. All these would be for v0.6:

Possibly implement alarm, but won't work well across socket calls.
Implement SO_TIMEOUT socket option for connect and possibly accept. Default 10 seconds for connect.
Look into C library memory allocation and sys_brk messages

Anything else this issue brought up?

I think we can close this issue now.

Agreed. We should open a separate master v0.6 tracking issue after v0.5.

Mellvik commented 2 years ago

Oooops, my mistake - eyes say 'alarm', fingers say 'wait'. Sorry.

Thank you!

—M

is wait on the 0.5 or 0.6 list?

Hmm, not sure about which wait issue you're talking about?

I have the following items, possibly from this issue on my list. All these would be for v0.6:

Possibly implement alarm, but won't work well across socket calls. Implement SO_TIMEOUT socket option for connect and possibly accept. Default 10 seconds for connect. Look into C library memory allocation and sys_brk messages Anything else this issue brought up?

I think we can close this issue now.

Agreed. We should open a separate master v0.6 tracking issue after v0.5.

— Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/1090#issuecomment-1014739400, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOGPTZM5Y62KVZ6PQXTUWRDSVANCNFSM5LKF4TIQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.