Closed Mellvik closed 2 years ago
Hello @Mellvik! Here's comments on each of your findings:
malloc
C library routine tries to get memory using sbrk
which, when it fails, the kernel reports what you're seeing (since otherwise sometimes we'd never know about the issue), and then usually tries to grab half as much memory, down to around 512 bytes. The error is reported back to the calling application, and currently all memory allocation is checked by ktcp. I would suggest that we tackle memory allocation behavior in the C library in v0.6 rather than attempting to change low level behavior now.The issue with ktcp running out of memory is likely compounded in that there really isn't good error reporting between the kernel which allocates sockets and ktcp which does the work. In other words, ktcp may run out of memory and know it, but can't tell the kernel about it, so an application won't know. We could possibly try closing sockets when out of memory, but the new failed allocation may not be associated with the socket ktcp happens to be working with at the time. I'll make a note of this and look further into it, but it's a known issue for all applications, which is why the kernel printk has been left in.
alarm - actually there is no C library alarm, an old sleep
routine used something like it but is commented out. There is not system call either. I'll add this to the list to implement, as it will be useful for handling connect or accept timeouts.
overwriting a running application's code file: "Is this a feature or a bug?" - answer: dangerous!! ELKS tries to implement shared-text (code) so what you're seeing is a dangerous mix of re-using the old (in-use) code segment while trying to read the new data segment from the new file. So you'd be running a mix of the old and new application! There is no easy "fix" for this activity other than perhaps disallowing a creat
(open for write) on a running application - but that might not be what you want.
I've tried to recreate the too many open files problem I reported about a month ago, no luck. There is no console message, but the applications is responding with the correct perror() message - I would suggest closing that issue, and opening it again should you be able to reproduce. I've looked at it and can't find the reason for the problem.
Thank you!
Thank you @ghaerr,
thorough as usual!!
ktcp: Unfortunately, the behavior of ELKS programs running out of memory isn't great right now, and I'm not sure what to do about it. The lower level of the malloc C library routine tries to get memory using sbrk which, when it fails, the kernel reports what you're seeing (since otherwise sometimes we'd never know about the issue), and then usually tries to grab half as much memory, down to around 512 bytes. The error is reported back to the calling application, and currently all memory allocation is checked by ktcp. I would suggest that we tackle memory allocation behavior in the C library in v0.6 rather than attempting to change low level behavior now. Makes sense! The issue with ktcp running out of memory is likely compounded in that there really isn't good error reporting between the kernel which allocates sockets and ktcp which does the work. In other words, ktcp may run out of memory and know it, but can't tell the kernel about it, so an application won't know. We could possibly try closing sockets when out of memory, but the new failed allocation may not be associated with the socket ktcp happens to be working with at the time. I'll make a note of this and look further into it, but it's a known issue for all applications, which is why the kernel printk has been left in.
OK, I seem to remember we've been down part of this road before alarm - actually there is no C library alarm, an old sleep routine used something like it but is commented out. There is not system call either. I'll add this to the list to implement, as it will be useful for handling connect or accept timeouts.
Actually the stub that satisfies the reference to alarm() should be removed, don't you think? What about setitimer()? overwriting a running application's code file: "Is this a feature or a bug?" - answer: dangerous!! ELKS tries to implement shared-text (code) so what you're seeing is a dangerous mix of re-using the old (in-use) code segment while trying to read the new data segment from the new file. So you'd be running a mix of the old and new application! There is no easy "fix" for this activity other than perhaps disallowing a creat (open for write) on a running application - but that might not be what you want. Actuall, that would make sense. I think avoiding confusion must be high priority, and an -EBUSY (or something) return when the file is executing, wouldn't be unreasonable at all. I've tried to recreate the too many open files problem I reported about a month ago, no luck. There is no console message, but the applications is responding with the correct perror() message - I would suggest closing that issue, and opening it again should you be able to reproduce. I've looked at it and can't find the reason for the problem.
Agreed.
Thanks again.
—M
Actually the stub that satisfies the reference to alarm() should be removed, don't you think?
Yes, I'll look into that. Turns out the stub is automatically generated by the system call list syscall.dat, but then routed to an invalid system call -EINVAL return.
What about setitimer()?
Setitimer appears to allow a timeout return from select, but doesn't send a signal. We're probably better off implementing alarm if the requirement is to return from a system call like accept or connect.
I think avoiding confusion must be high priority, and an -EBUSY (or something) return when the file is executing, wouldn't be unreasonable at all.
Ok - this would cause a transfer failure in the use case you described. I'll add this to the list!
Setitimer appears to allow a timeout return from select, but doesn't send a signal.
Oops! My statement above is incorrect, SIGALARM is sent. setitimer
is also unimplemented in ELKS - I'll look into using this or alarm
for the underlying system call.
I think avoiding confusion must be high priority, and an -EBUSY (or something) return when the file is executing, wouldn't be unreasonable at all.
Ok - this would cause a transfer failure in the use case you described. I'll add this to the list!
That would be fine. FTP handles all kinds of destination file 'inaccessibility' fine by now.
Thank you!
-M
@ghaerr,
issue no 3 on the original post in this thread fell through the cracks.
I finally looked into this, and the problem appears only if DEBUG_TCPDATA is activated in ktcp
´s config.h.
It's real annoyance because it screws up the console display all the time. And it's independent of the current state of the ^P flag.
What happens is that for every (approx) 12k bytes received, ktcp will print about 1035 blanks to the console - in what seems like a buffer flush operation. Just guessing, but a cursory look a the code has not made my any wiser.
--M
I think we can close this issue now. All items have been addressed (thank you @ghaerr), 2 are still open, one of them specifically placed in the 0.6 basket (memory allocation).
@ghaerr, is wait
on the 0.5 or 0.6 list?
Thank you!
-M
is wait on the 0.5 or 0.6 list?
Hmm, not sure about which wait
issue you're talking about?
I have the following items, possibly from this issue on my list. All these would be for v0.6:
alarm
, but won't work well across socket calls.connect
and possibly accept
. Default 10 seconds for connect
.Anything else this issue brought up?
I think we can close this issue now.
Agreed. We should open a separate master v0.6 tracking issue after v0.5.
Oooops, my mistake - eyes say 'alarm', fingers say 'wait'. Sorry.
Thank you!
—M
is wait on the 0.5 or 0.6 list?
Hmm, not sure about which wait issue you're talking about?
I have the following items, possibly from this issue on my list. All these would be for v0.6:
Possibly implement alarm, but won't work well across socket calls. Implement SO_TIMEOUT socket option for connect and possibly accept. Default 10 seconds for connect. Look into C library memory allocation and sys_brk messages Anything else this issue brought up?
I think we can close this issue now.
Agreed. We should open a separate master v0.6 tracking issue after v0.5.
— Reply to this email directly, view it on GitHub https://github.com/jbruchon/elks/issues/1090#issuecomment-1014739400, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3WGOGPTZM5Y62KVZ6PQXTUWRDSVANCNFSM5LKF4TIQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.
Issues encountered during the last leg of
ftp
/ftpd
testing/development. They may need their own individual problem reports for discussions, (or may be known issues, even features), but let's start here anyway. All verifiable in QEMU.ktcp
: When an application repeatedly fails to close opened sockets,ktcp
will eventually run out of memory. When it happens, that condition is not handled gracefully: A number ofsys_brk
errors are reported on the console, apparently not affecting the application to begin with. Eventually, the system hangs, see screenshot. It would be very beneficial if the out-of-resource condition was communicated back to the application, avoiding the hang.alarm
: Thealarm()
library routine exists, but does not seem to work. This may be a known issue.[minor annoyance] A (presumably) debug output from
ktcp
to the console is emitting 10 or so blank lines every so often depending on the amount of traffic.If a program is running - and recently started so some of its blocks are in the buffer cache, and the program file is overwritten, and attempt to run the (now replaced) program will fail until the original copy has exited. Example:
ftp
is running onttyS0
. A new copy offtp
is retrieved via the net onttyS1
, overwriting the disk copy of the running instance. An attempt to run the newftp
will fail - either just exit or hang, until the running copy has terminated. Is this a feature or a bug?@ghaerr, I've tried to recreate the too many open files problem I reported about a month ago, no luck. There is no console message, but the applications is responding with the correct
perror()
message.