ml4ai / tomcat

ToMCAT: Theory of Mind-based Cognitive Architecture for Teams
https://ml4ai.github.io/tomcat/
MIT License

Aurora crashes due to hardware buffer overflow #279

Closed adarshp closed 2 years ago

adarshp commented 2 years ago

The fNIRS recording program (Aurora) crashes at seemingly random times. This is not good.

We will look at the log files and contact NIRx (we need a detailed bug report).

We thought it might be related to a buffer problem with the high video streaming load, but that has been ruled out now since we reduced the face and screen capture resolution.

Kobus:

Eric: macOS might be freezing Aurora to save on resources. We should look into whether this is the case (and prevent it from doing that if so).

val-pf commented 2 years ago

Re: Updates: Aurora notifies the user of any available software updates when started, and we did not receive any notifications when we opened Aurora in the past month or so.

I have reached out to NIRx support with the log files and a broad problem description. Please feel free to add anything that might be helpful to the conversation here or to the email.

val-pf commented 2 years ago

Update 04/05 (Chinmai, Eric, Vatsav, Rick & Valeria): NIRx suggests running an activity monitor (which we already did last week, and we didn't see any issues with CPU/memory). We are running this now in the following ways:

  1. Running on Wi-Fi connection instead of USB connection to NIRSport
    • 5-10% CPU use when only Aurora is running (on Wi-Fi)
    • will report if it crashes (Update at 2.30pm: no crash yet!)
  2. Running Aurora on Eric's / Caleb's Mac laptops
    • to exclude any issues with the iMacs (they have better chip, RAM?)
    • Chinmai also offered to run it on his Windows laptop for comparison

Alternative ways to treat "symptoms", not the underlying cause:

Caleb ran an overnight test yesterday (running only Aurora) and one of the three crashed after 3.5 hrs, but it is unclear whether that was a battery failure or a device failure.

val-pf commented 2 years ago

Update 04/06: Aurora ran with no problems for ~24 hrs on Wi-Fi. However, we still need to work out whether that is due to USB vs. Wi-Fi or the load on the iMacs. We will set up another overnight test today.

val-pf commented 2 years ago

Update 04/07: Aurora ran with no problems for ~24 hrs on USB. Thus, it seems that it might be a problem with the buffer on the iMacs, meaning that running things in addition to Aurora (eye tracking, Mumble, screen recording, face recording, Minecraft, Firefox, and/or baseline) likely causes the issue.

kobus-barnard commented 2 years ago

Thanks! (And thanks for the follow-up with NIRx.)

One next step from our end would be to get a better understanding of how much of the resources each of those processes is using, so we can work on reducing the load. Some things could potentially be moved off of the iMac. Others could potentially be trimmed.

In addition, Eric suggested trying to give Aurora higher priority, and perhaps giving the others lower priority. This can be done with a script and the "nice" command once things are all up and running. I do not have a sense of how likely this is to help because it is not clear what about resource contention is causing Aurora to crash. Most likely it is memory, not CPU.

eduongAZ commented 2 years ago

Update: Aurora ran with no problems for 2.75 hours on USB while streaming fNIRS, eye-tracking, screen, and webcam data and while playing Minecraft for 2 hours.

Solution: finding the Aurora process ID (PID) and using the renice command to increase the priority of Aurora on the iMacs (for example, sudo renice -10 1234 with PID=1234) might have solved the buffer issue.
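The renice step above could be scripted roughly as follows. This is only a sketch under assumptions: the process name "Aurora" passed to pgrep is a guess (check Activity Monitor or ps for the real name), and the helper names are made up for illustration.

```python
import subprocess


def find_pids(process_name):
    """Return the PIDs of processes whose name exactly matches process_name.

    Uses `pgrep -x`; an empty list is returned when nothing matches.
    """
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        capture_output=True,
        text=True,
        check=False,  # pgrep exits non-zero when there is no match
    )
    return [int(token) for token in result.stdout.split()]


def build_renice_command(pid, priority=-10):
    """Build the renice command; negative priorities need root, hence sudo."""
    return ["sudo", "renice", str(priority), "-p", str(pid)]


def raise_priority(process_name="Aurora", priority=-10):
    """Renice every matching process. "Aurora" is an assumed process name."""
    for pid in find_pids(process_name):
        subprocess.run(build_renice_command(pid, priority), check=True)
```

Calling raise_priority() once everything is up and running would apply the same -10 value used in the manual command.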

kobus-barnard commented 2 years ago

Awesome! Also, Chinmai and I were discussing using external webcams that go to CAT via USB to reduce load and give us more resolution options. Either way, we are probably collecting face data with excessive resolution. It seems it is 1/2 of the data volume collected.

eduongAZ commented 2 years ago

Streaming all data (eye tracking, fNIRS, EEG, screen, and webcam) seems to cost only 45% of total CPU power. We still have 55% CPU left for baseline tasks and Minecraft, which should not take much CPU time. The stress test on Tuesday, April 12 will give us a detailed report of the percentage of CPU used by each application.

ffmpeg still takes the most CPU, at least 30%.
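For a quick per-application breakdown before the stress test, something like the sketch below could be used. It sums the %CPU column of a single ps snapshot per command, so the numbers are only a rough indication; the function names are hypothetical.

```python
import subprocess
from collections import defaultdict


def cpu_by_command(ps_output):
    """Sum %CPU per command from `ps -A -o pcpu= -o comm=` output."""
    totals = defaultdict(float)
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pcpu, command = parts
        try:
            totals[command] += float(pcpu)
        except ValueError:
            continue  # skip lines whose first field is not a number
    return dict(totals)


def top_consumers(totals, n=5):
    """Return the n (command, %CPU) pairs with the highest totals."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


def snapshot():
    """Take one ps snapshot and return per-command %CPU totals."""
    output = subprocess.run(
        ["ps", "-A", "-o", "pcpu=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return cpu_by_command(output)
```

Printing top_consumers(snapshot()) at a few points during a session would show whether ffmpeg stays the largest consumer.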

kobus-barnard commented 2 years ago

OK, that sounds great. I will look forward to the results of that test.

We should be careful about capturing data at higher resolution than we need, as long term storage could become an issue. I think we budgeted for file servers for around this phase of the project for this reason, but keeping each experiment to more like 100GB instead of 300GB would be good.

Also, I believe we are storing some data for later transfer that ideally should be transferred in real time eventually.

Anyway, if we want to go for external web cams, we probably should go over the ethernet, as USB cables need to be short for high bandwidth. So we would want to get some kind of an adaptor or (better) a native ethernet web cam.

If none of that works, then we could use standard webcams plugged into the USB-C ports on the iMacs.

eduongAZ commented 2 years ago

Since we have about 50% of CPU power for baseline tasks and Minecraft, and the network speed seems to not be the problem with the current setup, I think we don't need external cameras.

However, if we do want to remove the 30% CPU time of ffmpeg (which I don't think is the case at the moment), plugging the webcam into the iMac does not remove ffmpeg from the iMacs, since we still need to extract the images from the webcam and send them to CAT, which is what we are currently doing with the default camera on the iMac. If we plan to use external cameras, then they should not be plugged into the iMacs.

kobus-barnard commented 2 years ago

This is basically correct. Plugging an external camera into the iMac would only help if you wanted to capture frames at a lower resolution than the iMac camera does. This would reduce, but not eliminate, the load, and also reduce the network use and storage (same as it would if they were fed into CAT). However, storage can be reduced after the fact. Anyway, let's see how far we get with restricting the time the camera is on to when the experiment is actually running, as you have already suggested.

eduongAZ commented 2 years ago

April 12, 2022 update: Aurora did not crash during the testing session of about 2 hours with priority set to -10.

Streaming EEG, fNIRS, and eye-tracking takes 25-30% of CPU. Streaming EEG, fNIRS, eye-tracking, webcam, screen, and Mumble while playing Minecraft takes 45-55% of CPU.

val-pf commented 2 years ago

Aurora crashed in today's pilot despite the reassigned priority.

kobus-barnard commented 2 years ago

Can one of the CS students take a look on the iMac where it crashed by opening up the "Console" app and checking whether there is any record in the logs and reports, which you can browse by clicking items in the left-side menu? Thanks!
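The same check can be done from a script. The sketch below assumes the per-user crash reports that Console shows live in ~/Library/Logs/DiagnosticReports and that a report for Aurora would have "Aurora" in its file name; both are assumptions to verify on the iMac, and the function name is made up.

```python
from pathlib import Path


def find_reports(report_dir, name_fragment):
    """Return report files whose name contains name_fragment, newest first."""
    directory = Path(report_dir).expanduser()
    if not directory.is_dir():
        return []
    matches = [
        path
        for path in directory.iterdir()
        if path.is_file() and name_fragment.lower() in path.name.lower()
    ]
    # Sort by modification time so the most recent crash comes first.
    return sorted(matches, key=lambda path: path.stat().st_mtime, reverse=True)


def list_aurora_reports():
    """Print candidate crash reports; directory and name are assumptions."""
    for report in find_reports("~/Library/Logs/DiagnosticReports", "Aurora"):
        print(report)
```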

eduongAZ commented 2 years ago

We found that Aurora crashed after launching Minecraft clients on the iMacs. The issue might be that launching the Minecraft client took most of the CPU time, causing a buffer overflow problem in Aurora. See #302 for further discussion.

kobus-barnard commented 2 years ago

OK, so that should be reproducible, even if it does not always happen. But I gather starting it up again after such a crash works fine. Can we confirm this?

Also, does Aurora write anything to the log, or does the process simply die and the OS report that it has exited? If so, is there an exit code? A Unix-like OS always knows the exit code of its failed children, and there are different codes for seg faults, etc.
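As a rough illustration of reading an exit status, the sketch below launches a command as a child process and reports how it ended. It only helps if Aurora can be started from a script as a child process, which is an assumption; the function names are hypothetical.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Explain a subprocess return code (POSIX semantics)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess reports death by signal as the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with code {returncode}"


def run_and_describe(command):
    """Run command (a list of arguments), wait for it, and describe its exit."""
    completed = subprocess.run(command, check=False)
    return describe_exit(completed.returncode)
```

A segmentation fault would show up as "killed by SIGSEGV", while an ordinary failure would show a positive exit code.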

eduongAZ commented 2 years ago

Issue #302 shows that launching Minecraft is not the only thing that makes Aurora crash. The hypothesis is that there are processes that take away CPU time from Aurora, causing the Aurora hardware to crash due to a buffer overflow.

CalebUAz commented 2 years ago

I believe #307 is a potential fix?

eduongAZ commented 2 years ago

We will still test Aurora on the Mac laptops as an alternative solution to #307.

eduongAZ commented 2 years ago

Aurora did not crash when running on the laptops, except when one of the laptops went to sleep, which paused Aurora and caused a buffer overflow.
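One possible way to keep a laptop awake while Aurora is recording is the macOS caffeinate tool. The sketch below is untested on the lab machines, and the helper names are made up; the Aurora PID has to be looked up first.

```python
import subprocess


def build_caffeinate_command(pid, keep_display_on=False):
    """Build a macOS `caffeinate` command that holds off sleep until pid exits.

    -i prevents idle sleep, -s prevents system sleep while on AC power,
    -d (optional) keeps the display awake, -w waits for the given PID.
    """
    command = ["caffeinate", "-i", "-s"]
    if keep_display_on:
        command.append("-d")
    command += ["-w", str(pid)]
    return command


def keep_awake_while_running(pid):
    """Block until the process with this PID exits, preventing sleep meanwhile."""
    subprocess.run(build_caffeinate_command(pid), check=True)
```

Note that -s only applies on AC power, so the laptop would need to stay plugged in.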

CalebUAz commented 2 years ago

Hi @adarshp & @kobus-barnard,

@eduongAZ, @kay-of-a, @rchamplin and I observed that Aurora on leopard doesn't usually crash, so we decided to test whether the hardware or the iMac/macOS is the issue. I swapped the fNIRS devices between tiger and leopard, had Aurora running for a couple of minutes, and then launched Minecraft on the iMacs; all of them then crashed.

I swapped the devices back to their original places and performed the same test again. This time lion and leopard crashed.

It's hard to come to a conclusion with such random behavior.

I recently observed that this device had one of its LEDs showing red. @rchamplin could you ask the NIRx folks what this means? Image

val-pf commented 2 years ago

nirsport

It's an error message, and you need to restart the device. This is described in the "getting started" guide (paper) in the lab. I agree, the behavior is very random indeed!

CalebUAz commented 2 years ago

One thing to test is setting a larger scale on the Aurora visualizer. @rchamplin and I observed that when we increase the scale and the time window for plotting, Aurora "generally" doesn't crash.

adarshp commented 2 years ago

Update: NIRx is shipping us a Windows machine that they have tested; hopefully this will work. We will reopen the issue if it does not.