jetperch / pyjoulescope

Joulescope driver and utilities
https://www.joulescope.com
Apache License 2.0
37 stars 11 forks source link

Consistent crash on Mac OSX #9

Closed tannewt closed 4 years ago

tannewt commented 4 years ago

Hi! I'm consistently crashing pyjoulescope when I disrupt the USB bus by turning on a CircuitPython board which is then bulk read for mass storage by the OS.

I'm on Mac OSX 10.15.3. the devices are connected through the same hub but I've seen it on separate hubs to different USB ports on the MacbookPro. I bet it's all one high speed and full speed bus.

Here is my command line output:

*****************************
* Joulescope User Interface *
* UI Version = 0.8.6        *
* Driver Version = 0.8.7    *
*****************************

info = {
  "joulescope": {
    "ui_version": "0.8.6",
    "driver_version": "0.8.7"
  },
  "platform": {
    "name": "darwin",
    "python_version": "3.7.6 (default, Dec 30 2019, 19:38:26) \n[Clang 11.0.0 (clang-1100.0.33.16)]",
    "platform": "Darwin-19.3.0-x86_64-i386-64bit",
    "processor": "i386",
    "executable": "/Volumes/GitRepos/pyjoulescope_ui/.venv/bin/python3",
    "frozen": false,
    "paths": {
      "dirs": {
        "app_path": "/Users/tannewt/Library/Application Support/joulescope",
        "config": "/Users/tannewt/Library/Application Support/joulescope/config",
        "log": "/Users/tannewt/Library/Application Support/joulescope/log",
        "firmware": "/Users/tannewt/Library/Application Support/joulescope/firmware",
        "data": "/Users/tannewt/Documents/joulescope"
      },
      "files": {
        "config": "/Users/tannewt/Library/Application Support/joulescope/config/joulescope_config.json"
      }
    }
  }
}

=====
ERROR:2020-03-13 17:14:14,752:device_thread.py:151:joulescope.usb.device_thread:device thread hung: control_transfer_in - FORCE CLOSE
fish: 'python3 -m joulescope_ui' terminated by signal SIGSEGV (Address boundary error)

Let me know if there is more debugging output I can give. Thanks!

mliberty1 commented 4 years ago

Hi @tannewt and thanks for reporting this issue! I am trying to duplicate it, and I have had no luck so far. I am using a 2018 13" MacBook Pro also running macOS X 10.15.3. I see that you are running from source / PyPi, so I tried that first. Here is my setup:

*****************************
* Joulescope User Interface *
* UI Version = 0.8.6        *
* Driver Version = 0.8.7    *
*****************************

info = {
  "joulescope": {
    "ui_version": "0.8.6",
    "driver_version": "0.8.7"
  },
  "platform": {
    "name": "darwin",
    "python_version": "3.7.6 (default, Dec 30 2019, 19:38:28) \n[Clang 11.0.0 (clang-1100.0.33.16)]",
    "platform": "Darwin-19.3.0-x86_64-i386-64bit",
    "processor": "i386",
    "executable": "/usr/local/opt/python/bin/python3.7",
    "frozen": false,
    "paths": {
      "dirs": {
        "app_path": "/Users/mliberty/Library/Application Support/joulescope",
        "config": "/Users/mliberty/Library/Application Support/joulescope/config",
        "log": "/Users/mliberty/Library/Application Support/joulescope/log",
        "firmware": "/Users/mliberty/Library/Application Support/joulescope/firmware",
        "data": "/Users/mliberty/Documents/joulescope"
      },
      "files": {
        "config": "/Users/mliberty/Library/Application Support/joulescope/config/joulescope_config.json"
      }
    }
  }
}

I am using a Feather M4 Express running CircuitPython 5.0.0-beta.5. It currently has the stock image provided by Adafruit. I can insert the Feather, remove it, and reset it. I see temporary Joulescope data loss on these events, but at least so far, no crashes like you see. I typically see something like this:

WARNING:2020-03-14 08:54:15,375:view.py:224:joulescope.stream_buffer:_stats_get no samples
WARNING:2020-03-14 08:54:15,375:view.py:224:joulescope.stream_buffer:_stats_get no samples
WARNING:2020-03-14 08:54:15,689:main.py:1291:joulescope_ui.main:status RED: d_sample_id=1084230, d_sample_missing_count=310086

I see the same behavior with inserting a normal USB flash drive. While dropping data is not good, it is much better than crashing. Joulescope currently uses USB bulk IN, so it really cannot control what the host computer does. I have been thinking about implementing USB isochronous IN which may also solve this issue.

I wonder if there is some other library issue. Do you see the same behavior if you use the official 0.8.6 distribution?

https://download.joulescope.com/joulescope_install/0/8/6/joulescope_0_8_6.dmg

I have a few questions:

  1. Do you see the same behavior with a normal USB flash drive?
  2. Have you noticed if the contents or code on the CircuitPython board make a difference? If you have, could you give me some pointers on what to try? I am happy to order the exact board that you have just to remove that variable.

Thanks!

tannewt commented 4 years ago

I haven't tried this with a normal USB drive and wouldn't expect an issue. I'm adding sleep support to CircuitPython which isn't work yet. CIRCUITPY doesn't even show at the moment. So ya, bad USB things are happening. :-) (You are in good company. My bad USB code in the past has also crashed the Mac itself.)

I expect the data blip that occurs but would like it to not crash.

I just reproed it again with the same Joulescope setup and got more info from Mac. I put it all here: https://gist.github.com/tannewt/49544eb93b776c5adae0264bba30a321 Here is my current CircuitPython code for the Feather M4 Express: circuitpython-feather_m4_express-76ca13b-dirty.uf2.zip We can get you any needed hardware for free as well but I had no trouble reproing with a Feather M4.

The official distribution also crashes after a few seconds with the Feather M4 connected. The crash info for it is here: https://gist.github.com/tannewt/7bb1e44ddd9564fe3bdfaf0f4d250e91

Clearly this is a corner of a corner case but it'd still be nice if it hung on while I abuse USB. Thanks!

mliberty1 commented 4 years ago

Thanks for the additional information! When developing Joulescope, I caused multiple BSODs in the USB and WinUSB stacks. Fun. Recently, I was able to cause a BSOD with just a bad USB Audio Class descriptor, so I am very familiar with USB woes. However, the issue you have found is likely in the Joulescope user-space code on top of libusb.

I was not able to duplicate your issue on my MacBook. After flashing the Feather M4 Express, I do see that the Feather enumerates correctly and shows up as MSC. The boot_out.txt contains:

Adafruit CircuitPython 5.0.0-97-g76ca13b6e-dirty on 2020-03-16; Adafruit Feather M4 Express with samd51j19

With Joulescope streaming using UI 0.8.6, I tried unplugging and resetting the Feather, but the UI kept on running (despite loss of samples during these events). I tried no hub, with hub, on same hub, and on different hubs, but it just kept running.

I took a look at the stack traces (thank you!), and I found at least one issue that could explain what you are seeing. The Joulescope libusb user-space driver accesses the native libusb library using ctypes. All the ctypes structures are Python objects. When Python objects go out of scope, they will get garbage collected and become invalid pointers for the native libusb code. I found a few exceptional paths where the Python objects would go out of scope before the cancelled transfers actually complete. The callbacks invoked by the native libusb code are then invalid, which means undefined behavior. I am working to fix these cases now, but it may take some time...

mliberty1 commented 4 years ago

I have a fix that is working so far in the develop branches of both pyjoulescope and pyjoulescope_ui. Assuming that you are cool with running from source, here is how you can try this out:

pip3 uninstall joulescope joulescope_ui
cd pyjoulescope
export PYTHONPATH=`pwd`:`pwd`_ui
git pull
git checkout develop
python3 setup.py build_ext --inplace
cd ../pyjoulescope_ui
git pull
git checkout develop
python3 setup.py qt
python3 -m joulescope

Note that if you don't already have the repos cloned, you can do this first:

git clone --branch develop https://github.com/jetperch/pyjoulescope.git
git clone --branch develop https://github.com/jetperch/pyjoulescope_ui.git

The most interesting commit is 74a2904. b706b31 also improved thread joining for some of these exception cases.

I look forward to hearing what you see!

tannewt commented 4 years ago

I think I got it all updated locally. I did have to do a pip install -e . in pyjoulescope. Does this look right?

*****************************
* Joulescope User Interface *
* UI Version = 0.8.8-dev    *
* Driver Version = 0.8.8    *
*****************************

info = {
  "joulescope": {
    "ui_version": "0.8.8-dev",
    "driver_version": "0.8.8"
  },
  "platform": {
    "name": "darwin",
    "python_version": "3.7.6 (default, Dec 30 2019, 19:38:26) \n[Clang 11.0.0 (clang-1100.0.33.16)]",
    "platform": "Darwin-19.3.0-x86_64-i386-64bit",
    "processor": "i386",
    "executable": "/Volumes/GitRepos/pyjoulescope_ui/.venv/bin/python3",
    "frozen": false,
    "paths": {
      "dirs": {
        "app_path": "/Users/tannewt/Library/Application Support/joulescope",
        "config": "/Users/tannewt/Library/Application Support/joulescope/config",
        "log": "/Users/tannewt/Library/Application Support/joulescope/log",
        "firmware": "/Users/tannewt/Library/Application Support/joulescope/firmware",
        "data": "/Users/tannewt/Documents/joulescope"
      },
      "files": {
        "config": "/Users/tannewt/Library/Application Support/joulescope/config/joulescope_config.json"
      }
    }
  }
}

=====

I'm happy to use a dev version while I hack on things.

mliberty1 commented 4 years ago

Yes, that ui_version and driver_version look right. 🤞

tannewt commented 4 years ago

I haven't had a full on crash yet! I have been getting these Python errors on the command line though:

WARNING:2020-03-19 13:47:39,138:device.py:820:joulescope.usb.libusb.device:Could not shut down gracefully
ERROR:2020-03-19 13:47:40,834:main.py:504:joulescope_ui.main:statusUpdateTimer failed - assume device error
Traceback (most recent call last):
  File "/Volumes/GitRepos/pyjoulescope_ui/joulescope_ui/main.py", line 498, in on_statusUpdateTimer
    s = self._device.status()
  File "/Volumes/GitRepos/pyjoulescope/joulescope/driver.py", line 911, in status
    status['driver'] = self._status()
  File "/Volumes/GitRepos/pyjoulescope/joulescope/driver.py", line 853, in _status
    value=0, index=0, length=STATUS_REQUEST_LENGTH)
  File "/Volumes/GitRepos/pyjoulescope/joulescope/usb/device_thread.py", line 218, in control_transfer_in
    return self._post_block('control_transfer_in', (args, kwargs))
  File "/Volumes/GitRepos/pyjoulescope/joulescope/usb/device_thread.py", line 176, in _post_block
    raise IOError(rv)
OSError
ERROR:2020-03-19 13:47:40,847:driver.py:534:joulescope.driver:USB stop failed
Traceback (most recent call last):
  File "/Volumes/GitRepos/pyjoulescope_ui/joulescope_ui/main.py", line 498, in on_statusUpdateTimer
    s = self._device.status()
  File "/Volumes/GitRepos/pyjoulescope/joulescope/driver.py", line 911, in status
    status['driver'] = self._status()
  File "/Volumes/GitRepos/pyjoulescope/joulescope/driver.py", line 853, in _status
    value=0, index=0, length=STATUS_REQUEST_LENGTH)
  File "/Volumes/GitRepos/pyjoulescope/joulescope/usb/device_thread.py", line 218, in control_transfer_in
    return self._post_block('control_transfer_in', (args, kwargs))
  File "/Volumes/GitRepos/pyjoulescope/joulescope/usb/device_thread.py", line 176, in _post_block
    raise IOError(rv)
OSError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Volumes/GitRepos/pyjoulescope/joulescope/driver.py", line 532, in close
    self.stop()
  File "/Volumes/GitRepos/pyjoulescope/joulescope/driver.py", line 745, in stop
    self._usb.read_stream_stop(2)
  File "/Volumes/GitRepos/pyjoulescope/joulescope/usb/device_thread.py", line 224, in read_stream_stop
    return self._post_block('read_stream_stop', (args, kwargs))
  File "/Volumes/GitRepos/pyjoulescope/joulescope/usb/device_thread.py", line 165, in _post_block
    raise IOError('DeviceThread not running')
OSError: DeviceThread not running

If I disconnect the misbehaving USB device then Joulescope resumes just fine.

tannewt commented 4 years ago

It does look like power is lost to the device connected to the Joulescope while it restarts. This is a bit annoying because it breaks the JLink's debug connection to the MCU.

mliberty1 commented 4 years ago

Whatever your USB device is doing, it seems to be preventing USB communication. The Joulescope UI uses USB Control transfers to get the instrument status periodically. Those control transfers should never fail in a working system, but they are. Great to hear that the Joulescope software now gracefully handles this case.

Powering down the sensor or resetting the sensor-side microcontroller does turn off all current shunt resistors, which breaks the IN+ to OUT+ connection. I see two possible cases:

  1. The Joulescope instrument continues to function, but the recovery mechanism causes a reset. I will investigate what I expect to happen. If this is the case, then the software can likely try to recover first before resorting to the cautious "full reset" approach.
  2. The Joulescope instrument loses the USB connection or detects a USB suspend condition (loss of USB start-of-frame for 3 milliseconds), it must power down the sensor to meet the USB power specification. Not much to do here except violate the USB spec. I could add an "always on" mode that keeps the sensor on once the device enumerates ignoring USB suspend and USB reset.

The existing Joulescope UI and Joulescope firmware do not have enough statistics to determine whether (1) or (2) is happening. Fixing this correctly will likely take some time. Do you have any sense of when I would need to have a fix to be useful to you?

I have never seen USB fail like this. Out of curiosity, you wouldn't happen to have a TotalPhase Beagle USB analyzer (or equivalent) to capture the activity, would you? Alternatively, I have not been able to accurately recreate what you are seeing. If I can, then I would be happy to capture the USB activity for you with my Beagle USB analyzer.

tannewt commented 4 years ago

I really appreciate all of your help. The crash prevention changes make it much better. I can live with the JLink reconnect so don't spend too much time on it (and don't violate the USB spec.)

I do have a Beagle but it's only a 12 so I won't be able to capture high speed traffic. (Though I do have a Rhodondendron for a GreatFET that I haven't gotten setup.)

mliberty1 commented 4 years ago

You're welcome, and thanks for the detailed info and your time to help me improve Joulescope!

I currently have both the Beagle 480 (high speed) and 5000 v2 (superspeed). I would be happy to capture and share the USB traffic if it would be helpful. I am curious to see what can cause this behavior on USB, and potentially make Joulescope even better at the same time. I would probably need to replicate what you are doing more carefully since what we did quickly did not duplicate what you are seeing.

tannewt commented 4 years ago

Fancy! I'm currently working on low power in CircuitPython and messing up the USB servicing can cause trouble. My setup is a JLink connected to SWD on a CircuitPython board. The board itself is powered through my custom USB faceplate for the Joulescope. It is similar to yours but has a switch and IC to disconnect the data lines of the USB. I use this because breaking in GDB disrupts USB comms and hoses the bus. The upstream on the faceplate is connected to a USB3 capable hub which also has the JLink and Joulescope plugged into it.

I think you can replicate the USB trouble by halting CircuitPython with GDB. I also see it hiccup as I turn the data lines on. (My debugging setup is documented here: https://learn.adafruit.com/debugging-the-samd21-with-gdb ) Do you have any board with swd that might work with CircuitPython?

mliberty1 commented 4 years ago

I have the Metro M4 Express with the 10-pin SWD connector and a J-Link. That document mentions the SAMD21 and Metro M0 Express. Is that the target processor/board are you currently using?

tannewt commented 4 years ago

I'm actually on the STM32F405 now but I replicated with the SAMD51 (Feather M4) too. The Metro M4 should work with the same instructions. The only change is the JLink command. JLinkGDBServer -device ATSAMD51J19 -if SWD. CircuitPython uses the TinyUSB USB stack for all but Spresense so the behavior is similar on all the boards.

mliberty1 commented 4 years ago

Ok, I will try with the SAMD51 and either Feather M4 or Metro M4 Express over the weekend. I also have some of the black-pill STM32F405 boards, so I could try that, too. I'll post here with what I find.

tannewt commented 4 years ago

Cool! Let me know how it goes!

mliberty1 commented 4 years ago

Summary So far, things are working great and I have not been able to duplicate what you are seeing. I am probably missing something that you are doing in your setup.

Notes

cd pyjoulescope
git pull
python3 setup.py build_ext --inplace
export PYTHONPATH=`pwd`:`pwd`_ui
cd ../pyjoulescope_ui
git pull
python3 setup.py qt
python3 -m joulescope

Here is what the TotalPhase Beagle Analyzer sees on Metro M4 express debugger pause: image

Conclusion

So far, I haven't been able to replicate what you seeing. Do you do anything else to interact with CircuitPython? Here are my known differences from your setup:

mliberty1 commented 4 years ago

Let's get debugging on the Mac. After this step, the only expected difference will be the USB front panel.

I followed your instructions to use JLinkGDBServer/gdb on Mac. I also had to go to System Preferences → Security & Privacy → Developer Tools and enable iTerm, which I use for the terminal.

Terminal 1:

JLinkGDBServer -if SWD -device ATSAMD51J19

Terminal 2:

arm-none-eabi-gdb-py
target extended-remote :2331
monitor reset

To toggle the run status, repeat

continue
CTRL-C

Joulescope keeps working, and the JLinkGDBServer + gdb stay connected.

tannewt commented 4 years ago

Thanks for being so thorough! How are your USB devices connected? I have all three connect through a single USB hub which then goes back to a USB C dock. Only one USB C cable is connected to my Macbook Pro and it's also carrying my monitor video and power.

Don't spend much more time on this. It sounds like a weird thing to me. The one thing I can think of is how long you are leaving CircuitPython paused. I think the power blip may be due to a timeout in Joulescope.

mliberty1 commented 4 years ago

I originally had the Joulescope and Metro M4 express connected to one USB hub and the J-Link connected to a different USB-C port. However, I just redid the test with the Joulescope, Metro M4 Express and J-Link all connected through an Amazon USB 2.0 4-port ultra mini hub. The hub connects to the MacBook using a nonda USB-C adapter. Here's a picture:

circuitpython_setup

I can continue / CTRL-C many times without issue, and I left the SAMD51 paused for minutes without a problem.

This issue is a weird one. One possible issue is your USB hub. Is it powered? If not, how many devices are plugged into it? Do you have another hub you could try?

Since the debugger problem does not currently seem like a critical issue for you, I am going to put it on hold. I am also going to close this issue since we solved the original crash. However, please open a new issue about the debugger if it becomes more critical to you or you notice any other patterns!

tannewt commented 4 years ago

My USB HUB is powered. It is a Plugable USB 3.0 7-port hub. Totally fine with closing it. If it bugs me enough I'll simplify my setup to make it easier to test. Thanks for looking into it!