quest-fw / quest-fw

Apache License 2.0

FastRPC clarification - HexRouter vs KraitRouter #9

Open plusk01 opened 5 years ago

plusk01 commented 5 years ago

It seems that KraitRouterImpl.cpp runs on the Hexagon/aDSP and that HexRouterImpl.cpp runs on the Krait/CPU. Is that correct?

Also, I found your QDN forum question about FastRPC and UART. I am hoping to stream high-rate sensor data (MAVLink messages from ROSflight) from the autopilot on the aDSP to the MAVLink<-->ROS parser (think mavros) on the CPU. Have you had any practical (i.e., bandwidth, latency) limitations with running HEXREF and using FastRPC?

My understanding is that using rpcmem_* in combination with the standard FastRPC/IDL method of calling methods on the DSP allows for zero-copy transfers.

genemerewether commented 5 years ago

Yes, that's correct, except for one nomenclature thing: I don't run code on the aDSP (no DSPAL support at this time), but rather on the SLPI of the 8096 (the new Snapdragon Flight board). The SLPI is where Qualcomm runs their flight stack on the 8096.

Are you trying to run ROSflight on the DSP? Or are you trying to use the DSP I/O to talk to a microcontroller running ROSflight? In that Qualcomm ticket that you reference, we were doing something more similar to the latter.

I haven't quantified latency in either direction (Hexagon to Linux, or Linux to Hexagon) using my implementation in Hex/Krait Router. I run both the attitude and position control loops on the Hexagon, so any latency wouldn't affect my phase margins. I haven't tried to run a controller on the Linux side, but you could always send attitude and thrust setpoints into my controller from a position controller in ROS (I created a custom message, mav_msgs::AttitudeRateThrust), running only my attitude loop on the Hexagon; latency in position commands would be less of an issue.

The simplest way for me to get a latency estimate is to use rostopic delay to get the delay of the IMU messages I'm sending - I'll try to do that tonight or tomorrow.

As for bandwidth, I am sending IMU data (acquired at 1kHz, shipping 2 IMU samples per 500Hz control cycle) and odometry (propagated at 500Hz) from Hexagon to Krait without any problems. State and IMU biases come back from Krait to Hexagon at 30 Hz (camera rate, so measurement update rate).

I compensate for latency in either direction by calculating offset between Hexagon and Krait clocks (I wrote a kernel module to read a special register with that value and map it to a file), and correcting timestamps as appropriate. The IMU propagator can handle delayed state updates by buffering IMU samples.

This post has some good info from the Qualcomm folks about cache flushing - I don't know how FastRPC cache synchronization is implemented / differs when you use rpcmem_alloc, but my working assumption is that performance will only get better when using those specialized alloc functions.

Here are some more Qualcomm forum posts that don't have full answers, but that, at least to me, paint a consistent story:

https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/hlos/29823
https://developer.qualcomm.com/forum/qdn-forums/software/hexagon-dsp-sdk/hlos/35087
https://developer.qualcomm.com/forum/qdn-forums/maximize-hardware/multimedia-optimization-hexagon-sdk/computer-vision/27324

plusk01 commented 5 years ago

These are immensely useful insights---thanks.

My plan is to run ROSflight (attitude control/estimation and low-level I/O) on the DSP and communicate via FastRPC with the CPU/ROS side, where the position loop would live. That way, whether we are using a Snapdragon or a companion computer/microcontroller, the software architecture is essentially the same.

For Hexagon<-->Krait clock compensation, I'm curious why you decided to go with the kernel module. It's my understanding that each time you read the sysfs device, the walltime_dsp_diff_show method is called and the offset is recalculated. I had assumed that a one-time calibration would suffice, which is what the Pixhawk does. Are there any pros/cons that stick out to you between the two methods? I like the idea of having the offset calculation code in one place that can be used by any code base without having to link against a library, as is the case with the .ko.

And thanks for the links!

genemerewether commented 5 years ago

The kernel module is standalone - it is just in the repo for convenience. I've always built it using the Yocto / bitbake tooling from CodeAurora; I found that much easier than trying to set up the kernel headers for cross-compiling the kernel module myself. You can use my kernel module for what you want to do without having to link against anything - just install the kernel module with insmod and then read the file.

I did the calculation in the kernel module so that I could lock out interrupts, avoiding a context switch in the (admittedly small) time / number of instructions between reading the file and getting the time. This method also eliminates any effect of file-read latency on the calculated offset. I was planning to also add a file with the CLOCK_MONOTONIC <-> CLOCK_REALTIME offset, as I use that in the camera components, and they are still vulnerable to such context switching at the moment.

I repeatedly get the offset in case something (like chrony / ntp) is slewing or stepping the Linux clock. That slew / step would invalidate an offset determined only once.

I haven't used ROSflight myself, but issues like the following make me wonder why you want to use it for low-level estimation/control/I/O on the Snapdragon when you have to implement support yourself. Is it so you can keep tuning parameters and device support the same as in some other setup you have? That seems like a stronger argument than software architecture, except in the most general sense (a high-level computer connected to a low-level one), with split-level estimation, control, etc.

https://github.com/rosflight/firmware/issues/289

The attitude filter / position propagator in this flight stack is modified from: https://www.mdpi.com/1424-8220/15/8/19302/pdf

And the controller is the standard nonlinear backstepping one for multirotors: https://deepblue.lib.umich.edu/handle/2027.42/97267

What interface does your position controller expect on the Linux side? I'm happy to take a look at what it would take to interface my flight stack to your position controller... It'll just be a custom ROS message type with attitude, angular velocity, and thrust setpoints.

Side note: we're supporting the TI TMS570 microcontroller with all the same code as above, just interfacing over UART: http://www.ti.com/tool/LAUNCHXL2-570LC43 I'd need to implement more precise PWM on the nHET peripheral, but that's the only piece missing to fly a multirotor with it.

genemerewether commented 5 years ago

Here are some latency stats with a non-RT priority subscriber:

In this case, I'm reading the IMU data at 50 Hz from Linux, using BlspSpiDriver to call the dsp_relay_ functions over FastRPC.

/data/eng/load # rostopic delay /imu_0
...
average delay: 0.002
    min: 0.001s max: 0.010s std dev: 0.00107s window: 1012
average delay: 0.002
    min: 0.001s max: 0.010s std dev: 0.00107s window: 1062
average delay: 0.002
    min: 0.001s max: 0.010s std dev: 0.00109s window: 1106

/data/eng/load # rostopic hz /imu_0
subscribed to [/imu_0]
average rate: 50.600
    min: 0.012s max: 0.027s std dev: 0.00158s window: 50
average rate: 50.595
    min: 0.012s max: 0.027s std dev: 0.00147s window: 101
average rate: 50.594
    min: 0.012s max: 0.027s std dev: 0.00143s window: 151
average rate: 50.581
    min: 0.012s max: 0.027s std dev: 0.00134s window: 174

This is from the scenario I described above, where I run attitude (and position) control on the DSP, and pipe the IMU data across to Linux and to a filter.

/data/eng/load # rostopic delay /imu_0
...
average delay: 0.020
    min: 0.005s max: 0.073s std dev: 0.00969s window: 24695
average delay: 0.020
    min: 0.004s max: 0.073s std dev: 0.00965s window: 25745
average delay: 0.020
    min: 0.004s max: 0.073s std dev: 0.00976s window: 26787

/data/eng/load # rostopic hz /imu_0
subscribed to [/imu_0]
average rate: 1031.606
    min: 0.000s max: 0.038s std dev: 0.00399s window: 976
average rate: 1023.100
    min: 0.000s max: 0.038s std dev: 0.00386s window: 2009
average rate: 1020.751
    min: 0.000s max: 0.039s std dev: 0.00406s window: 3024
average rate: 1018.731
    min: 0.000s max: 0.039s std dev: 0.00417s window: 3702

genemerewether commented 5 years ago

I took another measurement to get more detailed stats. I buffer everything I send from the DSP to Linux in order to reduce the number of FastRPC calls, and I avoid spinning by blocking in my retrieve call until the transfer buffer is full. This could easily be reconfigured if you would rather return immediately once some lesser transfer threshold is met (while the buffer still isn't full). Anyway, here are IMU latency measurements in microseconds. This matches the rostopic delay measurements above, and is consistent with the size of the transfer buffer and the rate at which I fill it.

imu latency 20051 usec
imu latency 19289 usec
imu latency 18478 usec
imu latency 17663 usec
imu latency 16817 usec
imu latency 16003 usec
imu latency 15186 usec
imu latency 14338 usec
imu latency 13520 usec
imu latency 12673 usec
imu latency 11857 usec
imu latency 11039 usec
imu latency 10192 usec
imu latency 9374 usec
imu latency 8557 usec
imu latency 7709 usec
imu latency 6891 usec
imu latency 6074 usec
imu latency 5226 usec

imu latency 19374 usec
imu latency 18645 usec
imu latency 17805 usec
imu latency 16989 usec
imu latency 16141 usec
imu latency 15325 usec
imu latency 14508 usec
imu latency 13663 usec
imu latency 12846 usec
imu latency 12029 usec
imu latency 11181 usec
imu latency 10365 usec
imu latency 9547 usec
imu latency 8701 usec
imu latency 7883 usec
imu latency 7065 usec
imu latency 6217 usec
imu latency 5400 usec