openhab / openhab-core

Core framework of openHAB
https://www.openhab.org/
Eclipse Public License 2.0

Provide Riemann Sum implementation for items #4439

Open dandjo opened 3 weeks ago

dandjo commented 3 weeks ago

Requirement

I want to calculate the energy consumption from a sensor that only provides power data at irregular intervals. This could be, for example, the output of a heat pump, a heat output meter, or any other source of time-discrete values. For this, we need an integral function.

Suggestion

Provide a Riemann sum implementation with different alignments (left, right, mid, trapezoidal). Center/mid alignment is the best compromise between accuracy and implementation complexity.

Riemann Sums

Implementation

I provided a Riemann sum mid implementation in this discussion:

Arithmetic mean vs average linear/step interpolation
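
For reference, a minimal Java sketch of the midpoint-style weighting described there (the `Sample` record and method name are illustrative, not part of the openHAB API): each sample is weighted with half of the gap to its previous neighbour plus half of the gap to its next neighbour, so irregular intervals are handled naturally.

```java
import java.util.List;

record Sample(long epochMilli, double value) {}

class RiemannMidSketch {
    // Midpoint-style Riemann sum for irregularly spaced samples; boundary
    // samples only get their one-sided half interval as weight.
    static double riemannMid(List<Sample> s) {
        double sum = 0.0;
        for (int i = 0; i < s.size(); i++) {
            long left = (i > 0) ? s.get(i).epochMilli() - s.get(i - 1).epochMilli() : 0;
            long right = (i < s.size() - 1) ? s.get(i + 1).epochMilli() - s.get(i).epochMilli() : 0;
            double weightSeconds = (left + right) / 2.0 / 1000.0;
            sum += s.get(i).value() * weightSeconds; // e.g. W * s = Ws
        }
        return sum;
    }
}
```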

Your Environment

runtimeInfo:
  version: 4.3.0
  buildString: "Build #4362"
locale: en-AT
systemInfo:
  configFolder: /etc/openhab
  userdataFolder: /var/lib/openhab
  logFolder: /var/log/openhab
  javaVersion: 17.0.12
  javaVendor: Ubuntu
  osName: Linux
  osVersion: 6.8.0-1013-raspi
  osArchitecture: aarch64
  availableProcessors: 4
  freeMemory: 83704192
  totalMemory: 605028352
  uptime: 29022
  startLevel: 100
addons:
  - automation-jsscripting
  - binding-enigma2
  - binding-http
  - binding-mielecloud
  - binding-modbus
  - binding-mqtt
  - binding-netatmo
  - binding-tado
  - misc-openhabcloud
  - persistence-influxdb
  - persistence-mapdb
  - transformation-jsonpath
  - transformation-regex
  - ui-basic
clientInfo:
  device:
    ios: false
    android: false
    androidChrome: false
    desktop: true
    iphone: false
    ipod: false
    ipad: false
    edge: false
    ie: false
    firefox: false
    macos: false
    windows: false
    cordova: false
    phonegap: false
    electron: false
    nwjs: false
    webView: false
    webview: false
    standalone: false
    pixelRatio: 1
    prefersColorScheme: dark
  isSecureContext: false
  locationbarVisible: true
  menubarVisible: true
  navigator:
    cookieEnabled: true
    deviceMemory: N/A
    hardwareConcurrency: 12
    language: en-AT
    languages:
      - en-AT
      - en-GB
      - en-US
      - en
      - de
    onLine: true
    platform: Linux x86_64
  screen:
    width: 3440
    height: 1440
    colorDepth: 24
  support:
    touch: false
    pointerEvents: true
    observer: true
    passiveListener: true
    gestures: false
    intersectionObserver: true
  themeOptions:
    dark: dark
    filled: true
    pageTransitionAnimation: default
    bars: light
    homeNavbar: default
    homeBackground: default
    expandableCardAnimation: default
    blocklyRenderer: geras
  userAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like
    Gecko) Chrome/130.0.0.0 Safari/537.36
timestamp: 2024-11-04T16:52:32.492Z
openhab-bot commented 3 weeks ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/arithmetic-mean-vs-average-linear-step-interpolation/159892/16

mherwege commented 1 week ago

@dandjo Riemann Sum algorithms will only work properly when the interval between persisted values is constant. We cannot make that assumption. How do you see handling that? Another approximation of the integral value is the calculated average multiplied by the total duration of the considered period. The average actions already do a weighted average (multiplying each value by the duration of the period at whose left boundary it sits). This is equivalent to a left Riemann sum when the interval is constant. A solution could be to enhance the average actions with an optional parameter (left, right, mid, trapezoidal), and then also use that as a basis for the sum actions with an optional parameter. What do you think?
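
To make the equivalence concrete, here is a sketch using the same illustrative `Sample` record as in the issue description (not the actual persistence code): the left-weighted average is exactly the left Riemann sum divided by the total duration, so multiplying the average back by the duration recovers the sum.

```java
class RiemannLeftSketch {
    // Left Riemann sum: each value is held constant until the next sample
    // (or until the end of the considered period for the last sample).
    static double riemannLeft(java.util.List<Sample> s, long endEpochMilli) {
        double sum = 0.0;
        for (int i = 0; i < s.size(); i++) {
            long next = (i < s.size() - 1) ? s.get(i + 1).epochMilli() : endEpochMilli;
            sum += s.get(i).value() * (next - s.get(i).epochMilli()) / 1000.0;
        }
        return sum;
    }

    // The weighted average divides that sum by the total duration, so
    // average * duration is equivalent to the left Riemann sum.
    static double weightedAverage(java.util.List<Sample> s, long endEpochMilli) {
        double durationSeconds = (endEpochMilli - s.get(0).epochMilli()) / 1000.0;
        return riemannLeft(s, endEpochMilli) / durationSeconds;
    }
}
```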

dandjo commented 1 week ago

@mherwege I would disagree. As far as I know, a Riemann sum does not require constant intervals. Of course, the accuracy increases as the intervals get shorter. The Riemann sum Mid in particular best compensates for inconsistent intervals by distributing the weighting to the left and right of the measuring point. The intervals can be quite different and the calculation would still achieve a very accurate approximation. I already use this with a measuring device that only sends me changes when the power value changes. Here, the Riemann sum Mid calculates the same values as the energy measuring device, up to the third decimal place. The average calculation Left does not do this (by far).

I find it a little cumbersome in the API when you have to multiply average values by time to get the integral, especially since calculation and rounding errors can easily creep in here. I think an easily accessible API would make it much easier to use in rules.

mherwege commented 6 days ago

As far as I know, a Riemann sum does not require constant intervals.

Indeed, my oversight. Most textbooks kind of assume a constant interval, but it does not have to be. And the way we calculate an average at the moment does not assume it either. It takes the left value.

I already use this with a measuring device that only sends me changes when the power value changes. Here, the Riemann sum Mid calculates the same values as the energy measuring device, up to the third decimal place. The average calculation Left does not do this (by far).

Is the measuring device you get the values from the same as the energy measuring device? That sounds like a contradiction. Because if you store in persistence at change (depending on the persistence service), I would think the left Riemann sum should give you exactly the same value. Note this will not be true with rrd4j as it is bucketizing, but it would be with influx or any of the JDBC datastores. If there is a sensor that does not report all of its changes, I agree mid Riemann sum would give better results. And it would also on an rrd4j database. I can however imagine cases where left will always give better results, e.g. electricity prices which only change every hour for dynamic tariffs. But they stay constant within the hour. You don't have a continuous function.

rounding errors can easily creep in here

There may always be some, but the internal calculation is done using BigDecimal. So there should not be much loss of precision.

Do you see a use for the right Riemann sum? I see one for left and mid, not so much for right, trapezoidal and Simpson approximations. In general, trapezoidal looks to be less accurate than the mid sum. Simpson is probably not worth the extra calculation overhead.

mherwege commented 6 days ago

@dandjo I am coming back on this. I looked at your implementation (https://community.openhab.org/t/arithmetic-mean-vs-average-linear-step-interpolation/159892/15?u=mherwege) and it is not correct if you don't have constant intervals. What you use as a mid value is not in the middle of the period if the buckets before and after the value do not have the same length. You could argue it is a good enough approximation for sure, but it definitely is not a mid value. And if there is large variation in bucket length (like when only persisting on change), it may be way off. The question then becomes whether a trapezoidal Riemann sum (which does not suffer from this issue) is not better after all. You can treat each bucket independently.

dandjo commented 6 days ago

Is the measuring device you get the values from the same as the energy measuring device?

Yes it is. It's a Nous A1T with Tasmota (calibrated).

That sounds like a contradiction. Because if you store in persistence at change (depending on the persistence service), I would think the left Riemann sum should give you exactly the same value.

Not for my application. I measure the power depending on the position of the 3-way valve on my heat pump. Then I use the Riemann sum to calculate the consumption for either heating, hot water or standby. In addition, there is a whole range of other measurements for which I can only obtain power values anyway. I'm using InfluxDB, by the way.

I agree mid Riemann sum would give better results. And it would also on an rrd4j database. I can however imagine cases where left will always give better results, e.g. electricity prices which only change every hour for dynamic tariffs. But they stay constant within the hour. You don't have a continuous function.

Of course, the right type of sum depends on the data you are measuring. In cases with irregular measurements, the Riemann sum with midpoint simply works most accurately.

Do you see a use for the right Riemann sum? I see one for left and mid, not so much for right, trapezoidal and Simpson approximations

I think left, mid and right are the most common ones to implement. I wouldn't put much value on trapezoidal or Simpson models.

I looked at your implementation (https://community.openhab.org/t/arithmetic-mean-vs-average-linear-step-interpolation/159892/15?u=mherwege) and it is not correct if you don't have constant intervals. What you use as a mid value is not in the middle of the period if the buckets before and after the value do not have the same length.

I take the interval between a measurement point and its previous neighbour, and the interval to its next neighbour, and divide each by 2. This makes it a midpoint approximation. Seen this way, it doesn't really matter how long the time intervals between the measurement points are. The calculation simulates a connection between the points. If the connection between two measurement points is not a line but a complex curve, the accuracy of the approximation decreases. But this is the natural behavior of this Riemann approximation, not an error.

image

dandjo commented 6 days ago

Just compare those two sums.

Left

image

3 + 4 + 3 + 0 = 10

Mid

image

1.75 + 3.75 + 3.75 + 1.75 = 11

The difference is quite large for this simple example. The true integral of the function is about 10.7. The mid Riemann approximation is much more accurate here, even with a relatively coarse measurement spacing.

dandjo commented 6 days ago

I looked at your implementation (https://community.openhab.org/t/arithmetic-mean-vs-average-linear-step-interpolation/159892/15?u=mherwege) and it is not correct if you don't have constant intervals. What you use as a mid value is not in the middle of the period if the buckets before and after the value do not have the same length.

I'll try to show it graphically. In this example, the measurement points are not equally spaced; the time between measurement points is halved from one point to the next. We don't know the value between two measurement points, so we want to approximate it. You're right, of course, this isn't a 100% Riemann midpoint sum; it's more of a generic variant for non-equidistant measurements with a midpoint-style approximation. But I don't see any other mathematical solution. If you have better suggestions, I'd be happy to hear your input. I could test it soon.

image

mherwege commented 5 days ago

I actually think that in the graph you have drawn, trapezoidal would be a better approximation. I don't like the double approximation you do (estimating the interval and estimating the value at that point by approximating it as well). While mathematically it still makes sense, it looks more complex to me than a trapezoidal approximation, which just uses the adjacent points of a bucket and the bucket length, with no further assumptions about the graph in between. I am implementing all of these, but I don't think there is a golden solution here.
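
For comparison, a trapezoidal sketch in the same style as the earlier ones (again with the illustrative `Sample` record): each bucket is treated independently, the value is linearly interpolated between the bucket's two endpoints, and no sample point is 'moved'.

```java
class RiemannTrapezoidalSketch {
    static double riemannTrapezoidal(java.util.List<Sample> s) {
        double sum = 0.0;
        for (int i = 1; i < s.size(); i++) {
            double dtSeconds = (s.get(i).epochMilli() - s.get(i - 1).epochMilli()) / 1000.0;
            // area of the trapezoid spanned by two adjacent samples
            sum += (s.get(i - 1).value() + s.get(i).value()) / 2.0 * dtSeconds;
        }
        return sum;
    }
}
```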

dandjo commented 5 days ago

The graph was just a demonstration. Of course, for such functions you would need much denser measurement points to approximate better.

I don't understand your point. You also have an approximation with Riemann left. With the leftpoint variant, the assumption is even stronger, since the value is assumed to remain constant until the next measurement point. In other words, two assumptions are made there too.

The measurement points could also be quantized differently. For example, you could calculate the mean value of the measurement points and use it to create an area under the curve to the left and right of the resulting fictitious measurement point. But here, too, assumptions are made for value and time. In the end, it's about compensating for the deviations from a linear interpolation between two measurement points using the temporal assumption. Depending on the type of data or the shape of the curve, one or the other works better.

Hence this feature request, in order to be able to offer more as a basis than just a weighted average calculation based on a Riemann leftpoint sum divided by time.

mherwege commented 5 days ago

I am in agreement that left is not the best approximation in many cases, except when you effectively have a step function with a persist on change strategy. In that case it will give an exact result and any other method will give a wrong result.

I have more or less implemented methods for left, right, trapezoidal and midpoint (your version of it) now. Once I have tested it a bit more, I would appreciate you testing it. Be aware it will first be available in DSL rules; changes to the scripting libraries will also be needed to make it available there.

dandjo commented 5 days ago

@mherwege Thanks a lot. I would try to access the raw Java code via JS.

Do you have any idea how the midpoint could be implemented differently?

mherwege commented 5 days ago

Do you have any idea how the midpoint could be implemented differently?

No, not without introducing extra computational complexity. As long as the function is relatively smooth and the interval lengths are not too dramatically different, this should give a reasonable approximation. The deviation from the textbook version is that you assume the value at the midpoint of the calculated interval is the sample value, which is not at the midpoint when the intervals on both sides of the sample are not the same. So you basically 'move' the sample point. It will be interesting to compare it to trapezoidal, where no sample points are 'moved', but a linear interpolation between sample points is used.

While I now have implemented all of these types (left, right, midpoint, trapezoidal), I am not yet convinced all should be in the software. This increases complexity from a software maintenance and a user perspective. My feeling at the moment is we should only allow left (also for backward compatibility with the average methods) and either midpoint or trapezoidal. It would be easier to explain and the choice for an end user would be more straightforward.

dandjo commented 4 days ago

In my opinion, the implicit choice of Left is a little problematic. It produces relatively poor results for fairly common series of measurements. I would be happy if the functions for Left and Mid were available as integrals (and not as averages). The subsequent multiplication by the period is tedious and fundamentally contrary to an intuitive API.

Changing the Average is of course problematic in terms of backwards compatibility. This could possibly be solved with a default parameter. This would make the choice backwards compatible but still clearly documented and visible in use.

mherwege commented 4 days ago

Changing the Average is of course problematic in terms of backwards compatibility. This could possibly be solved with a default parameter. This would make the choice backwards compatible but still clearly documented and visible in use.

That is my intention. Have left as default for backward compatibility and consistency, but as a default parameter which can be changed easily.
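
A hypothetical sketch of that shape (names and signatures are illustrative only, not the actual PersistenceExtensions API): the approximation type becomes an optional trailing parameter whose default preserves today's left-weighted behaviour.

```java
import java.time.ZonedDateTime;

enum RiemannType { LEFT, RIGHT, MIDPOINT, TRAPEZOIDAL }

class PersistenceApiSketch {
    // Existing call sites keep compiling and keep their current semantics.
    double averageSince(String itemName, ZonedDateTime since) {
        return averageSince(itemName, since, RiemannType.LEFT);
    }

    // The new overload makes the chosen approximation explicit.
    double averageSince(String itemName, ZonedDateTime since, RiemannType type) {
        // fetch persisted values, apply the chosen Riemann sum, divide by duration
        throw new UnsupportedOperationException("sketch only");
    }
}
```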

mherwege commented 4 days ago

And here is another design decision request: one needs to multiply the values by time for the Riemann sums, so I need to choose a time unit. While I am doing the internal calculations with millis, I now opt for delivering the result with time unit seconds. My feeling is this is enough granularity. And if the result is a quantity value, it will carry the unit and can easily be converted. Any feedback?

dandjo commented 4 days ago

Hm, but why decrease accuracy? I measure some sensors every second (yes, I really do). Is there an issue with millis or even nanos?

mherwege commented 4 days ago

You multiply large numbers (the intervals) with most likely relatively small numbers, so there is a loss of accuracy (or you add up a lot of small numbers if your intervals are small). Not every hardware architecture properly works with nanos, so that also creates a loss of accuracy. It could be fractional seconds. But working in QuantityType, something like Ws (from persistence in W) easily converts to kWh for total energy produced/consumed. W*millis is more cumbersome to get right in code.
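
For illustration, a minimal sketch of that conversion with the core QuantityType and Units classes (assuming the standard javax.measure unit arithmetic; the PR's actual return type may differ):

```java
import javax.measure.Unit;
import javax.measure.quantity.Energy;
import org.openhab.core.library.types.QuantityType;
import org.openhab.core.library.unit.Units;

public class WsToKwhSketch {
    public static void main(String[] args) {
        // Watt-seconds built from the core unit constants: 2000 W held for 3600 s
        Unit<Energy> wattSecond = Units.WATT.multiply(Units.SECOND).asType(Energy.class);
        QuantityType<Energy> energy = new QuantityType<>(7_200_000, wattSecond);

        // The unit framework converts directly: 7,200,000 Ws == 2 kWh
        System.out.println(energy.toUnit(Units.KILOWATT_HOUR)); // expected: 2 kWh
    }
}
```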

dandjo commented 4 days ago

Ah, you mean the return value of the calculation. I personally would stick to SI. So seconds are a good choice.

mherwege commented 2 days ago

@dandjo If you want to test, in the description of the linked PR I have included a download link for a compiled jar. You will need to replace the persistence bundle by this one through the karaf console in a running system.

dandjo commented 1 day ago

@mherwege LGTM. Thanks.