magico13 / PyEmVue

Python Library for the Emporia Vue Energy Monitor
MIT License
185 stars 36 forks source link

Frequent "None" usage #39

Open webdeck opened 1 year ago

webdeck commented 1 year ago

I have three different types of Emporia devices in my house, all on the same account:

I poll the devices once per minute and log the data.

The usage data for the Gen 2 Vue circuits is reported accurately.

However, for both the Utility Connect and the Smart Plugs, the usage data gets reported as None frequently - here is an example:

Screen Shot 2022-09-07 at 3 34 32 PM

And here is the channel object compared for a valid and invalid result:

instance(VueDeviceChannelUsage):
  channel_multiplier: 1.0,
  channel_num: '1,2,3',
  channel_type_gid: 0,
  device_gid: 72622,
  name: 'Main',
  nested_devices: {
  },
  percentage: 5.7463256452084455,
  timestamp: datetime.datetime(2022, 9, 7, 22, 31, 5, tzinfo=tzutc()),
  usage: 0.0027398242786885723

instance(VueDeviceChannelUsage):
  channel_multiplier: 1.0,
  channel_num: '1,2,3',
  channel_type_gid: 0,
  device_gid: 72622,
  name: 'Main',
  nested_devices: {
  },
  percentage: None,
  timestamp: datetime.datetime(2022, 9, 7, 22, 32, 5, tzinfo=tzutc()),
  usage: None

Based on my historical graphs, it looks like this started happening on August 11th, around 9am Pacific Daylight Time.

I upgraded from 0.16.0 to 0.16.1 and it didn't fix it.

magico13 commented 1 year ago

The same problem is plaguing users of the Home Assistant integration as well, though this lower level report is extremely helpful. I'm pretty sure that's what the API is returning. You may be able to repoll and get a correct value but I haven't tested if that will actually help yet. The app does this too sometimes so I'm pretty sure it's an underlying API issue.

webdeck commented 1 year ago

So I altered my code to ignore "None" results (perhaps I'll add retries later), and that at least smooths out the data for the Smart Plugs.

However, the Utility Connect is still reporting inconsistent values. Here is a comparison of the data returned by the Sense I have installed, and the Utility Connect. Note that I have solar panels, which is why it goes negative during the day. You can see how the Sense data is smooth, but Utility Connect bounces up and down minute to minute:

Sense: sense

Vue: vue

webdeck commented 1 year ago

I was able to pinpoint when the Utility Connect started reporting this bad data - August 11, 2022. Here are the graphs for August 10th and August 11th where you can see the change:

graph php-3

graph php-4

webdeck commented 1 year ago

Well, I don't know what's going on, but here is some more data. At fifteen seconds after each minute, I call get_device_list_usage for Scale.MINUTE.value for a time that is the same minute at zero seconds. I print that out, and print out the kW value of channel.usage * 60000.0

Then I call get_chart_usage for the previous minute with Scale.SECOND.value and look at the returned array of values.

Now, when the data is good, I get 60 values back in that array, and if I sum them up, the result is pretty close to the value I got for get_device_list usage. Here's an example output:

get_device_list_usage: 0.027968704656530306 -> 1678 kW get_chart_usage: num=60, sum=0.0279822235276964 -> 1679 kW

Now when the data goes bad, it gets interesting. I get a bad value from get_device_list_usage, and then I get fewer than 60 results from get_chart_usage. If I sum the get_chart_usage and then scale it up to be equivalent to 60 values, then I get the right value:

get_device_list_usage: 0.01881660398338903 -> 1129 kW (WRONG) get_chart_usage: num=11, sum=0.004333611199273004 -> 260 kw, scaled=0.02363787926876184 -> 1418 kw (RIGHT)

And the get_chart_usage data can also go bad, where it has a number of zeros instead of real values. If I remove the zeros and scale back up to 60 values, then I get the right value:

get_device_list_usage: 0.02685500121166088 -> 1611 kW get_chart_usage: num=60, sum=0.02040777918158638 -> 1224 kW, zeros=15, scaled=0.02721037224211517 -> 1633 kW

So, I don't know what is going on with my Vue Utility Connect. Both the get_device_list_usage and the get_chart_usage calls are returning invalid data, with chart data missing or replaced with zeros, and the device_list_usage being wrong whenever get_chart_usage returns missing data. Emporia support had me re-provision my Utility Connect with PG&E, but that didn't fix anything.

Is anyone else seeing this, or is it just me?

magico13 commented 1 year ago

I've had all sorts of bad data from the API on the Home Assistant side and there are a number of people reporting worse experiences than I have personally. I'm led to believe it's something with their backend and loading/retrieving values just sometimes being delayed. I'm guessing all the values from the device are thrown into a queue and loaded into the database as fast as the queue can process but sometimes it can run a bit slow and you can see a minute or more of delays, so when you query for that data it has defaulted values (None or 0) until it can catch up and put the correct values in there.

I've seen this with their app, where it will report 0 for the live 1S data on both the overview page and on the charts but looking at the historical chart data it's properly populated a few seconds in the past. I haven't had the opportunity to investigate the app when the delay is larger, but I also don't pull more fine-grained data than the last 1-minute usage personally.

webdeck commented 1 year ago

So here is the code I'm using to get a more accurate value:

  def getMeterWatts(self, channel):
    now = datetime.utcnow();
    end = datetime(now.year, now.month, now.day, now.hour, now.minute, 0, 0)
    start = end - timedelta(minutes = 1)
    end = end - timedelta(seconds = 1)

    try:
      kws, ts = self.vue.get_chart_usage(channel, start, end, Scale.SECOND.value)
    except Exception as e:
      print(e)
      return None

    total = sum(kws)
    zeros = 0
    for kw in kws:
      if (kw == 0):
        zeros = zeros + 1

    return total * 60 / (len(kws) - zeros)
DabblerIOT commented 4 months ago

I closed issue 61 (I opened a duplicate). I am a new user and ran into this again. And think I can suggest a fix for the main meter on a vue2 having the periodic 0 values recorded. In short when the emporia api responds to the getDevicesListUsages command, it will respond with an array of json length 1 with all the values 0. Typically the array of json objects has to match the length of meters requested or configured or the response is an api error and not valid.

Will test more locally as in my ha setup is having this issue regularly and I have seen the response messages using another tool that uses emporia api. Screenshot 2024-02-10 at 7 06 47 PM

DabblerIOT commented 4 months ago

Ok. I have performed more testing. And this morning caught a "bad" api response with some test code. (Don't have a good ha / python dev setup). Same request... response to DeviceListUsages called on a minute basis. (Had to run it almost 24 hours yesterday to catch one bad message, happened around 630 am eastern time). goodapiresponse.txt badapiresponse.txt

Second thing. I am not a python expert (read as I am not good), but I added a very hacky / improper error check into a fork to test. So far my meter for last 24 hours has been clean. I know what I did isn't right, but it dumps the bad message instead of getting a zero in the log, and will dump correct messages for plugs and other things probably. Since I did this my meters into HA have been clean (no zero readings... only 24 hours run so far, used to get many a day. I think where you see data gaps (around 6 and around 4 in meter 1, are probably the code working. Unsure.) Screenshot 2024-02-14 at 2 19 50 PM

I added this to my pyemvue fork for a quick test... around line 104. """Sanity Check, number channels returned = number channels configured""" if (len(device["channelUsages"]) > 1): populated = VueUsageDevice(timestamp=timestamp).from_json_dictionary(device) devices[populated.device_gid] = populated

I really wanted .. if Len(channel usages) == len(requested channels from device data) and requested data of first meter != null... but didn't think there was a global holding the config data of in pyemvue and didn't want to mess with ha code for quick test or make another rest call every time to sanity check the api return, which I'm pretty sure is caching the device data.

I think it would be nice if main developer could add a proper check either in pyemvue (where it could retry the api command with a retry limit) and then pass valid data back to ha (rather not see data gap. During certain times of day I've seen having to make the same request as many as 5 times. But some of that could be because of additional issues with other 3rd party api I found first.

Neither here nor there, but I removed all my plugs from my emporia account today. It seems frequency of "bad api" returns are more with the more devices you pass into the getDeviceUsages api call.

Neither here nor there, was thinking... is everyone (including emporia) hitting their api server on the change of minute? If so could we help things out and hopefully get a less percentage of bad messages by requesting 30 seconds into the minute instead? (It may already do this) We being the royal we of course. :)

Cheers. Hope this helps get issue properly resolved. I'll keep my test running and monitor it to see if any other "bad response" message cases exist. (javascript doing api request through a proxy to get rid of cors).

Sorry for getting my commit tagged above, put comment with error number in fork and git has alot of magic it does. :)

DabblerIOT commented 4 months ago

If any emporia smart guys are here... here is the headers from the message that failed.

Would be nice if.. 1) You could check your api server and see what is going on that it is returning malformed messages. (Assume call to get data is crashing, or that part of your service is running out of threads or something, crashing then just returning the stubbed out response as valid) 2) If you could make. your api server accept '*' instead of 'emporia-energy.com' that would be helpful. It really doesn't protect you much enabling cors like that, and it causes a pita for outsiders that want to request your api data from javascript. Ex: Localhost:3000 is my no cors proxy server.

Screenshot 2024-02-14 at 2 53 31 PM

Thanks. (Worth a try)

DabblerIOT commented 4 months ago

24 hours... data filter looks like good... I want an email every time it blocks... :). Any easy way in python / ha?

Screenshot 2024-02-14 at 7 55 15 PM

Will add retries to my test code and work on better logic for bad packet until better developer can help. (don't care about my ha data yet..). Date gaps and having "unknown" in logs really bothers me.

Test to see if any other fail cases exist running as well. Have 6 plugs on desk, will plug in around and make new account and see if plugs are more problematic.. 6 devices...

Once again thanks!

DabblerIOT commented 4 months ago

Sorry all, ocd is killing me... added some ugly retry code.. running now. Hopefully works. Would like to sit with python developer and see why && and ++ were too good for him. Lol.

Not really helpful to pound a struggling server 10 times for valid data when its already struggling, but wtf... I tried. Each instance of their app is setting their server up for a stress test.

I think if usage for first meter in response is null is valid check for all messages which I can implement without help, but for this test how it is will do what it needs to do. Will cause alot of retries of anyone tries with plugs or single channel devices on their account.

9:10 pm eastern set hopefully no zeros or data gaps :). Will update tomorrow this time.

    retries = 0
    success = False

    while ((retries <= 10) and (success == False)):
        response = self.auth.request('get', url)
        response.raise_for_status()
        devices: dict[int, VueUsageDevice] = {}
        if response.text:
            j = response.json()            
            if 'deviceListUsages' in j and 'devices' in j['deviceListUsages']:
                timestamp = parse(j['deviceListUsages']['instant'])
                for device in j['deviceListUsages']['devices']:
                    """Sanity Check, number channels returned = number channels configured"""
                    if (len(device["channelUsages"]) > 1):
                        populated = VueUsageDevice(timestamp=timestamp).from_json_dictionary(device)
                        devices[populated.device_gid] = populated
                        success = True
                    else:
                        success = False
                        retries +=1
DabblerIOT commented 4 months ago

Fix looks good...6 hours run. Changed code to detect "null" in json, which gets changed to None by json parser in python. Should work for plugs and single channel devices as well.

    retries = 0
    success = False

    while ((retries <= 10) and (success == False)):
        response = self.auth.request('get', url)
        response.raise_for_status()
        devices: dict[int, VueUsageDevice] = {}
        if response.text:
            j = response.json()            
            if 'deviceListUsages' in j and 'devices' in j['deviceListUsages']:
                timestamp = parse(j['deviceListUsages']['instant'])
                for device in j['deviceListUsages']['devices']:
                    """Sanity Check, first channel = null = None after parse in python"""
                    if (j['deviceListUsages']['devices'][0]['channelUsages'][0]['usage'] is not None):
                        populated = VueUsageDevice(timestamp=timestamp).from_json_dictionary(device)
                        devices[populated.device_gid] = populated
                        success = True
                    else:
                        success = False
                        retries +=1

Screenshot 2024-02-15 at 6 15 48 AM

DabblerIOT commented 4 months ago

If any wants to try replace your manifest.json requirements line with..

"requirements": ["git+https://github.com/DabblerIOT/PyEmVue"],

Will put in pull request in a couple days. I see many issues logged under ha-emporia-vue that this would address.

Moving on to see about why the time stamps and timing of updates is so wonky. (Want to see for myself). Thinking about what I would "want" to see given the issues.

Cheers. Sorry to spam everyone. Just glad to have clean data being logged now.

DabblerIOT commented 4 months ago

Caught api server responding with http error 500 for a couple short periods today. Looking into now. Data shows up as "unavailable" in log when this happens, and there is a gap in histories. Different issue than issue with 0 and None values in log. Looking into now.

Short being 10-20 seconds. Ex: from 16:37:23 to 16:37:50 or so GMT today. Looking at code and thinking about adding some retries to this as well. Sigh.

Caught with both pyemvue and my other api test code that was still running.

DabblerIOT commented 4 months ago

Running with retry code in for failed http responses... (401 unauthorized is handled in request logic... which is what happens when token expires, and relogin needs to be done). So only 200 is processed, anything else gets a short delay (assume this is on its own thread because it's all sync). Will probably change it to sleep(5) or accomplish delay another, better way (may need help). Hopefully api doesn't have more failure cases. Will monitor it for a couple more days and leave my test code running (easier for me to see log data and response particulars in javascript console.)

Code is this now.. may need to wrap all requests in some kind of retry mechanism... but from other work I've only ever seen this one function of the api fail multiple times a day.

    retries = 0
    success = False

    while ((retries <= 10) and (success == False)):       
        #Is this correct way to add a delay, blocks thread?   ha python code on its own thread? -DabblerIOT
        if (retries > 0):
            time.sleep(2)

        response = self.auth.request('get', url)
        response.raise_for_status()
        devices: dict[int, VueUsageDevice] = {}
        """Only process response, of 200 (success) response, server fails with 500s (internal server error) from time to time"""
        if response.status_code == 200:
            if response.text:
                j = response.json()            
                if 'deviceListUsages' in j and 'devices' in j['deviceListUsages']:
                    timestamp = parse(j['deviceListUsages']['instant'])
                    for device in j['deviceListUsages']['devices']:
                        """Sanity Check, first channel = null = None after parse in python"""
                        if (j['deviceListUsages']['devices'][0]['channelUsages'][0]['usage'] is not None):
                            populated = VueUsageDevice(timestamp=timestamp).from_json_dictionary(device)
                            devices[populated.device_gid] = populated
                            success = True
                        else:
                            success = False
                            retries +=1
        else:
            success = False
            retries += 1
DabblerIOT commented 4 months ago

Good test data after midnight, above code was generating exception a 500, moved generating an exception on a http error response until after all the retries, if it gives up on an error code. Running since late yesterday. Data good now, no None, No Unknown, no holes. Starting to track down why not consistent one minute data, because the retry logic should only be ever now and again, and in typical fashion should only be 1 retry to get valid data. I added a bunch of integrations, and my data is only ever 2 seconds now. Maybe not enough resources on my vm? Who knows. :). Data attached. Also added retry if we get a blank response !response.text:

Screenshot 2024-02-16 at 6 01 33 AM history-12.csv

Cheers!

DabblerIOT commented 4 months ago

139 - in ha project is related with odd frequency of requests, and tracking. I think it's in the ha scheduler somehow.

Data still good. Think right now have the failure cases and retries sorted out right so data if no internet issues is good. Data for today (one missing data point was a router reboot here -- which is what it should do).

I'll give it until Monday and keep watching it while working on and off on history timing issue, need to learn how integration works with ha more. Pretty sure it isn't in this project. It is pretty simple.

Screenshot 2024-02-16 at 11 28 42 PM

Will post hopefully a clean weekend of readings on Monday before putting in pull request.

As always cheers!!

DabblerIOT commented 4 months ago

Data still good (almost 2 days).

I have seen some occasional (once every now and then). Errors where sensor value was none back to ha code and ha code replaces with last value.

Not sure how this is happening. Sorting out how to log and see log from a python module in ha.. the struggle is real. Still sorting out how ha fits together all the way with the docker containers and all that. Sounds like a good idea but makes it hard to work on for a newbie. Thinking maybe my code that detects a 500 error isn't right.. but then even if it didn't detect a 500 error I'm not sure how a None value could get back with error check code to ha... (shouldn't be possible). Will continue to sort out. May take me longer. (sigh). Getting frustrated, so will probably code my dashboards next and come back to this fresh in a week.

Neither here nor there. The update daily sensors needs to be called sync with the update 1 minute sensors function... what happens now is that while the 1min update update is waiting for api response, the add data to daily async hits and does the update with minute old data. Probably not a big deal (ie: it works). But doesn't really do what you would expect. my ha module is hacked up now in place.. lol. Guess I should fork it and put changes in a pull request.

Cheers.

DabblerIOT commented 4 months ago

I believe that a device other than the first was having a problem getting data from their backend, and returning "null". Changed code to properly check and deal with devices other than the first one returning 0, null, None. Couple hours run so far without unexpected values being passed back to ha. I "think" that if no data comes back, after all the retries there will be an "unknown" logged and a data "gap" in the historical data. (Haven't seen this yet... this would be if the api is down completely.)

Cheers. Hopefully that is it. And there is not another dumb ass way the device api will fail. (Not much hair left!)

DabblerIOT commented 4 months ago

Changes look good so far. Need to verify data when I think api servers are under more load and. unhappy. I think mornings us/eastern time are worst for errors (backups maybe? Or load from other places...). api other than some very odd responses this morning has been very stable. Think all the retries and logic will get us clean data now. :). I have my life back. Lol.

Cheers.

DabblerIOT commented 4 months ago

Error check code didn't catch first device bad, second device good. Fixed (was missing break).

Caught second device bad error on my test code this morning. About 10am eastern.

badapiresponseseconddevicenull.txt

DabblerIOT commented 4 months ago

Good data from earlier.. Kind of cool looking graph.. still watching data hopefully no more new failure cases. May be a gap or two in charts, but it was explained from my bug or other cause. :)

Screenshot 2024-02-19 at 7 55 03 PM

DabblerIOT commented 4 months ago

Put pull Request in for code fixes. My fixes may need to be adjusted a little but logic is good and works. It's fixed for me as in I'm running off my changed repo right now.

DabblerIOT commented 4 months ago

Running 0.18.1 now. I'll watch data close this next week. Code looked good to me (pretty sure it will accomplish the goal).

:)

webdeck commented 4 months ago

Running it as well. My only concern is that it is spamming my output with "Attempt 1 to get device usage"

DabblerIOT commented 4 months ago

Screenshot 2024-02-24 at 5 37 33 PM

Clean data for a while... think 3-4 days since I stopped tinkering with it.

DabblerIOT commented 4 months ago

I am clean overnight running 0.18.1.

Thanks magic.. you are going to have quite a few issues to close on your ha project. (think this issue is duplicated there many times)

I'll open an issue on the ha project with a couple things I found with ha project.. (units fix and how minute data is added to daily data).

DabblerIOT commented 4 months ago

Been running 0.18.2 and latest ha release. All good. I'm going to stop checking here for now. Thanks again for integration and hard work.

Cheers.

DabblerIOT commented 4 months ago

latest updates from projects... missing data again. I didn't look too close, but unless the api is failing really bad right now, last changes from 18.1 to 18.2 weren't helpful. I don't like the idea of taking half a valid message and processing it. I think if you don't get a 100% complete valid response back from the api you should discard it. I'll keep running for awhile before falling back.

Screenshot 2024-02-26 at 4 50 21 PM

DabblerIOT commented 4 months ago

One device good, one device bad same minute. If you pass back only one device this is what you will get. :(

Data attached: history-66.csv

Screenshot 2024-02-27 at 12 30 25 PM

DabblerIOT commented 4 months ago

@magico13 you going to revise code or should I fork, sort out a dabbler fix and then pull request? Let me know. Have had more occurrences. Less than before, but still about 1 per day. 18.1 and my fixed fork were 0.

magico13 commented 4 months ago

Sorry my notifications must have changed because I'm only now seeing this. Glad it's improved from before at least but sounds like additional retries for the usage call may be warranted, I don't think my code tries for as long as yours did.

I don't think the changes between 0.18.1 and 0.18.2 would result in much difference, with .1 you'd get a fully empty data set back and it would set everything to None/unavailable, but with .2 only the one(s) still missing after all of the polling attempts will be None. The Home Assistant integration has some code to take the first None it sees for a sensor and just duplicate the previous data, the second None will be kept since the device might be offline. With all the retries built in now I might change that and either throw out all of the data or at least ignore any None data within a partial response. Gotta be careful not accepting partial responses because if someone unplugs a smart outlet (or it fails, like I had happen to one of mine) they don't want their main Vue to suddenly stop reporting.

I likely won't get a chance to do more work on it until sometime this weekend, you're welcome to take a crack at it before then if you want.

DabblerIOT commented 4 months ago

I didn't think of that use case... maybe treat panel meters like first class citizens, always expecting valid data on mains? And plugs second class citizens, where if there is a data issue (switched outlet, wife unplugged meter to turn something off), etc... it won't cause data loss on mains?

Was talking to myself while sorting out. :). Assumed you were just waiting to see pull request. Problems like this are a pita.

I have a stack of plugs I can add to things if we need to test / debug. Have been using them when trying to figure out where usage goes on certain circuits.

Any thought on one request for panel meters with one set of rules, and another request for plugs? May stress api less if we request with less meters (reduce chance of getting missing data in one request by smaller request?)

Cheers. thanks again for your hard work on this.

DabblerIOT commented 3 months ago

@magico13 18.4 retries are good... getting many (say 100 per day) bad responses on plugs, some (4 per day) bad responses on vue2... no down devices... can unplug a plug on my peloton coat rack to test if needed. :)

No data gaps or unknowns on plugs (2 little loads), and vue2s... Just would like to say I walked at gym today, and my wife is recovering from an injury and running an electric heater when icing.. :(. Screenshot 2024-03-07 at 10 01 56 PM