hashmapinc / Tempus

Hashmap IIoT Accelerator Framework
Apache License 2.0

Investigate root cause of Tempus slowing down while ingesting data from gateway #844

Open akshaymhetre opened 5 years ago

akshaymhetre commented 5 years ago

Investigate the root cause of Tempus slowing down while ingesting high-frequency data objects from the gateway.

niaalex commented 5 years ago

URGENT - This is a VERY high priority. The fix for this impacts the entire effort. Please advise the ETA as soon as you can on the call Monday (it's fine to hold the fix until after the holiday). Thanks in advance.

niaalex commented 5 years ago

Assigned to @SPParnerkar, but this will be managed as a team activity.

niaalex commented 5 years ago

Reminder - High Priority Red jeopardy in progress

SPParnerkar commented 5 years ago

Noted. This item is under close monitoring, with checkpoints every 2 hours. Updates will be shared starting at 12 pm today.

SPParnerkar commented 5 years ago

12 pm:

Moved the Gateway to another node. That node works fine, but the Tempus node crashed - kubelet hangs on the node. At this stage we suspect a memory management issue - too many pods running with limited memory.

Next step: stop the QA cluster, NiFi, and Cassandra to free up node memory and confirm whether memory is the real issue.

SPParnerkar commented 5 years ago

2 pm:

Confirmed - the Tempus node runs out of memory, with CPU utilisation > 99%. Potential causes of the excessive memory utilisation are being investigated - DB latency may be causing data to be buffered inside Tempus, leading to memory overflow. However, the same processing was carried out on a larger data set before April and it ran successfully.

Currently investigating whether the tag frequency changes carried out in April might be causing this. To compute tag frequency, an aggregation query over the last 1 minute is performed for each insert of a ts_kv record.
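
For illustration, a minimal sketch of that pattern (not the actual Tempus code): the table and column names (ts_kv, entity_id, key, ts, dbl_v) and the driver calls are assumptions based on the description above.

```java
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

/**
 * Hypothetical sketch of the per-insert tag-frequency pattern described above:
 * every saved data point also triggers a COUNT aggregation over the last minute.
 */
public class TsKvSaveWithFrequency {

    private final Session session;

    public TsKvSaveWithFrequency(Session session) {
        this.session = session;
    }

    /** Saves one ts_kv record and immediately computes the tag's 1-minute frequency. */
    public long saveAndComputeFrequency(String entityId, String key, long ts, double value) {
        session.execute(
                "INSERT INTO ts_kv (entity_id, key, ts, dbl_v) VALUES (?, ?, ?, ?)",
                entityId, key, ts, value);

        // One aggregation query per insert: at high ingest rates this multiplies
        // the read load on Cassandra and lets data back up inside Tempus.
        ResultSet rs = session.execute(
                "SELECT COUNT(*) FROM ts_kv WHERE key = ? AND ts > ? ALLOW FILTERING",
                key, ts - 60_000L);
        return rs.one().getLong(0);
    }
}
```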

SPParnerkar commented 5 years ago

4 pm:

Cassandra logs continue to flag the warning "aggregation query without a partition key in where clause". This indicates that the tag frequency changes must be causing the issue. We don't know yet whether there are other contributing factors.
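
For contrast, a partition-scoped shape of the same 1-minute count would not trip that warning. This is illustrative only and assumes entity_id and key make up the partition key of ts_kv, which may not match the actual schema.

```java
/**
 * Illustrative only: with the partition key fully specified in the WHERE clause,
 * the aggregation stays on a single replica set and Cassandra does not log the
 * "aggregation query without a partition key" warning.
 */
public final class ScopedTagFrequencyQuery {

    public static final String PARTITION_SCOPED_COUNT =
            "SELECT COUNT(*) FROM ts_kv WHERE entity_id = ? AND key = ? AND ts > ?";

    private ScopedTagFrequencyQuery() {
    }
}
```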

To continue with the investigation, we need to do one of these 3 things:

  1. Find a temporary code fix to bypass the aggregation queries.
  2. If #1 is not possible, completely roll back the tag frequency changes in production.
  3. Switch to Postgres in production for now.

Himanshu can do #1 and #2 quickly, but he will be available only tomorrow. Today, we can go with #2 [to be decided during today's stand-up].

SPParnerkar commented 5 years ago

11.30 am:

Implemented the fix for Telemetry data. It prevents the aggregation queries from being executed, which has led to a significant improvement in performance. We will now make the same change for Depth Series. Unless we identify other contributing causes, this issue should be resolved by the end of today.
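
A minimal sketch of the kind of temporary guard this describes, assuming a hypothetical configuration flag; the names are illustrative, not the actual Tempus code.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical temporary guard: when the flag is off, the per-insert aggregation
 * query is never issued and a neutral default is returned instead.
 */
public class TagFrequencyGuard {

    private final boolean tagFrequencyEnabled;

    public TagFrequencyGuard(boolean tagFrequencyEnabled) {
        this.tagFrequencyEnabled = tagFrequencyEnabled;
    }

    /** Runs the expensive aggregation only when the feature is switched on. */
    public long frequencyOrZero(LongSupplier aggregationQuery) {
        if (!tagFrequencyEnabled) {
            return 0L; // bypass: no aggregation query reaches Cassandra
        }
        return aggregationQuery.getAsLong();
    }
}
```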

This is a temporary fix. In the next 2-3 days, we will remove the existing functionality from the system completely.

The end-state solution for computing tag frequencies is being implemented as part of a 1.5 deliverable.

SPParnerkar commented 5 years ago

3.00 pm:

Besides the tag frequency issue, some more issues were found with respect to the implementation of Time Zones. Fixes for these have been deployed, and the Gateway has now been up and running for one hour. We will delete the Gateway again at 5.00 pm and observe for two more hours. If no new issues are found, we can downgrade the priority to Medium.

The final set of fixes is still estimated to take 2-3 days.

niaalex commented 5 years ago

Would you say this item is closed or downgraded to medium priority?

niaalex commented 5 years ago

Update - The PR for this item is in review. Jeopardy will be updated to Yellow pending ETA. @SPParnerkar please note the planned ETA.