Open akshaymhetre opened 5 years ago
URGENT - This is a VERY high priority. The fix for this impacts the entire effort. Please advise the ETA as soon as you can on the call Monday (it's fine to hold the fix until after the holiday). Thanks in advance.
Assigned to @SPParnerkar but this will be managed as a team activity.
Reminder - High Priority Red jeopardy in progress
Noted. This item is under close monitoring every 2 hours. Updates will be shared, starting 12 pm today.
12 pm:
Moved Gateway to another node. That node works fine, but the Tempus node crashed: Kubelet hangs on the node. At this stage, we suspect that it's a memory management issue, with too many pods running on limited memory.
Next step: stop the QA cluster, Ni-Fi and Cassandra to free up node memory and confirm whether memory is the real issue.
2 pm:
Confirmed - Tempus node runs out of memory, CPU utilisation > 99%. Potential causes of the excessive memory utilisation are to be investigated; DB latency might be causing data to be buffered by Tempus, leading to memory overflow. However, the same processing on a bigger set of data was carried out before April and ran successfully.
Currently investigating whether the tag frequency changes carried out in April might be leading to this situation. To compute tag frequency, an aggregation query over the last 1 minute is performed for each insert of a ts_kv record.
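To illustrate why a per-insert aggregation could become expensive, here is a minimal Python sketch. It is not Tempus code: the `Store` class, the in-memory list standing in for `ts_kv`, and the 60-second default window are assumptions made for illustration. It counts how many rows are examined when every insert triggers a count over the trailing window.

```python
from bisect import bisect_left


class Store:
    """Hypothetical in-memory stand-in for a ts_kv table (not Tempus code)."""

    def __init__(self, window_ms=60_000):
        self.window_ms = window_ms
        self.timestamps = []      # insert timestamps, assumed ascending
        self.rows_examined = 0    # total rows touched by window aggregations

    def insert(self, ts_ms):
        """Insert one record, then aggregate over the trailing window."""
        self.timestamps.append(ts_ms)
        return self.count_last_window(ts_ms)

    def count_last_window(self, now_ms):
        """Count records in (now - window, now]; models the per-insert query."""
        start = bisect_left(self.timestamps, now_ms - self.window_ms + 1)
        in_window = len(self.timestamps) - start
        self.rows_examined += in_window
        return in_window


def simulate(rate_hz, seconds):
    """Insert at a fixed rate and return total rows examined by aggregations."""
    store = Store()
    step_ms = 1000 // rate_hz
    for i in range(rate_hz * seconds):
        store.insert(i * step_ms)
    return store.rows_examined
```

In this toy model, `simulate(1, 10)` examines 55 rows while `simulate(10, 10)` examines 5050, so a tenfold increase in ingest rate costs far more than tenfold work. The model ignores how Cassandra actually executes the query, so it only suggests the shape of the problem.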
4 pm:
Cassandra logs continue to flag the warning "aggregation query without a partition key in where clause". This indicates that the tag frequency changes must be causing the issue. We don't know yet if there are other contributing reasons.
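The warning can be illustrated with a toy model of partitioned storage. This is a sketch under assumptions, not Cassandra's implementation: `PartitionedTable`, its dict-of-lists layout, and the `device-N` key names are hypothetical. It shows that a count restricted to one partition key touches a single partition, while the same count without a key has to visit every partition.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model: rows grouped by partition key (hypothetical, not Cassandra)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # partition key -> list of timestamps
        self.partitions_visited = 0          # running total across all queries

    def insert(self, partition_key, ts_ms):
        self.partitions[partition_key].append(ts_ms)

    def count_since(self, since_ms, partition_key=None):
        """Count rows with ts >= since_ms; scan all partitions if no key given."""
        if partition_key is not None:
            keys = [partition_key] if partition_key in self.partitions else []
        else:
            keys = list(self.partitions)
        self.partitions_visited += len(keys)
        return sum(1 for k in keys for ts in self.partitions[k] if ts >= since_ms)


table = PartitionedTable()
for device in range(100):
    for ts in (0, 1000, 2000):
        table.insert(f"device-{device}", ts)

keyed = table.count_since(1000, "device-7")   # visits 1 partition
unkeyed = table.count_since(1000)             # visits all 100 partitions
```

If such an unkeyed aggregation runs on every insert, the scan cost is paid once per record, which would be consistent with the slowdown described above.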
To continue with investigation, we need to do one of these 3 things:
Himanshu can do #1 and #2 quickly, but he will be available only tomorrow. Today, we can go with #2 [to be decided during stand-up today].
11.30 am:
Implemented the fix for Telemetry data. It prevents the aggregation queries from being executed. This has led to a significant improvement in performance. We will now make the same change for Depth Series. Unless we identify other contributing causes, this issue should be resolved by end of today.
This is a temporary fix. In the next 2-3 days, we will remove the existing functionality from the system completely.
The end-state solution to compute tag frequencies is being implemented as part of a 1.5 deliverable.
3.00 pm:
Besides the tag frequency issue, some more issues were found with the implementation of time zones. Fixes for these have been deployed and the Gateway has now been up and running for one hour. We will delete the Gateway again at 5.00 pm and observe for two more hours. If no new issues are found, we can downgrade the priority to Medium.
Final set of fixes are still estimated to take 2-3 days.
Would you say this item is closed or downgraded to medium priority?
Update - The PR for this item is in review. Jeopardy will be updated to Yellow pending ETA. @SPParnerkar please note the planned ETA.
Investigate the root cause of Tempus slowing down while ingesting high-frequency data objects from the gateway