Closed ThreadDao closed 1 week ago
Currently, a stats task runs before the index is built for each segment. Because there are many segments, the monitoring system shows the number of unissued index tasks slowly increasing, but the real cause is that the stats tasks are completing slowly. They are slow because the index node's connection to MinIO/S3 takes too long (more than 10 seconds), which causes datacoord to treat the task assignment as failed. However, the index node has already cached the task. When datacoord reassigns the task, it does not first clear the cache on the index node, so the index node reports that the task already exists. This blocks the task's progress, which explains why tasks execute quickly after the index node is restarted. Pull request #36371 will fix this.
- 10s timeout seems to be too slow
- We need cancel logic to make sure the index task is canceled.
PR #36371 will ensure tasks are cleaned up.
Is 10s too short for the stats task? Maybe we could try increasing it to 30s.
I think 10s is enough, as this is only the time for sending the request, not the time to wait for the stats task to complete.
/assign @ThreadDao please verify it. /unassign
Can't reproduce.