manticoresoftware / manticoresearch-buddy

Manticore Buddy is a Manticore Search's sidecar which helps it with various tasks
GNU General Public License v3.0
20 stars 2 forks source link

Issue with manticore version in telemetry #312

Closed donhardman closed 1 week ago

donhardman commented 1 month ago

Bug Description:

Starting with version 6.3.0, we've encountered an issue with telemetry due to a missing label for the Manticore version in our metrics.

We need to investigate and fix this problem.

Manticore Search Version:

6.3.0+

Operating System Version:

Ubuntu Jammy

Have you tried the latest development version?

None

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [x] Implementation completed - [x] Tests developed - [x] Documentation updated - [x] Documentation reviewed - [x] Changelog updated
donhardman commented 1 month ago

We've discovered an issue with fetching the version from Manticore. In some cases, when the instance is large and takes a long time to preload, it doesn't respond with the version string when we're loading the metric thread. As a solution, we can try lazy loading of versions and/or add retries. For a more comprehensive fix, we could pass it in Buddy launch as a parameter --versions.

donhardman commented 1 month ago

Lazy loading was implemented here: https://github.com/manticoresoftware/manticoresearch-buddy/pull/315

I discovered that about 9 months ago, we updated Buddy to print its version with a 'v' prefix, which was causing an issue.

In some rare cases, Manticore might not load the Buddy version in time when we send a request from Buddy. This led to some valid metrics, which is an unusual and rare condition. To address this, I've extended the regex to capture v-prefixed versions and improved it. I've also implemented lazy loading to eliminate daemon errors on fast startups. I've validated that it works as expected and refactored some things in the metrics, which should help.

This is a critical issue for metrics collection, and I think we should test and release it as soon as possible.

donhardman commented 1 month ago

I think it's a good idea to write a test that does the following:

  1. Set --telemetry-period=1 or some short interval for reports
  2. Make sure we set --debugv
  3. Verify that we have at least 5 rows with labels printed, all metrics are present, and no snapshots are skipped
  4. Repeat the same test with a more resource-intensive instance

For columnar, we should have: columnar_version, knn_version, secondary_version

WITHOUT columnar, we should only have: manticore_version, buddy_version

These last two labels must always be present.

sanikolaev commented 1 month ago

@donhardman the PR has been approved. Pls merge and we'll go ahead with the tests.

donhardman commented 1 month ago

The issue was merged. Let's proceed with the tests.

sanikolaev commented 1 month ago

@PavelShilin89 pls look into this issue to develop tests. If smth is not clear, pls talk to @donhardman

PavelShilin89 commented 2 weeks ago

Done in https://github.com/manticoresoftware/manticoresearch/pull/2512