tchiotludo / akhq

Kafka GUI for Apache Kafka to manage topics, topics data, consumers group, schema registry, connect and more...
https://akhq.io/
Apache License 2.0

Search feature in topics is not giving expected output #1479

Closed damienfayet closed 1 month ago

damienfayet commented 1 year ago

Hello,

We have some strange behavior while using the search feature.

On big topics with many partitions, the search is not giving consistent results. As an example, searching for a Key containing "7A1BP" will give a different result each time we click on the search button.

The sort is set to oldest, yet sometimes it displays data from partition 1 first and other times data from partition 2, etc.

Moreover, the total number of results is not working at all: it sometimes displays 13, 14, 15 or even 0 (while displaying data at the same time).

Expected outcome: a consistent result, sorted correctly, with a correct estimate of the number of matching results. No per-partition limit should be set.

Regards,

alwibrm commented 1 year ago

We're experiencing similar behavior. When searching for keys that have corresponding messages on the broker, we're not able to see any search results in many cases.

John-Athan commented 1 year ago

We are experiencing the same behaviour as @alwibrm . This issue only arises with new versions of AKHQ. We tried an older version which took way longer to search, but eventually consistently got the correct events.

tchiotludo commented 1 year ago

@AlexisSouquiere do you think your change here can have an impact on the subject ?

AlexisSouquiere commented 1 year ago

@tchiotludo nop I don't think so (and the 1st PR isn't mine 😄) because it's related to the oldest sort (on which I never worked). I already talked to @damienfayet and I found that the reason the search doesn't give the same output every time is because we have 1 consumer for all the partitions. By polling each partition with a single consumer we can't control which partition answers first. That's why search doesn't return the same result. Kafka guarantees order in a partition but not across partitions
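A toy illustration of the point above, not AKHQ's actual code: if a single consumer reads every partition and the search stops at the first few matches, the result depends on which partition happens to answer first. The function name, keys and values below are made up for the example.

```python
# Toy model (hypothetical names, no Kafka client involved).
# Each list is one partition's data, already ordered within the partition.

def first_matches(batches_in_arrival_order, key_fragment, limit):
    """Return the first `limit` records whose key contains `key_fragment`."""
    matches = []
    for batch in batches_in_arrival_order:
        for key, value in batch:
            if key_fragment in key:
                matches.append((key, value))
                if len(matches) == limit:
                    return matches
    return matches


partition_1 = [("7A1BP-001", "from p1"), ("ZZZ-002", "other")]
partition_2 = [("7A1BP-900", "from p2")]

# Same data, same query; only the order in which partitions answer differs.
run_a = first_matches([partition_1, partition_2], "7A1BP", limit=1)
run_b = first_matches([partition_2, partition_1], "7A1BP", limit=1)
```

Here `run_a` and `run_b` contain different records even though the topic content is identical, which matches the behaviour described in the report.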

I don't know if something changed on the oldest sort in the previous version but I can have a look

AlexisSouquiere commented 1 year ago

@John-Athan if you can please give us more details about the difference between the previous version and the current one (sort, filter used, output on the 2 versions or anything that could help us to understand)

One thing I forgot is that the issue @damienfayet is raising isn't only about v0.24.0 because he is still on v0.23.0

tchiotludo commented 1 year ago

OK, good catch, my fault, I was mixing up some issues. Your analysis is correct @AlexisSouquiere: that's the way Kafka works, and what @damienfayet reported is not a bug, so we can't do anything about it (even more so with big topics, where getting results from a single partition is even more likely, possibly every time). @damienfayet, Kafka is not a database and there is no "order by" across multiple partitions; that's the way Kafka is designed.

@alwibrm @John-Athan, have you tried raising the akhq.topic-data.poll-timeout setting? Its default is really low if you have large topics.

alwibrm commented 1 year ago

Not yet, but we will try. Thanks for the recommendation. P.S.: the issues seem to come up since 0.24.0.

John-Athan commented 1 year ago

Thank you for your suggestion! We tried setting the poll timeout higher (we tried the 5000ms, 10000ms and even 100000ms, the latter caused errors because we ran into Gateway Timeouts). This caused the search to go a bit faster, but still no results.

If we check Chrome's Dev Console, we can see that the EventStream for the search skips from a small percentage of progress to 100%. This seems odd IMO.

For reference: We ran into the error with AKHQ 0.23.0 and 0.24.0. The same search ran fine on AKHQ 0.18.0. We could check for which AKHQ version the error first appeared, if this information is relevant for narrowing down the issue.

EDIT: Found out that the issue first arises with 0.23.0. Found this MR, this seems very related: https://github.com/tchiotludo/akhq/pull/1468

damienfayet commented 1 year ago

Hello @tchiotludo, I know Kafka is not a database :D and I'm not the one using it like this but many users in our company are. But the feature exists and perhaps we can think about another way to write it. It's difficult to explain to users that the current behavior is "normal".

The issue is not about the order of the results; I agree that's not really possible. I'll have a look with @AlexisSouquiere, but I really think we can have a better search feature. For me, the minimum is to have a consistent answer at each run in terms of the number of results. The challenge will be to keep resource usage at an acceptable level.

r-roos commented 1 year ago

I am curious if there are any updates on this issue? From my own testing I can confirm that searching is basically very unreliable at the moment and you have no guarantee the data you are searching for is displayed at all.

The type of search I perform the most is searching for a specific key starting from a specific date and time. In the current state, it can happen that you start searching data from 2 weeks ago, you find a couple of results from, let's say, 14 and 13 days ago, and the search just stops. Once you change your start date to 13 days ago, you suddenly find a couple of results from 12 days ago and the search stops again.

tchiotludo commented 1 year ago

Can you try increasing akhq.topic-data.poll-timeout to a larger value to see if it works? See here

AlexisSouquiere commented 1 year ago

I think I can give more consistency to the search results in both oldest and newest sort. I did some tests on the newest sort and I should be able to apply the same philosophy to the oldest sort.

With the newest sort we have 1 consumer per partition, and each tries to consume max.poll.records / number of partitions records. Depending on the amount of data in each partition, you can end up with fewer than max.poll.records records because one partition has no more records to consume. Instead of giving this max.poll.records / number of partitions quota to each consumer, we can simply give max.poll.records. It means that for a topic with 3 partitions and max.poll.records = 15, each consumer will consume 15 records instead of 5. BUT we won't return these 15*3=45 records; we will still return the desired 15. We will keep the current behaviour that sorts records by timestamp and simply keep the first 15 records.
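The difference between the two quotas can be sketched with a small simulation. This is an illustrative model under stated assumptions, not the code from the branch: the `Record` type, the function names and the tie-break on partition and offset are hypothetical, and no Kafka client is involved.

```python
from typing import NamedTuple


class Record(NamedTuple):
    partition: int
    offset: int
    timestamp: int


def newest_per_partition(partitions, limit):
    """Up to `limit` newest records from each partition (one consumer per partition)."""
    fetched = []
    for records in partitions.values():
        newest_first = sorted(records, key=lambda r: r.timestamp, reverse=True)
        fetched.extend(newest_first[:limit])
    return fetched


def merge_newest(fetched, max_poll_records):
    """Sort by timestamp (newest first) and keep the page size.
    The (partition, offset) tie-break is an assumption to make the order stable."""
    ordered = sorted(fetched, key=lambda r: (-r.timestamp, r.partition, r.offset))
    return ordered[:max_poll_records]


def page_split_quota(partitions, max_poll_records):
    # Behaviour described first: each consumer gets max.poll.records / partitions.
    quota = max_poll_records // len(partitions)
    return merge_newest(newest_per_partition(partitions, quota), max_poll_records)


def page_full_quota(partitions, max_poll_records):
    # Proposed behaviour: each consumer gets max.poll.records, then the merge truncates.
    return merge_newest(newest_per_partition(partitions, max_poll_records), max_poll_records)


# Uneven topic: partition 0 holds most of the data.
partitions = {
    0: [Record(0, i, 100 + i) for i in range(20)],
    1: [Record(1, 0, 200), Record(1, 1, 201)],
    2: [Record(2, 0, 50)],
}

split_page = page_split_quota(partitions, 15)
full_page = page_full_quota(partitions, 15)
```

With this data the split quota returns only 8 records (5 + 2 + 1), while the full quota fetches 18 and returns the 15 newest, at the cost of fetching more than is shown.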

Because the pagination link that gives start offset for the next call is built after, fetching more than the desired number of records is not a problem. Of course for each call to the search feature we will fetch more records than expected. It will imply a longer execution time (probably insignificant) but the result will be consistent for users.

If we apply the same logic to the oldest sort, one consumer by partition instead of the single consumer that produces inconsistency because we don't know which partition will answer first, we should be able to solve this "issue"

@damienfayet, @tchiotludo WDYT ?

r-roos commented 1 year ago

Can you try increasing akhq.topic-data.poll-timeout to a larger value to see if it works? See here

I have tried 10000 and 60000 and experienced no change in behavior as previously experienced.

AlexisSouquiere commented 1 year ago

I started a branch on my fork to work on it and try to propose an alternative (https://github.com/AlexisSouquiere/akhq/tree/feat/improve-topic-data) with the same logic (1 consumer per partition, to be sure the same request produces the same result). I don't want to create a PR now because the new release is imminent and the change has too much impact, but I will do more tests with concrete topics and data in the coming weeks. Then I'll create the PR.

tchiotludo commented 1 year ago

@AlexisSouquiere does this lead to a slower search then? I'm almost sure it will :cry:

AlexisSouquiere commented 1 year ago

I'll monitor the search time also but I didn't experience any slowness yet. I'm not really sure it will (the x consumers are polling in parallel and even if we are polling more records, the difference shouldn't be noticeable). That's why I need to take time to benchmark the 2 approaches :)

alwibrm commented 9 months ago

Hi, is there any progress in the improvements for the search function? It's one of the most crucial features for us, but currently hardly usable. Can we provide any information or contribution?

AlexisSouquiere commented 9 months ago

@alwibrm I worked on it recently and deployed my branch in my company to do some deeper tests with real data. The search seems to work better. Would it be possible for you to do the same with https://github.com/AlexisSouquiere/akhq/tree/feat/improve-topic-data ? It would be interesting to have more feedback on it.

For a faster search on bigger topics, you can increase akhq.clients-defaults.consumer.properties.max.poll.records to a higher value (50 by default). It's the number of records polled each time, until we get enough results or reach the end of the topic.

I think I'll create a discussion about this topic and the changes I made.
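A rough sketch of why the batch size matters, assuming the search loop works as described (poll a batch, filter it, stop once enough matches are found or the topic is exhausted). The `search` function and its parameters are hypothetical and only count polls; they say nothing about real timings in AKHQ.

```python
def search(records, predicate, wanted, max_poll_records):
    """Poll `records` in batches of `max_poll_records` until `wanted` matches
    are found or the end is reached. Returns (matches, number_of_polls)."""
    matches = []
    position = 0
    polls = 0
    while position < len(records) and len(matches) < wanted:
        batch = records[position:position + max_poll_records]
        polls += 1
        position += len(batch)
        matches.extend(r for r in batch if predicate(r))
    return matches[:wanted], polls


# Sparse matches in a 1000-record topic: only every 250th record matches.
topic = list(range(1000))

def is_match(record):
    return record % 250 == 249

small_batches = search(topic, is_match, wanted=3, max_poll_records=50)
large_batches = search(topic, is_match, wanted=3, max_poll_records=500)
```

Both calls return the same three matches, but the larger batch size needs 2 polls instead of 15 in this toy setup.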

alwibrm commented 8 months ago

Update: we got it working. Unfortunately the login via OpenID wasn't possible anymore after using your feature branch. The AKHQ log states a HTTP 200, login successful, but you don't get redirected after the login. Instead you're stuck on the login page.


Hi @AlexisSouquiere, Sorry for the late response. Today we tried to build a Docker image from your fork. Unfortunately we're not able to get it working. Here is what we did:

1.) git clone https://github.com/AlexisSouquiere/akhq.git
2.) ./gradlew shadowJar
3.) cp build/libs/*-all.jar docker/app/
4.) docker build -t akhq-improved-search .
5.) docker run -p 8080:8080 akhq-improved-search

When we do so, the following output is shown in the logs:

Error: Could not find or load main class org.akhq.App
Caused by: java.lang.ClassNotFoundException: org.akhq.App

Are we missing anything?

AlexisSouquiere commented 8 months ago

@alwibrm after checking out the feat/improve-topic-data branch, please try in step 2) ./gradlew classes testClasses --parallel --no-daemon and then ./gradlew shadowJar distTar distZip --no-daemon

alwibrm commented 7 months ago

Hi @AlexisSouquiere, Sorry for the late reply. Unfortunately I get the same ClassNotFoundException when compiling with the suggested commands.

alwibrm commented 7 months ago

@AlexisSouquiere Short update on this: we got it working by renaming the resulting JAR file in step 3 to "akhq.jar". With your feature branch the OpenID login wasn't working anymore (loop on login page). Therefore we disabled security for testing.

The search by NEWEST looked promising on topics with 1-3 partitions. On topics with for example 18 partitions the search ran infinitely, although they only contained 25 records in total. Also the search by OLDEST didn't seem to work, but I guess that's what you mentioned earlier in this discussion.

Have you tested with a larger number of partitions, too?

jonasvoelcker commented 6 months ago

Hey @AlexisSouquiere,

we finally managed to run the branches locally (FYI: @alwibrm) and now we see differences in loading the messages in one of our "problem-topics".

Whereas the dev branch shows all 5 messages, your feature branch only shows 4 of them. ;)

Best Regards Jonas

alwibrm commented 1 month ago

Can this issue be closed due to several fixes that came in 0.25.0?