vespa-engine / vespa

AI + Data, online. https://vespa.ai

Field of data type long shows up with Long.MIN value during TTL of documents #27177

Open wsandeepd opened 1 year ago

wsandeepd commented 1 year ago

Describe the bug
We're running into an issue with one of our Vespa schemas that has document TTL enabled: documents are TTL'ed 36 hours after ingestion, after one version of the dataset has been ingested. While the TTL is in progress, we observe a strange behaviour where one of our fields, snapshot_version (of type long), shows up as Long.MIN. We know for a fact that our ingestion code never sets this value to a negative number, as it is ingested as a timestamp value from upstream. When we run a YQL query like this, where x is one of the snapshot_version values present in the records:

select * from my_schema where snapshot_version = x limit 0 | all(group(snapshot_version) each(output(count())))

we see that Vespa returns this negative value in one of the groups, while the others are expected values.

Another observation: even when the YQL filters on one of the expected snapshot_version values, the returned results include documents from the Long.MIN snapshot version. Due to this behaviour, we see inconsistent values for other fields in the results of our grouping/aggregation queries.

To Reproduce

Set up garbage collection on the schema:

<documents garbage-collection="true">
    <document type="my_schema" mode="index" selection="my_schema.timestamp &gt; now() - 129600" />
    <document-processing cluster="feed" />
</documents>
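
(For reference: 129600 seconds is 36 * 3600, matching the 36-hour TTL described above, so documents whose timestamp is older than 36 hours stop matching the selection and become eligible for garbage collection.)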

Define schema as below:

field my_schema type long {
    indexing: attribute
}

field version type string {
}

field snapshot_version type long {
    indexing: attribute
    attribute: fast-search
    rank: filter
}

field timestamp type long {
    indexing: attribute
    attribute: fast-access
}

Run the YQL query while the TTL is in progress:

select * from my_schema where snapshot_version = x limit 0 | all(group(snapshot_version) each(output(count())))

While the TTL is in progress, a snapshot_version of Long.MIN also shows up as one of the grouping results, even though it is not provided in the filter.

Expected behavior
The grouping result with snapshot_version = Long.MIN should not appear in the results, as it was not provided in the YQL filter clause.

Environment:
- OS: Linux
- Infrastructure: Kubernetes
- Vespa version: 8.128.22

Additional context
I have confirmed that we do not ingest or have documents where this attribute is missing a value, as the value is computed upstream as the current timestamp. Also, please note that this behaviour happens only during ingestion and works perfectly fine otherwise.

I want to add a couple of observations from while documents are getting TTL'ed. When running our YQL, I can see results where snapshot_version is Long.MIN, but those documents vanish (presumably when they are actually evicted from memory?) before new ones show up. This behaviour continues until all the documents are TTL'ed.

jobergum commented 1 year ago

One possible explanation is that a grouping query is not one atomic read operation, and some state is preserved between t1 and t3 in the stateless container.

Plausible @bjorncs @baldersheim ?

wsandeepd commented 1 year ago

@jobergum but the documents that were deleted belong to another snapshot_version y older than x. So shouldn't they have been filtered out in the first place at t1?

jobergum commented 1 year ago

How can you be sure about that, since your expiry is not based on snapshot_version?

wsandeepd commented 1 year ago

@jobergum we create a new snapshot_version for every ingestion (typically, we have one ingestion per day) and we query for the latest snapshot_version known to us through another service that keeps track of the versions. Also, when we ran the YQL repeatedly, the number of documents for the existing snapshot_version did not change.

baldersheim commented 1 year ago

I guess this happens during what we call lid space compaction. Internally in the backend, each document is assigned a local document id (lid) when it is ingested. This lid is an integer in the range [1 ... current_lid_limit]. When a document is removed, its lid is placed on a freelist after all live references to it have gone. A new document is then given a lid from the freelist, or assigned the next integer after current_lid_limit, which is then increased.

Having a large lid freelist has a cost: both extra memory and more CPU spent at query time. We call this waste lid bloat. Once the lid bloat exceeds a threshold, a background process, lid-space compaction, kicks in. Documents with the highest lids are moved to lower vacant lids: each document is first copied to its new location, and then deleted from its old location.

I assume what happens is that first the query is executed and a document with a high lid is a match. But when grouping is executed afterwards, this document has completed its move to a new, lower lid, and its data has been deleted from its older, higher lid.

With your scenario of generational feeding and TTL-based removal, you are more likely to observe this. In this use case lid-space compaction is rather an anti-feature.

1. We will add configuration to disable lid-space compaction.
2. If my theory is correct, we will investigate ways to improve it, most likely by deferring removal of data at the old, high lid until all references are gone.
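
To make the mechanism described above concrete, here is a minimal, purely illustrative Python sketch of a lid allocator with a freelist and a compaction step. This is not Vespa code; all names in it (LidSpace, allocate, remove, maybe_compact, and the 0.2 threshold) are made up for the example.

# Illustrative model of lid allocation, lid bloat and lid-space compaction
# as described in the comment above. Not Vespa code.

class LidSpace:
    def __init__(self, max_bloat_factor: float = 0.2):
        self.lid_limit = 0          # highest lid handed out so far
        self.freelist = []          # lids of removed documents, available for reuse
        self.docs = {}              # lid -> document payload
        self.max_bloat_factor = max_bloat_factor

    def allocate(self, doc) -> int:
        # Reuse a freed lid if one exists, otherwise extend the lid range.
        if self.freelist:
            lid = self.freelist.pop()
        else:
            self.lid_limit += 1
            lid = self.lid_limit
        self.docs[lid] = doc
        return lid

    def remove(self, lid) -> None:
        # Once all live references are gone, the lid goes on the freelist.
        del self.docs[lid]
        self.freelist.append(lid)

    def bloat(self) -> float:
        # Fraction of the lid range that is unused ("lid bloat").
        return len(self.freelist) / self.lid_limit if self.lid_limit else 0.0

    def maybe_compact(self) -> None:
        # Background compaction: move documents with the highest lids down
        # into vacant lower lids (copy first, then delete the old location).
        if self.bloat() <= self.max_bloat_factor:
            return
        while self.freelist:
            lowest_free = min(self.freelist)
            highest_used = max(self.docs, default=0)
            if highest_used <= lowest_free:
                break  # lid space is already dense
            self.freelist.remove(lowest_free)
            self.docs[lowest_free] = self.docs[highest_used]   # copy to new lid
            del self.docs[highest_used]                        # delete old lid
            self.freelist.append(highest_used)
        # Shrink the lid range and drop free lids that are now out of range.
        self.lid_limit = max(self.docs, default=0)
        self.freelist = [lid for lid in self.freelist if lid <= self.lid_limit]

For example, after allocating lids 1, 2 and 3 and then removing lid 1, maybe_compact moves the document at lid 3 down to lid 1; a query that matched that document at lid 3 just before the move would find its data gone when grouping runs afterwards, which is the race described above.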

wsandeepd commented 1 year ago

Thanks @baldersheim. So, if I understand correctly, since we're seeing negative values, it means that a few documents which were originally selected (before grouping) had been deleted by the time grouping was executed. Is it safe to say that these documents, which should have contributed a value after grouping, aren't actually considered because of compaction?

wsandeepd commented 1 year ago

@baldersheim @jobergum thank you so much for the quick turnaround on this issue. My original thought was to do a grouping on our snapshot_version field and filter out the group with Long.MIN, but after @baldersheim's comment, I understand that is incorrect.

Since this is affecting our production systems, I was hoping to get an ETA for the release with the fixes for this issue, or any recommendations on what we can do to work around this problem. Please let me know. Thank you!

baldersheim commented 1 year ago

1. Configuration control of lid-space compaction will come next week.
2. We will know the ETA for better handling of lid-space compaction next week.

baldersheim commented 1 year ago

PR for item 1 is in #27223

wsandeepd commented 1 year ago

@baldersheim thank you! What would be the impact of disabling lid-space compaction? As I understand it, it's bound to increase the memory footprint, so would you suggest it's something we should do when it's available?

baldersheim commented 1 year ago

It will not increase the maximum memory usage; memory usage will stay at its peak. Let us say that at some point in time you have 20M docs on your node consuming, for simplicity, 20G of memory. If you then remove 10M docs, you will use 10G of memory afterwards once the lid space has been compacted. If you disable lid-space compaction, your memory usage will stay somewhere between 10G and 20G; your schema decides whether you will be closer to 10G or to 20G. If your corpus again grows to 20M docs, you will not grow beyond 20G, as that bloat will be reused. So in your use case I would disable lid-space compaction. You will avoid wasting CPU and I/O to keep the lid space compact.

baldersheim commented 1 year ago

Documentation added in https://github.com/vespa-engine/documentation/pull/2690

wsandeepd commented 1 year ago

Thanks @baldersheim, that makes sense. Once the update with the config option to disable lid-space compaction is available, we will use it. Please let us know, thanks!

geirst commented 1 year ago

Tuning lidspace max-bloat-factor is part of Vespa 8.171.14.
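
Based on the services-content reference updated by the documentation PR linked above, the setting appears to live under the content cluster's proton search-node tuning. A configuration sketch follows; the exact element placement is my reading of the reference and should be verified against the current docs, the content cluster id my-content is a placeholder, and the 0.95 value is just the one discussed in this thread.

<content id="my-content" version="1.0">
    <engine>
        <proton>
            <tuning>
                <searchnode>
                    <lidspace>
                        <max-bloat-factor>0.95</max-bloat-factor>
                    </lidspace>
                </searchnode>
            </tuning>
        </proton>
    </engine>
    <!-- documents, nodes, etc. as in the existing services.xml -->
</content>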

wsandeepd commented 1 year ago

Thank you @geirst! @baldersheim, for our use case, would you say that a value of 0.95 would be good enough to avoid the issue?

wsandeepd commented 1 year ago

@jobergum do you think this is fine?

baldersheim commented 1 year ago

0.95 sounds high enough. It depends on how much your corpus varies in size. If you use a generational approach and have 2 generations, a value of 0.6 would be fine.

wsandeepd commented 1 year ago

@baldersheim by setting the value to 0.6, we're effectively not disabling lid-space compaction entirely but limiting when it occurs (for example, when 60% of the documents have been TTL'ed). So, if I'm not wrong, the issue of getting negative values with our generational workload will still happen, although less severely. Is this understanding correct?

If we want to entirely disable compaction by setting the value to 1.0, we will have to live with the bloat (and thus a memory peak) until the next version is written and the lid values are reused. If this is a reasonable compromise for us, do you suggest setting the value to 1.0 to disable compaction, as this will ensure correctness of the Vespa results?

baldersheim commented 1 year ago

Your understanding is correct. If you have a huge variation in the number of documents concurrently on a node, you should set it higher.

wsandeepd commented 1 year ago

Thank you @baldersheim. We'll set the value to 1.0 in our test environment and observe the behaviour. I will get back with the observations.

wsandeepd commented 1 year ago

Hi @baldersheim @jobergum, we deployed Vespa 8.174.17 with the max-bloat-factor config parameter set to 1. As per my understanding, this translates to never triggering lid-space compaction unless all the documents in our schema are deleted. However, even when we have one snapshot_version of the data in the schema, the issue with the Long.MIN group is still reproducible. Maybe I'm missing something, so I was wondering if I could get some help here, thanks.

baldersheim commented 1 year ago

I think we need to establish state first:
1. Did you observe any difference at all when you turned off lid-space compaction?
2. In your last comment you state "one snapshot_version"? Does that mean that all documents in the corpus have the same version, so that the query actually matches all documents?

If your query matches documents that have reached the TTL, you still have a chance of hitting https://github.com/vespa-engine/vespa/issues/27177#issuecomment-1558648675. That problem does not have a solution. The probability of observing this issue is linear in the time between T1 and T3, which is linear in the number of documents matching the query.

wsandeepd commented 1 year ago

@baldersheim please find my replies below:

  1. Not really - with max-bloat-factor set to 1.0, we found no change in behaviour.
  2. We always have more than one snapshot_version during the time this issue is reproducible. The documents with the older snapshot_version are the dataset that gets TTL'ed during this time.

We always query with the newest snapshot_version - the one that's not getting TTL'ed at that moment - so our query never matches the docs that have reached the TTL.

wsandeepd commented 10 months ago

Hi @baldersheim, a correction on point 1. At some point, when one snapshot_version (the older version) is getting TTL'ed, we have another version (the newer version) which is active and not being TTL'ed. When we set max-bloat-factor to 1.0, we observe that querying with a filter on the new snapshot_version does not return the negative version. However, it does appear when we query with a filter on the version that's getting TTL'ed.

Since we've effectively disabled compaction by setting the value to 1.0, documents from the newer version aren't changing their lids, which is why this issue is mitigated. Can you confirm whether this understanding is correct?