vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Vespa visit not returning deleted documents when selection criteria is added. #30097

Closed 107dipan closed 7 months ago

107dipan commented 8 months ago

Describe the bug: Vespa visit does not return deleted documents when a selection criterion is added.

To Reproduce:

  1. Ingest a document, then delete it.
  2. Perform a visit with a selection criterion. We have also tried with fromTimestamp and toTimestamp.

Expected behavior: Deleted documents are returned by the visit. Observed: deleted documents are not returned.


Environment:

Vespa version 8.294.50

vekterli commented 8 months ago

From your description I'm assuming this is the behavior you would like to observe (please correct me if I'm wrong):

  1. Feed a document $\cal D$ with ID ${\cal I_D}$ and field value state ${\cal S_D}$ at timestamp 100.
  2. At a later point in time, delete the document with ID ${\cal I_D}$ at timestamp 200.
  3. Start a visitor operation with timestamp interval $[50, 150)$.
  4. Visitor returns document ID $\cal I_D$ with state $\cal S_D$ as it existed at feed time, i.e. the tombstone at time 200 is ignored.

Unfortunately, this is not the behavior you will observe in practice, as Vespa is not a multi-version store. Visits with timestamp ranges return documents that were last modified within the given timestamp range. Deletion is considered a modification.

Some details as to why this is the case:

Vespa's internal data model is logically[^1] a mapping of any stored document $d$ from document ID ${\cal I}_d \mapsto {\langle T, {\cal S} \rangle}_d$ where $T_d$ is the wall-clock timestamp of the most recent mutation to that document and ${\cal S}_d$ is the current document state, which is either a set of populated document fields or a tombstone sentinel $\cal T$.

Since only the most recent mapping is retained, deleting document $\cal D$ in this case transitions the internal state from ${\cal I_D} \mapsto \langle 100, {\cal S_D} \rangle$ to ${\cal I_D} \mapsto \langle 200, {\cal T} \rangle$. Knowledge of any prior version is immediately garbage-collected from the system and therefore cannot be returned by a timestamp range visit.
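
To make the state transition concrete, here is a minimal Python sketch of the logical mapping described above. It is a toy model, not Vespa code: the class `SingleVersionStore`, its methods, and the document ID are hypothetical names chosen for illustration, and whether a real visitor surfaces tombstones depends on how the visit is configured.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical sentinel standing in for the tombstone state.
TOMBSTONE = None


@dataclass
class Entry:
    timestamp: int          # time of the most recent mutation
    state: Optional[dict]   # populated fields, or TOMBSTONE after a delete


class SingleVersionStore:
    """Toy model: each document ID maps to exactly one (timestamp, state) pair."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def put(self, doc_id: str, fields: dict, timestamp: int) -> None:
        # Overwrites whatever was stored before; no history is kept.
        self._entries[doc_id] = Entry(timestamp, dict(fields))

    def remove(self, doc_id: str, timestamp: int) -> None:
        # A delete is just another mutation: it replaces the entry with a tombstone.
        self._entries[doc_id] = Entry(timestamp, TOMBSTONE)

    def visit(self, from_ts: int, to_ts: int) -> List[Tuple[str, Entry]]:
        # Returns entries whose latest mutation falls in the half-open range [from_ts, to_ts).
        return [(doc_id, entry) for doc_id, entry in self._entries.items()
                if from_ts <= entry.timestamp < to_ts]


store = SingleVersionStore()
store.put("id:ns:doc::1", {"title": "example"}, timestamp=100)
store.remove("id:ns:doc::1", timestamp=200)

print(store.visit(50, 150))   # empty: the version written at 100 no longer exists
print(store.visit(150, 250))  # only the tombstone written at 200 is found
```

In this model the range $[50, 150)$ matches nothing after the delete, because the only surviving entry carries timestamp 200.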

[^1]: In the real world, with potentially inconsistent data across replicas, Vespa performs on-demand write-repair and read-repair to maintain the illusion of such a logical mapping.

107dipan commented 8 months ago

Hi @vekterli, we are using a selection criterion that is a range query on a timestamp added by our webserver. For now, as a workaround to fetch the deleted documents, we are using the selection criterion "not schemaName.uniqueId", where uniqueId is an ID stamped on all documents by our webserver. Our understanding is that, since Vespa retains tombstones for 2 weeks by default, we should get all documents that were deleted in that period.
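
For reference, a visit request of the kind described here could be assembled roughly as follows. This is only a sketch under assumptions: it assumes the visit goes through the /document/v1 HTTP API, and the endpoint, namespace, document type and cluster name are hypothetical placeholders. The helper `build_visit_url` is not part of any Vespa client, and the timestamp unit (believed to be microseconds since epoch) should be checked against the API reference.

```python
from typing import Optional
from urllib.parse import urlencode


def build_visit_url(endpoint: str, namespace: str, doctype: str, cluster: str,
                    selection: str,
                    from_ts: Optional[int] = None,
                    to_ts: Optional[int] = None) -> str:
    """Build a visit URL (hypothetical helper; it only formats a string)."""
    params = {"cluster": cluster, "selection": selection}
    # Timestamp bounds are optional, matching the two variants tried in the report.
    if from_ts is not None:
        params["fromTimestamp"] = from_ts
    if to_ts is not None:
        params["toTimestamp"] = to_ts
    return f"{endpoint}/document/v1/{namespace}/{doctype}/docid?{urlencode(params)}"


# All concrete values below are placeholders, not taken from the report.
url = build_visit_url(
    endpoint="http://localhost:8080",
    namespace="mynamespace",
    doctype="schemaName",
    cluster="mycluster",
    selection="not schemaName.uniqueId",
    from_ts=1_700_000_000_000_000,
    to_ts=1_700_000_600_000_000,
)
print(url)
```

The selection string is the workaround expression quoted above; URL-encoding it avoids problems with the space in the expression.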

vekterli commented 8 months ago

When you refer to "deleted documents" do you mean the actual document contents (i.e. complete with field values), or simply the tombstones?

Documents that have been deleted cannot be retrieved (other than their ID in the form of a tombstone) even when visiting using a timestamp range that covers the original feed time.