pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
https://pathway.com

[Bug]: `UDF_CACHING` persistence mode persists input if `persistent_id` is set. #59

Open KamilPiechowiak opened 2 weeks ago

KamilPiechowiak commented 2 weeks ago

Steps to reproduce

The code below persists its input, and I am not sure that it should. Note that persistence_mode is set to UDF_CACHING:

import pathway as pw

class InSchema(pw.Schema):
    a: int
    b: int

t = pw.io.csv.read("a.csv", persistent_id="abc", schema=InSchema, mode="static")

persistence_backend = pw.persistence.Backend.filesystem("./xyz")
persistence_config = pw.persistence.Config.simple_config(
    persistence_backend,
    persistence_mode=pw.PersistenceMode.UDF_CACHING,
)
pw.debug.compute_and_print_update_stream(t, persistence_config=persistence_config)

If you run the code twice, you'll see that the values are read from persistence on the second run.
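The same thing can be checked without relying on the printed output, by looking at what the run left in the persistence directory. Below is a minimal sketch, assuming the backend root is the ./xyz directory from the snippet above. It assumes nothing about Pathway's on-disk layout and only lists whatever files exist. The helper name list_persisted_files is made up for illustration and is not part of Pathway.

```python
from pathlib import Path


def list_persisted_files(root: str) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under root."""
    base = Path(root)
    if not base.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return []
    return sorted(
        (str(p.relative_to(base)), p.stat().st_size)
        for p in base.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # "./xyz" is the filesystem backend path used in the reproduction.
    for rel_path, size in list_persisted_files("./xyz"):
        print(f"{size:>10}  {rel_path}")
```

Running this after the first run, and comparing the listing with and without persistent_id set on the connector, should show whether anything beyond UDF cache data is being written.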

Relevant log output

First run:
            | a | b | __time__      | __diff__
^31NXFBM... | 1 | 3 | 1718180081298 | 1
^TC3B0CF... | 2 | 4 | 1718180081298 | 1
^VH8R9JC... | 3 | 5 | 1718180081298 | 1

Second run:
            | a | b | __time__ | __diff__
^31NXFBM... | 1 | 3 | 0        | 1
^TC3B0CF... | 2 | 4 | 0        | 1
^VH8R9JC... | 3 | 5 | 0        | 1
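The difference between the two runs is in the __time__ column. Below is a small sketch that extracts that column from the printed update stream so the two runs can be compared in a script. It assumes the pipe-separated layout shown above and treats __time__ == 0 as a sign of replay from persistence, which is my reading of the logs rather than documented behavior. The function name times_from_update_stream is hypothetical.

```python
def times_from_update_stream(text: str) -> list[int]:
    """Extract the __time__ column from a printed update stream."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The header row names the columns; the first cell (row ids) is empty.
    header = [cell.strip() for cell in lines[0].split("|")]
    time_idx = header.index("__time__")
    return [int(ln.split("|")[time_idx].strip()) for ln in lines[1:]]


first_run = """\
            | a | b | __time__      | __diff__
^31NXFBM... | 1 | 3 | 1718180081298 | 1
^TC3B0CF... | 2 | 4 | 1718180081298 | 1
^VH8R9JC... | 3 | 5 | 1718180081298 | 1
"""

second_run = """\
            | a | b | __time__ | __diff__
^31NXFBM... | 1 | 3 | 0        | 1
^TC3B0CF... | 2 | 4 | 0        | 1
^VH8R9JC... | 3 | 5 | 0        | 1
"""

if __name__ == "__main__":
    for name, out in (("first", first_run), ("second", second_run)):
        times = times_from_update_stream(out)
        print(name, "all rows at time 0:", all(t == 0 for t in times))
```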

What did you expect to happen?

Either UDF_CACHING mode should not persist the input even when persistent_id is set, or Pathway should raise an error when persistent_id is set in UDF_CACHING mode.
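To make the second option concrete, here is a rough sketch of the kind of check I have in mind. Everything in it is hypothetical: Mode is a stand-in for pw.PersistenceMode and check_connector is an invented helper, so this is not Pathway's actual API or implementation, only an illustration of the expected behavior.

```python
from enum import Enum, auto
from typing import Optional


class Mode(Enum):
    """Hypothetical stand-in for pw.PersistenceMode; not Pathway's enum."""

    UDF_CACHING = auto()
    OTHER = auto()  # placeholder for any mode that does persist input


def check_connector(mode: Mode, persistent_id: Optional[str]) -> None:
    """Reject a connector persistent_id when only UDF caching is requested."""
    if mode is Mode.UDF_CACHING and persistent_id is not None:
        raise ValueError(
            f"persistent_id={persistent_id!r} is set, but persistence_mode "
            "is UDF_CACHING, which should not persist connector input"
        )


if __name__ == "__main__":
    check_connector(Mode.OTHER, "abc")         # accepted
    check_connector(Mode.UDF_CACHING, None)    # accepted
    try:
        check_connector(Mode.UDF_CACHING, "abc")
    except ValueError as exc:
        print("rejected:", exc)
```

A warning instead of an exception would also work; the point is that the combination should not silently persist the input.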

Version

0.12.0

Docker Versions (if used)

No response

OS

Linux

On which CPU architecture did you run Pathway?

None

embe-pw commented 2 weeks ago

In general, persistence_mode is not documented well enough. I agree it is confusing that enabling UDF caching also enables the rest of the persistence mechanisms.