treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

[Bug]: DuckDB query doesn't show updated results if the object changes elsewhere #5494

Closed rmoff closed 1 year ago

rmoff commented 1 year ago

What happened?

Current Behavior:

If you execute a query in the DuckDB pane of the object page and the underlying object then changes, re-executing the query still returns the old results.

Steps to Reproduce:

  1. Spin up the Docker Compose from https://github.com/treeverse/lakeFS/tree/docs/devex-173-quickstart/quickstart

  2. From http://127.0.0.1:8000/repositories/quickstart/object?ref=main&path=lakes.parquet run the default DuckDB query and note the results.

  3. Get a DuckDB CLI prompt:

    docker exec -it duckdb duckdb

  4. Load the Parquet file as a table, delete some rows, and write it back to lakeFS:

    SET s3_endpoint='lakefs:8000';
    SET s3_access_key_id='AKIA-EXAMPLE-KEY';
    SET s3_secret_access_key='EXAMPLE-SECRET';
    SET s3_url_style='path';
    SET s3_region='us-east-1';
    SET s3_use_ssl=false;
    
    CREATE TABLE lakes AS select * from read_parquet('s3://quickstart/main/lakes.parquet');
    DELETE FROM lakes WHERE country != 'Denmark';
    COPY lakes TO 's3://quickstart/main/lakes.parquet' (FORMAT 'PARQUET', ALLOW_OVERWRITE TRUE);
  5. Read the parquet file back directly to verify the change to the data:

    SELECT * 
    FROM read_parquet('s3://quickstart/main/lakes.parquet')
    LIMIT 20;
  6. In the same browser window as before, click Execute. Note that the data does not change. Even if you change the value in the LIMIT clause (e.g. from 20 to 5), the new data is not shown.

    Refresh the web page using the browser's controls and note that the correct data is now shown.

    https://user-images.githubusercontent.com/3671582/225373703-2a7a2b99-f2ac-483e-953e-9bdf7ff6c6fb.mp4

Expected Behavior

When you run a query with DuckDB it should show the current data in the file.

If it is not going to do this, the UI should indicate very clearly that the data could be stale and provide a button to force a refresh without requiring the user to reload the page (and thus lose their SQL query).
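To illustrate the requested behavior, here is a minimal sketch of a cache that serves stored data by default but lets the caller force a re-read. It is not the lakeFS UI code; `ObjectCache`, `force_refresh`, and the in-memory `store` are hypothetical names used only for this example.

```python
from typing import Callable, Dict


class ObjectCache:
    """Hypothetical in-memory cache of object contents, keyed by path."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch  # reads the object from the backing store
        self._entries: Dict[str, bytes] = {}

    def read(self, path: str, force_refresh: bool = False) -> bytes:
        # Serve the cached copy unless the caller explicitly asks for fresh data
        # or the object has never been read before.
        if force_refresh or path not in self._entries:
            self._entries[path] = self._fetch(path)
        return self._entries[path]


# Stand-in for the object store (assumption: a plain dict instead of lakeFS).
store = {"main/lakes.parquet": b"v1"}
cache = ObjectCache(lambda p: store[p])

first = cache.read("main/lakes.parquet")
store["main/lakes.parquet"] = b"v2"  # the object changes elsewhere
stale = cache.read("main/lakes.parquet")  # still the old bytes
fresh = cache.read("main/lakes.parquet", force_refresh=True)  # re-fetched
```

A "refresh" button in the query pane would correspond to calling `read` with `force_refresh=True`, so the SQL text in the editor is untouched while the data is reloaded.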

lakeFS Version

0.96.1

Deployment

Docker

Affected Clients

No response

Relevant logs output

No response

Contact Details

No response

johnnyaug commented 1 year ago

Looks like this was done here: https://github.com/treeverse/lakeFS/pull/4903/. There is a trade-off between performance and data freshness here, and we decided to side with performance. However, I agree that having no way to refresh the data is a problem.
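One possible middle ground between performance and freshness is to revalidate the cached copy with a cheap metadata check before each query and download the object again only when it has changed. The sketch below assumes the store can report a checksum per object; `RevalidatingCache`, `stat`, and `fetch` are hypothetical names, and this is not a description of what the linked PR does.

```python
import hashlib
from typing import Callable, Dict, Tuple


class RevalidatingCache:
    """Hypothetical cache that reuses data only while the checksum is unchanged."""

    def __init__(self, stat: Callable[[str], str], fetch: Callable[[str], bytes]) -> None:
        self._stat = stat    # cheap call returning the object's current checksum
        self._fetch = fetch  # expensive call returning the object's contents
        self._entries: Dict[str, Tuple[str, bytes]] = {}
        self.fetches = 0     # number of full downloads, for illustration

    def read(self, path: str) -> bytes:
        current = self._stat(path)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == current:
            return cached[1]  # unchanged: reuse the cached bytes
        data = self._fetch(path)
        self.fetches += 1
        self._entries[path] = (current, data)
        return data


# Stand-in for the object store (assumption: a plain dict instead of lakeFS).
objects = {"main/lakes.parquet": b"v1"}
reval = RevalidatingCache(
    stat=lambda p: hashlib.sha256(objects[p]).hexdigest(),
    fetch=lambda p: objects[p],
)

a = reval.read("main/lakes.parquet")  # first read downloads the object
b = reval.read("main/lakes.parquet")  # checksum unchanged, no download
objects["main/lakes.parquet"] = b"v2"  # the object changes elsewhere
c = reval.read("main/lakes.parquet")  # checksum differs, downloaded again
```

In this sketch every query pays for one metadata lookup, but the full download happens only when the object actually changes.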