rilldata / rill

Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code.
https://www.rilldata.com
Apache License 2.0
1.7k stars 114 forks source link

Runtime: Re-ingest local files if they are updated #5479

Closed begelundmuller closed 2 weeks ago

begelundmuller commented 2 months ago

We have gotten several reports about file sources not being updated when the local file is changed. This is especially problematic for mapping.csv files used for access management in security policies.

We need to be careful only to do this for small local files, to avoid accidentally scanning large/remote files (even for computing a hash). Maybe it should be an optional parameter for the file: connector?

begelundmuller commented 2 months ago

I don't have a clear idea about how we might solve this, but here are some thoughts (probably too hacky):

k-anshul commented 1 month ago

We can also consider following alternate solutions if the use case is only for mapping files kind of situations:

  1. Create a non materialised model on top of the file so that every query that reads data from the view reads data from the file. For files with few hundred rows the difference is less than 5ms.
  2. Add a source/model config invalidate_always which always returns true for ModelManager.Exists so that the model is always updated on reconcile. Since these mappings are expected to be small it should be okay to always refresh them.

1 has a limitation that dependencies will not be refreshed. 2 has a limitation that dependencies will always be refreshed.

If we want to solve this for all cases then I think the solution listed above is the only way. Two considerations though :

  1. For clickhouse the local file is actually stored on the clickhouse server so we can't compute its hash. Also given clickhouse sources are always run as models we only have the queries and not the actual path of the file.
  2. I think we also need to store the path of the file in ModelManager so that hash can be recomputed.
begelundmuller commented 1 month ago

We can also consider following alternate solutions if the use case is only for mapping files kind of situations:

  1. Unfortunately they're often referenced in security policies and dimension/measure expressions, which get hit a ton (e.g. when listing dashboards, we check the security policy of each dashboard in the project), so I'd be too worried about the number of disk reads and CSV parses here.
  1. This might be okay, though it should be used cautiously so it doesn't invalidate large downstream models too frequently. However, there's an issue that applies here, which also applies to my suggestion above, which is that if a model file isn't edited, there is no call to Reconcile at all. Reconcile is only called on controller restart or when the resource spec changes.

If we want to solve this for all cases then I think the solution listed above is the only way. Two considerations though :

About 1. here – for ClickHouse, the file/local_file connector should still load the local path on the runtime and upload/write it into a ClickHouse table. Because the goal for the local file connector is using small data files that are checked into Git.

k-anshul commented 1 month ago

About 1. here – for ClickHouse, the file/local_file connector should still load the local path on the runtime and upload/write it into a ClickHouse table. Because the goal for the local file connector is using small data files that are checked into Git.

Sure I think this can be handled separately. I will raise an issue.