CamilleLegeron commented 1 year ago

Hi!

In many cases we need reference data that are the same in all Workspaces of a same Site. For the moment we duplicate each time that reference data in each Workspace, so we have to maintain this master data everywhere. An exemple : the list of French cities. We need it in a lot of Workspaces and it's everytime the same list.

It would be great to have a space in the left panel with these master data directly in the root Site.

We imagine that it must not be an easy feature, but it would be very useful for us

(Maybe there is already an issue that we can't find)

@vviers

Camille

paulfitz commented 1 year ago

Hi @CamilleLegeron, Grist does have a space in the left panel for "examples and templates" that can be set up - here it is in our hosted service:

Screenshot from 2023-02-08 10-53-12

Superficially, that sounds related to what you are asking for, but I suspect you want the reference data to be available within documents, for use in Reference or Reference List columns, is that correct?

vviers commented 1 year ago

Yes, what we would like is "reference data" maintained centrally and made available across a whole Grist site (so that, for example, when french list of cities will change (which it does every year)) we only need to change in one location and all documents referring to this data can benefit from the update.

(PS: this is linked to the email i sent you and Dmitry earlier today)

paulfitz commented 1 year ago

I understand now, thanks.

One feature we have considered is table replication, where a table in one document is copied to many other documents, and then kept up to date. The nice feature about replication is that the documents remain complete databases with clear referential integrity. The not-so-nice feature is the duplication of data. This is a very general purpose feature that could find many uses, including the site-level master data use case but not limited to it.

There are other possibilities. Let me ask you, if a document from a site with site level reference data is downloaded, and opened in another installation of Grist, what should happen? What should actually be stored in the document?

I could imagine doing the following as a thought experiment.

Allow a Grist table to be configured so row ids are never reused, or perhaps are uuids (this is to prepare to refer to it in situations where referential integrity can't be guaranteed).
Allow a site to be configured to have one special "shared" document. Every other document in the site, when opened in the web client, will open with a connection to both the opened document and the shared document, meaning we'll be receiving broadcasts about updates to either.
... here things get tricky. I want to incorporate tables from the shared document in the opened document, but there's a lot of nuances about meshing metadata and ids to think through. And how they might change in future is also tricky. But there might be a way to draw a line.

What do you think @dsagal ?

dsagal commented 1 year ago

One question about requirements: does the shared data sometimes need to be accessed by formulas? E.g. if Cities is a shared table, and another table in the current doc has a reference to a city, would it be important to be able to access e.g. $city. population (as it is for regular tables)? If not, i.e. if only showing data in UI is important, it opens up more options, I think.

vviers commented 1 year ago

@dsagal

One question about requirements: does the shared data sometimes need to be accessed by formulas?

I can see myself wanting to access shared data via a formula. $city.population very much looks like one of our use cases for shared data.

@paulfitz

There are other possibilities. Let me ask you, if a document from a site with site level reference data is downloaded, and opened in another installation of Grist, what should happen? What should actually be stored in the document?

I think I would be OK with shared site data being lost when a document is re-uploaded outside of the site. Maybe an option when downloading a .grist that references shared data would be to write this data not as a reference but as plain data ? We haven't really came across the use case of migrating a doc from one site to another yet, so I can't really wrap my head around whether this behavior would break stuff.

Meshing data and metadata between sites sounds like a very tricky (albeit interesting!) engineering problem that we might want to tackle as a follow-up step to a site-level-only implementation, wdyt ?

CamilleLegeron commented 1 year ago

It seems tricky to avoid duplication while keeping the other features (notably reference, formula etc). In any case today it's data that is duplicated in each document. Maybe forbid the "attachment" type to limit the size of these documents? If the tables of the document are duplicated there is still the problem of synchronization that arises, no ? We should forbid the edition in the duplicated table and trigger a synchronization at each modification of the meta document When exporting the .grist if the data has been duplicated the problem of loss of information does not arise.

CamilleLegeron commented 1 year ago

It seems tricky to avoid duplication while keeping the other features (notably reference, formula etc). In any case today it's data that is duplicated in each document. Maybe forbid the "attachment" type to limit the size of these documents? If the tables of the document are duplicated there is still the problem of synchronization that arises, no ? We should forbid the edition in the duplicated table and trigger a synchronization at each modification of the meta document When exporting the .grist if the data has been duplicated the problem of loss of information does not arise.

@dsagal and @paulfitz any idea or opinion to advance on this subject?

paulfitz commented 1 year ago

I agree that duplicating data, treating it as read-only within a "subscribing" document, and establishing a method for keeping it up to date is overall simpler than many of the alternatives - and meets the need you have.

It relates somewhat to incremental imports. In an ideal world, Grist could keep part of a document up to date with information imported (and re-imported, perhaps automatically) from some outside source (such as a URL). Viewed that way, your feature is "just" a special case of that.

What if the existing import mechanism were improved in the following ways:

An option was added to import from another Grist doc on the site.
An option was added to save import settings in a way whereby they could be repeated incrementally with one click.
A setting was added whereby, if a doc has a one-click import from another doc, and the data in the other doc was detected as freshly changed, the import would rerun either automatically or by prompting an owner.

It isn't quite the same idea, but it could push Grist in a nice direction, rather than being more of an unintegrated feature. The downside is it wouldn't be as smooth to begin with as something very special-purpose could be. Imports are currently somewhat clunky.

vviers commented 1 year ago

hi @paulfitz !

Bettering the import experience is definitely very high on our list of nice-to-haves, and the steps you outline make a lot of sense to me !

One feature that we would like to add to the plan you outline is the option to import several linked tables from another Grist doc on the site (in the main use case we have in mind for this feature, we would like to keep all docs up-to-date with cities, metropolitan_areas, and regions tables with many reference columns linking them together)

Again, I'm wondering whether this feature is too "core" for us to work on, and whether, if that's the case, Grist would be interested in us financing these developments by some devs from your team ?

We can start dipping our toes in the import code and let you know how it goes, would a good first step be to tackle the import from another Grist doc on the site ? How would permissions play into this ? My gut feeling is that only an admin of both docs (or maybe only a site admin to start with) should be able to perform this operation.

dsagal commented 1 year ago

Hi @vviers! I agree with the concern that this feature may be too "core", in the sense that it would be more efficient if we worked on it directly. In particular, the import code is complicated and may require refactoring as part of this project, and the related aspects would be connecting several different areas of the product. Let me connect with you directly to discuss options.

fflorent commented 12 months ago

Hi there!

I am planing on implementing that feature.

In order to ease the development and also in order to leave control over the data being fetched, we would go to the solution proposed by @paulfitz with an action from the user required to refresh the data.

UX

Import from scratch

NB: this is my proposal after discussing about this feature. @vviers and @CamilleLegeron may also give feedback about this.

Regarding the user journey to create a data source synchronized from an external document (aka synchronized data source), I would see this:

The user clicks « Add new » → « Add page » and is offered this view: select a widget with external data source

(at least with more cosmetic :sweat_smile:)

They are offered to select a document from their own instance.

About this idea of entering a table URL, that should lead to the same instance or to a public document, I am not sure how difficult it would be to implement that, I would suggest to implement it later if we want to.

Once the user submits, the data are fetched from the external document and put in a new table and a new data source.

Then the user can see the table with synchronized data source: left menu with list of tables

And the synchronized data source itself:

Both data source list and table list would offer another context menu to synchronize the data source: synchronize action in context menu

And I would suggest that also a click to the 🔁 icon would trigger that action.

When they trigger this action, the user is offered a popup to review the changes and validate if they want to: 2023-07-20_15-51

Import into an existing table

To be done

Forget / break the link

To be done

vviers commented 12 months ago

Looks promising to me at first glance ! :)

The user clicks « Add new » → « Add page » and is offered this view:

I am unsure whether External Table should be offered as a new type of widget alongside Table. I think conceptually the option of picking an external data source should pop up in the "Select Data" menu once you've selected a widget type maybe ?

On the other hand, making External Table a special type of Widget would allow for implementing a more dedicated interface in the options panel to the right, which might be what we want 🤔

paulfitz commented 12 months ago

Could possibly have an option to break/forget the link to an external document, in which case the feature could help with https://community.getgrist.com/t/move-table-to-another-document/2888/8

fflorent commented 11 months ago

Could possibly have an option to break/forget the link to an external document, in which case the feature could help with https://community.getgrist.com/t/move-table-to-another-document/2888/8

Sure break / forget the link is an interesting feature!

I am unsure whether External Table should be offered as a new type of widget alongside Table. I think conceptually the option of picking an external data source should pop up in the "Select Data" menu once you've selected a widget type maybe ?

That's true! ~~However, regarding the UI, I feel hesitant with this option, as it would suppose adding a third panel to the right which would appear when we select "External table". That might be somewhat OK, IDK, I would love to have the feedback from a UX/UI expert about this.~~

Edit: actually there is a third panel for filtering / sorting data. I will scratch my head a bit more to see if I can make a good alternative proposal based on your remark.

On the other hand, making External Table a special type of Widget would allow for implementing a more dedicated interface in the options panel to the right, which might be what we want thinking

Not really a widget, IMHO. Also a widget is a special view rendering data from a data source. In our case, it's rather a special data source. Which leads to make consider the below interrogation :sweat_smile:

Then the user can see the table with synchronized data source:

Reconsidering this option I proposed, I am less sure. This is a list whose items are composed of 1 or several views, whose source can be various. Or it may hint that this item contains at least a view linked to a synchronized data source? I feel like it could possibly make sense.

paulfitz commented 11 months ago

There's another place where the external table could be brought in - with the other import options: Screenshot from 2023-07-25 14-15-13 Those import options lead to an import process that can either create a new table or update an existing table. @fflorent what if we supported Import from another Grist document along with the other import options? Even without the icon for later one-click sync, that would be immediately useful. And, when the one-click-sync feature exists, it would be great if it could also apply to imports like Import from URL.

A downside is that Grist could potentially do a much higher fidelity import from another document, with e.g. conditional formatting and the like, but that might be hard to organize within the generic import mechanism (@georgegevoian might know more about the possibilities here).

fflorent commented 11 months ago

Those import options lead to an import process that can either create a new table or update an existing table. @fflorent what if we supported Import from another Grist document along with the other import options? Even without the icon for later one-click sync, that would be immediately useful. And, when the one-click-sync feature exists, it would be great if it could also apply to imports like Import from URL.

I like the idea!

but that might be hard to organize within the generic import mechanism

When you mean that, do you mean technically or regarding the UI?

paulfitz commented 11 months ago

but that might be hard to organize within the generic import mechanism

When you mean that, do you mean technically or regarding the UI?

I mean technically - but I could be wrong, I haven't looked closely. For the UI, I'd guess that having things "just work" would be the expectation (all formatting and options that are practical to copy/update would be copied/updated), so no particular controls needed.

fflorent commented 11 months ago

I created a pull-request in my own repository with some changes to the initial proposal: https://github.com/fflorent/grist-issues-proposals/pull/1

With the hope it will be more convenient to review / comment it.

Feedback highly welcome! :pray:

alexmojaki commented 11 months ago

Personally I was agreeing with the discussion here (which I'm only reading now, sorry) until it pivoted to using imports. Imports are complicated, both in the backend and the frontend. I imagine most users who want this kind of feature would much prefer to have the data stay up to date automatically (even instantly) rather than the flexibility of imports.

Here's what I think is required for syncing, and it doesn't sound too bad:

When the 'master' document is open, and a 'source' table is modified (i.e. a doc action is applied to that table) broadcast that doc action to open documents with that table
Doc actions on the table/column metadata tables which affect the source table are broadcasted in some special way where the refs are translated to string IDs.
When a subscribing document is opened, source tables are loaded from the master document and data is fully replaced
Edits to subscribing tables are forbidden, maybe reusing how access rules work but forbidding everyone regardless of access.
Attachment columns in source tables are forbidden, at least initially, because that adds complications.

paulfitz commented 11 months ago

2. Doc actions on the table/column metadata tables which affect the source table are broadcasted in some special way where the refs are translated to string IDs.

In principle, as a first pass, you could skip steps 1 and 2, right? And have non-live syncing, where a sync could be triggered by reopening the document (or perhaps instead clicking a button).

Looking at https://github.com/fflorent/grist-issues-proposals/blob/bcc569b8def1063c19c7bd4c466ed38490df9572/issue-416/design.md I see the mapping parts of the import UI skipped over, which seems reasonable.

The part for showing diffs prior to import does line up with a part of the existing import mechanism that has an ugly implementation currently. It could perhaps also be skipped on a first pass?

Not sure if the options are very different on an unambitious first pass, but yes by nudging towards imports I am hoping for some fairly general purpose improvements to incremental imports in a second pass. It could be wrong headed, since Grist documents are indeed a special case.

There's a complete other direction to push this, where the inserted table is seen as the result of some workflow, like any workflow that generates a table (e.g. a bit of pandas code), and we double down on workflows. Pretend I never mentioned that, to cut down on the options :-).

dsagal commented 11 months ago

Related wishes Let me share something related from another customer who has built up sophisticated processes in Grist. They want to clone parts of documents, for several purposes (they already do it with external scripts, but it's a lot of work):

Reference data, as here.
Snapshots of data for reports or analysis. Here live updates are undesirable, but a manual sync is. Also, here it's common to only want to sync a subset of columns. The destination may include additional columns, for notes or calculations, which should be preserved during sync.
When a synced table contains a reference column, if the referenced table is also synced, they want to keep the column as reference column; otherwise, they want to use the display value only.

Another thing they want is something like "materialized views" like in Postgres, which become a Grist table, again with an option to sync (like REFRESH in Postgres). This is very much like @paulfitz's note about syncing the result of a workflow that produces a table.

In regard to frequency of updates, they want a way to sync manually, or on a schedule. It is less common for them to want to sync automatically because the use case is primarily in having one frequently-updated document with raw data, and smaller documents that are derived from it for reporting, which should not change unpredictably.

Common Layer To tie this to the discussion above: Grist's imports are certainly related but I agree with @alexmojaki that Grist's import mechanism is too complicated, partly to support interactions not necessary here, partly because of its messy history. Rather than building syncing on top of existing imports, I think we need a simpler common layer that would be useful for both syncing AND imports.

What I mean by common layer is something close to the add-or-update REST API. Imports already use somewhat similar patterns for _mergeAndFinishImport and _getTableComparison, but they mix in column mappings and transforms, making it hard to reuse that code.

What I'd hope for is a common layer that takes a source table, a compatible destination table (contains at least all columns of source table), and a config (specifying which columns identify records, and a few flags like for add-or-update), and can produce a diff and an update to the destination. The hard parts to work out will be things like handling column types and options, and rendering diffs/previews. I think it's important to come up with a simple layer that can be stable, and efficient, and easy to build on top of.

Then other functionality can use this layer: syncing Grist tables fits it naturally; imports build on top of it with column mappings, transforms, and live revisions (i.e. we could refactor imports some day to use it); and tables derived from workflows (e.g. pandas output or SQL queries) could use the same layer to get materialized as Grist tables.

alexmojaki commented 11 months ago

Something (potentially important, or maybe not?) that isn't handled by imports or add-or-update is records being deleted from the source table.

fflorent commented 11 months ago

I made a POC (using the API and what I developed in PR #605). Here is what I currently get in this video:

https://github.com/gristlabs/grist-core/assets/371705/cf9bfb9b-082b-43bc-9042-64a2e4f45b83

Here is the code (dirty as it's in a POC state): https://github.com/fflorent/poc-grist-codegen/blob/ff01f3d6e4ae0b27e2a215e149c9529cb454f38a/synchronize.ts

In terms of AC (if you are not comfortable with some wording, take a look at my glossary and/or don't hesitate to suggest changes):

we create a new table and synchronize it to an external table (AC1);
the synchronization is made when the user requests it (AC2);
the columns of the synchronized table are updated/created/deleted so it reflects what exist in the external table (AC2.1);
the records of the synchronized table are updated/created ~~/deleted~~ so it reflects what exist in the external table (AC2.2);
A user can convert an existing table to a synchronized one (AC3), however the reference of the external table (i.e. doc Id + table Id) is not yet stored in the synchronized table;
If another regular table references a column of the table to synchronize, the reference is kept intact (AC3.1)

Here is what the POC does not do yet and will unlikely be hard to implement:

AC7 - MH: Attachment columns are not retrieved from source tables;
AC3.3 - SH: If the external table references a column of another regular table, the reference column is converted to a regular one (type: Ref:... ⇒ Any) in the synchronized table
AC8 - MH: The hidden columns of the external table are synchronized too;

Here is what the POC does not do yet and needs more work and thoughts:

delete the records of the synchronized table that are missing in the external table (part of AC2.2);
AC3.2 - SH: If the external table references a column of another external table, the two tables are synchronized and the reference continues to work in the synchronized document
Edits to synchronized tables should be forbidden (AC6);
:warning: Big tables cannot be synchronized due to payload size limits (I get 413 Payload Too Large)
:warning: In order to update the records, I take advantage of a column that stores in our case unique values. However, I have to think how to make it work for the general case. Maybe by storing some uuid for each record (either with or without user intervention)?

Regarding the deletion of records (AC2.2) + AC 3.2, what is currently missing is the IDs of the records added or updated. I will open an issue to suggest some changes to useractions.py and the API for that.

Also I don't intent to implement all the diff-related features (AC 2.5, 2.6, 3.4, 3.5) yet as per Paul suggested

paulfitz commented 9 months ago

[@fflorent those are intriguing commits referencing this issue up there that github is snitching on you about :-)]

fflorent commented 9 months ago

@paulfitz Yes, don't worry, we will probably talk orally a bit more about this feature next week, so we can frame this feature :).

BTW, have you taken a look at the video from my last comment (9th of august)?

fflorent commented 8 months ago

Following our discussions last week, here are some notes:

Additional vocabulary

Due to the misunderstanding, I propose to differentiate 2 steps related to the synchronization:

the configuration: some user have to tell what source a table should fetch its data;
the fetch: some user (not necessarily the one who configured the synchronization) tell Grist to fetch the columns and the data of the source table and inject them in the synchronized table;

Additional Criteria

The synchronization should work across different Grist instances;
Cycle synchronization must be prevented (imagine these table synchronizations: A => B => C => A => ... CRASH!)
- Cycle synchronization is not an issue if we opt for manual synchronization;
Synchronization (fetch) can be triggered by user who are not granted rights to read the source table;

Technical considerations

Do we have to only fetch changes of the source table since the last synchronization? No
- It would be hard to implement, also regarding the amount of data we already handle large data transfer with snapshots and we do not have much complains about it;
- Also there is not much drawback if we opt for manual synchronization;
- We could take advantage of the WYSIWYS links: when we want to bind a table to some source (configure the synchronization), we open the source table and generate a special WYSIWIS share link and paste it to the configuration panel of what will become the synchronized table;
- Therefore any user would be able to synchronize (fetch) the table which are already bound to a source, no matter if they are granted access to the source table;
- In order to notify of the source table changes (and therefore invite the user to synchronize the tables), we have to store somewhere the last modification date of a table (such value exist only for the whole document);

Manual or instant synchronization

It appeared to us that manual synchronization would be simpler:

we don't suffer from cycle synchronization;
the implementation would probably be simpler (we worry less about data volume when fetching the source table data);

Also (IMHO) I would say that it would probably offer a better user experience:

the user can review beforehand the changes in the source table and fix the tables referencing the synchronized table;
- e.g. you manage organizations which belongs to certain towns, and you have a synchronized table which references the list of the towns. As a user, if two towns are merging, I would like to decide when synchronizing the towns and fix the organizations tables.
Also probably we could opt for manual synchronization as a first step and later support for instant synchronization, until then there is still room for supporting instant-like synchronization through a bit of scripting;

vviers commented 8 months ago

FWIW I think only users with Edit rights and above should be able to trigger a synchronisation on a table.

Actually, I think I'd feel even better if only users with structure edit rights could trigger synchronisations.

alexmojaki commented 8 months ago

The synchronization should work across different Grist instances

This no longer fits the title of the issue :laughing:

I think it's pretty clear that we should go with 'manual' synchronization for now. In addition to your reasons, I find it harder to imagine implementing instant syncing across instances. This makes me agree with the earlier view to frame this as several smaller improvements to imports, starting with this comment from @paulfitz:

What if the existing import mechanism were improved in the following ways:

An option was added to import from another Grist doc ~on the site~

An option was added to save import settings in a way whereby they could be repeated incrementally with one click [or eventually automatically on a regular schedule]

A setting was added whereby, if a doc has a one-click import from another doc, and the data in the other doc was detected as freshly changed, the import would rerun either automatically or by prompting an owner.

Other features to add:

Changes to Access Rules to allow importing to a table under some conditions without otherwise having the right to edit that table
Allow deleting rows in the target table that no longer exist in the source table
- Handle references to deleted rows gracefully so that users can repair errors in data instead of references silently turning blank

edalcin commented 7 months ago

Hi Folks! I would like to add my case:

I have different projects (and students) that have to use a huge master table, that is a list of plant species in Brazil. With synonyms, it has more than 150K lines. Every specie (line) has a primary key (taxonID).

Therefore, will be great if I don’t have to replicate this table to every document, and better, to every student. In my wonderful and perfect world, I would like to reference columns between different documents as a global “document.table.column”, and better, to have “public” tables, and reference it as @eduardo dalcin.document.table.column in my students account.

gristlabs / grist-core

Make reference Data available at Site Level #416

UX

Import from scratch

Import into an existing table

Forget / break the link

Additional vocabulary

Additional Criteria

Technical considerations

Manual or instant synchronization