okfn / opendataeditor

The Open Data Editor (ODE) is a no-code application to explore, validate and publish data in a simple way. Forever free and open source project powered by the Frictionless Framework.
http://opendataeditor.okfn.org
MIT License

Feature request: Immutable file editing support (via frictionless pipelines) #322

Open khusmann opened 5 months ago

khusmann commented 5 months ago

Hello! I had a great chat with @romicolman today, and she encouraged me to write a few quick paragraphs describing features that would help the ODE see adoption by me (and others in my field).

For my work (in education and the social sciences, but this extends to a lot of scientific research in general), tracking changes to data is very important. It is bad practice to edit files directly; it is much better to record manipulations as steps in a separate file. Right now, this is done via code (here's a great blog post detailing an example research data cleaning workflow using code).

In order for the ODE to be a viable no-code alternative for this, I need a workflow that allows me to edit data with the ODE but keep my original data file unchanged. One way this could be done would be to record all editing steps as a frictionless pipeline.

How this could look: I'd open a CSV file and make changes as usual, but when I save the file, instead of overwriting the original CSV, it would save a new file that would be a frictionless pipeline definition describing how the original CSV was transformed to accomplish my edits.

Then, I could open the pipeline definition at any point in the ODE and it would open as a regular data table, with my entire edit history loaded from the pipeline. At any point, I could also export my data package into a "rendered" version for distribution, where all the pipelines had been run to produce the "final" version of the files in a separate datapackage that I would distribute.
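To make the idea concrete, here is a minimal sketch of the record-then-replay workflow using only the Python standard library. It is not the Frictionless pipeline API: the `cell-update` step type, the JSON layout, and the `record_edit` and `render` helpers are all hypothetical names chosen for illustration, and a real implementation would presumably map these onto actual pipeline steps.

```python
import csv
import io
import json


def record_edit(steps, row, field, value):
    """Append one cell edit to the step list instead of touching the data.

    The "cell-update" step format is a hypothetical one for this sketch.
    """
    steps.append({"type": "cell-update", "row": row, "field": field, "value": value})


def render(source_csv, steps):
    """Replay recorded steps over the untouched source and return new CSV text."""
    reader = csv.DictReader(io.StringIO(source_csv))
    rows = list(reader)
    for step in steps:
        if step["type"] != "cell-update":
            raise ValueError(f"unknown step type: {step['type']}")
        rows[step["row"]][step["field"]] = step["value"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames or [], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# The original data is never modified; only the step list grows.
original = "id,score\n1,10\n2,999\n"
steps = []
record_edit(steps, row=1, field="score", value="99")  # fix a data-entry error

# "Saving" writes the edit history, not a new CSV ("scores.csv" is a made-up name).
pipeline_json = json.dumps({"source": "scores.csv", "steps": steps}, indent=2)

# "Exporting" replays the saved history to produce the rendered file.
rendered = render(original, json.loads(pipeline_json)["steps"])
print(rendered)
```

Reopening the saved definition would amount to loading the JSON and calling `render` on the source again, so the edit history and the original file travel together while the rendered CSV stays a derived artifact.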

I realize this is a big feature request, but I cannot emphasize enough how important this capability is to a large number of potential users!

pschumm commented 4 months ago

Just looking at this now for the first time, but I completely agree with Kyle on this. A "no code" editor will undoubtedly be very useful for a lot of researchers (especially those in the basic or preclinical sciences, who often manipulate and analyze their data using things like Excel, GraphPad, etc.). Still, it would be very valuable to build in some guardrails that (1) prevent them from getting into trouble (e.g., not being able to replicate something they did); (2) yield a process that can be shared and/or plugged into existing systems for reproducible, shareable work; and (3) facilitate their learning and, in some cases, their eventual migration to programmatic work.

Kyle's suggestion might not be too much work, depending on how the editor is implemented (I haven't had a chance to look at the code yet); it would probably be pretty straightforward to record editing actions one at a time to a file. One limitation is that the resulting code would be very difficult to read (i.e., you could only read it one line at a time, and it would be difficult to get an overall sense of what was being done), and it would be very brittle. Nonetheless, this could be a very valuable feature, and there is certainly precedent for this type of thing.

For my part, the most important thing would simply be tight but automatic version control of each version of the file as it is being edited. For relatively small files (and those are, I presume, the most common use case for this type of editor), using git in the background might be adequate here.
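As a rough sketch of what "git in the background" could look like, the helper below commits the working folder after every save. It assumes the `git` command-line tool is installed; the `snapshot` function name and the `ode-autosave` identity are hypothetical, and this is not how the ODE currently works.

```python
import pathlib
import subprocess


def snapshot(repo_dir, message):
    """Commit every change in repo_dir; return True if a commit was made."""
    repo = pathlib.Path(repo_dir)

    def git(*args):
        # Pass a throwaway identity so commits work without any global git config.
        result = subprocess.run(
            [
                "git",
                "-c", "user.name=ode-autosave",
                "-c", "user.email=ode-autosave@example.invalid",
                "-c", "commit.gpgsign=false",
                *args,
            ],
            cwd=repo,
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout

    if not (repo / ".git").exists():
        git("init")  # first save: turn the folder into a repository
    git("add", "--all")
    if not git("status", "--porcelain").strip():
        return False  # nothing changed since the last snapshot
    git("commit", "-m", message)
    return True
```

An editor could call something like `snapshot(project_folder, "Edited scores.csv")` from its save handler, which would give every saved state a recoverable version without the user ever seeing git.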