qcif / data-curator

Data Curator - share usable open data
MIT License

Support for tableschema metadata #1003

Closed JDziurlaj closed 4 years ago

JDziurlaj commented 4 years ago

We can now import a table-schema without any associated data (via #852). This is very helpful. One of the key motivations of that issue was to allow a user to validate their data against a published table-schema before passing it on. However, while the package generated by data-curator does provide the table-schema used by the data provider, it is changed somewhat from what was originally imported via import column properties. Furthermore, any metadata in the table-schema is removed.

This is a problem for us as we expect our format to evolve over time and thus we need to know which version of the table-schema was used in order to calibrate our ingestion routines.

Desired Behavior

The most obvious solution would be to support the metadata described by the frictionless folks. I don't see any need to allow users to edit it within the tool, it is enough that it is preserved from import to export.

The second approach, which would also be acceptable, would be to hash (preferably via SHA1) the incoming table-schema (from import column properties), and emit this as a key under the table-schema section of the data-package. This would at least allow us to correlate the version that the user imported to our repository of table-schemas.
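A minimal sketch of this second approach, assuming the imported table-schema is available as a parsed JSON object. The key name `importedSchemaSha1` and both function names are hypothetical; they are not part of Data Curator or the frictionless specs. The schema is serialised in a canonical form before hashing so that key order and whitespace do not change the digest.

```python
import hashlib
import json


def schema_sha1(schema: dict) -> str:
    """Return the SHA-1 hex digest of a canonical JSON encoding of the schema."""
    # Sorted keys and fixed separators make the digest independent of
    # key order and whitespace in the imported file.
    canonical = json.dumps(
        schema, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def attach_schema_hash(resource: dict, imported_schema: dict) -> dict:
    """Return a copy of the resource with the digest recorded under its schema."""
    schema = dict(resource.get("schema", {}))
    # Hypothetical key name, chosen for illustration only.
    schema["importedSchemaSha1"] = schema_sha1(imported_schema)
    return {**resource, "schema": schema}
```

If the digest has to match a hash computed over the published file itself, the raw file bytes would need to be hashed instead of the canonical form used here.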

ghost commented 4 years ago

Hi @JDziurlaj The issue of keys sitting outside of the schema in the exported datapackage.json has been raised in issue #972. Is this part of the problem? It may be that there are previous and more recent additions that we have not incorporated yet - please let me know what you believe the significant differences are - this will help in considering the upcoming priorities for this release (the caveat being that our timeframes are somewhat limited). I think that what you are raising here is similar to issue #987. If so, perhaps we could collapse this issue into that one, and you could specify further there. If I've misunderstood, and they are separate, please let me know.

JDziurlaj commented 4 years ago

#987 seems to already be implemented, as you commented. I think the confusion comes down to the adage that "one man's data is another man's metadata". The table-schema is metadata about what the user is providing. What I am requesting here is another level above that: metadata about the table-schema itself. I can't speak to #972; we do not currently use foreign keys.

ghost commented 4 years ago

Hi @JDziurlaj

At the moment we're shortlisting potential candidates for this release cycle and I think there are ideas here that I'd like to raise with our sponsors as worth consideration. Please correct me on any of the below:

I think a hash and reporting of version (if there is not a 'right way' to show this already in the frictionless spec) are straightforward enough to do and something that I can raise with sponsors.

If I've understood, there's a second issue here about the persistence of some properties from the original import to export. I'll look through to see what happens to existing metadata from import to export again, but if you had some examples of data that show this to post here, that might help speed things along for working out the effort required. I can certainly look at where we might be able to lock things down further or perhaps not overwrite certain properties.

JDziurlaj commented 4 years ago

Hi @mattRedBox,

You captured the essence of our request. A minor correction would be that metadata is lost whether or not the properties are 'locked'. Here is a Gist showing a table-schema prior to import.

There is metadata such as created, lastModified, version, etc. that has been removed from the data-curator "packaged" version.
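As a rough illustration of the loss described above, the sketch below compares the top-level keys of an imported table-schema with those of the exported one, and copies back anything that was dropped. The function names are hypothetical and this is not how Data Curator handles schemas internally; it only assumes both schemas are available as parsed JSON objects.

```python
def dropped_keys(imported: dict, exported: dict) -> list:
    """List top-level keys present in the imported schema but missing on export."""
    return sorted(key for key in imported if key not in exported)


def restore_metadata(imported: dict, exported: dict) -> dict:
    """Return a copy of the exported schema with dropped top-level keys restored."""
    merged = dict(exported)
    for key, value in imported.items():
        # Keys the tool already wrote keep their exported values;
        # only keys that were lost are copied back from the import.
        merged.setdefault(key, value)
    return merged
```

Run against a schema carrying created, lastModified and version, `dropped_keys` would report exactly those keys if the export kept only the properties the tool manages.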

ghost commented 4 years ago

Hi @JDziurlaj Ah, thanks, now I see. Yes, I think it was a deliberate implementation a while back to go one way or the other - we stayed with what we thought was conservative at the time, to just stick with frictionless properties that we could manage without introducing potential risks. However, there may be more recently implemented frictionless properties which we're not fully capturing yet - I'll have a look and see. I have this issue in our next milestone (it sits under the major tasks to do, which are #987 and #986), so I may be able to get to this, perhaps by introducing a toggle to turn this behaviour on/off, i.e. keep existing properties and write them out afterwards. You'll note that we also have #988, which I think ties in well with this idea of having other properties. Not sure if I'll have time to get to all the ideas above (we tend to throw as many ideas as we can into the milestones and just work through in priority until we have chewed up our time allocation), but I think I can get to introducing the means to show the frictionless schema versions used in Data Curator. Did you have any thoughts about how/where we could/should display these schema versions?

ghost commented 4 years ago

Hi @JDziurlaj And thanks for this gist. Do I have your permission to use it, or parts of it, in implementing tests that I might add to the application?

JDziurlaj commented 4 years ago

Did you have any thoughts about how/where we could/should display these schema versions?

I would put it under the Table sidebar, much like Package has a version.

Yes, you may use the Gists, they are publicly available.

ghost commented 4 years ago

Ok thanks @JDziurlaj Yep makes sense - I'll try to use a label that makes it clear though that this is (a read-only?) frictionless schema version, rather than the version assigned to the metadata by the user. Help text too might be useful.

ghost commented 4 years ago

Looking to combine some of this work, the non-Data-Curator properties persistence, with: #988.

ghost commented 4 years ago

Hi @JDziurlaj I have been able to include the up-to-date frictionless libraries underneath, but there are a couple of properties (like date) that we don't explicitly add yet to Data Curator. Because of this, I think putting a version might give people the wrong idea about what is and isn't present in the displayed properties. As we were already doing work on #988, and because our sponsor was particularly keen on this idea (having custom properties), I thought this might be a way to satisfy at least one of the issues raised here (having existing properties that didn't propagate through). So following on from this idea, a user can now go into the 'Preferences' menu and specify:

JDziurlaj commented 4 years ago

So to be clear, you are saying that custom properties are or are not maintained through the Import Package/Column properties pulldown function?

ghost commented 4 years ago

Hi @JDziurlaj Yes, custom properties are maintained (with updates) in the latest beta release. At the moment, the catch is of course:

I've lodged an issue to address the use of 'toggling' certain conventional/expected behaviours. It could be that we add to this list, say, a preference toggle that: