secretGeek / csvz

The hot new standard in open databases

Make the spec more meta #14

doekman opened this issue 4 years ago (status: Open)

doekman commented 4 years ago

I think the specification is now too broad.

Imagine the spec only says there can be meta tables for csv, tables, columns and relations (specifying only their names). Tool makers can then come up with profiles that describe which data goes into which meta-tables, and what the semantics are. These profiles can then be registered and published in this repository. There could be discussion, of course.

You could have an ANSI-96-SQL-IMPORT-EXPORT profile (I hope they come up with a better name). But you could also have a MY_OPEN_SOURCE_FORM_APPLICATION profile that describes how data on forms is validated.

The advantage for tool makers (implementors): you only implement what is actually being used. I do think there should be a csvz implementers forum.

So I propose a more minimalistic base specification, with extensions as profiles. The _meta/csv.csv profile can be built in and be called the "localized" profile (if you want to call it that).

What do you think?

secretGeek commented 4 years ago

as always i am certain that your idea is good and valid and helpful.

the first thing i know to say yes to is having a "csvz implementers forum." How do we do that?

i created a wiki and asked there, in case potential forum joiners look there. i'll tell people in the contribution guidelines to join as well. https://github.com/secretGeek/csvz/wiki

i am having trouble knowing how to translate this into changes to the specification.

does this mean we would take out existing sections (which ones?), or does it mean we would change existing sections (which ones, and what sort of changes)?

it should be assumed that these locations are reserved:

_meta/*.csv
any *.csv included anywhere recursively under _meta
all folder names under _meta are reserved unless otherwise specified

These are the files that are currently partially defined (all of which are optional); a small detection sketch follows the list.

_meta/tables.csv
_meta/columns.csv
_meta/relations.csv
_meta/csv.csv 
_meta/datatypes.csv
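
as a rough illustration only (nothing here requires this), a reader could detect which of these optional files a given archive actually ships using something like the python sketch below; the function name and the example archive name are made up.

```python
# Sketch: list which of the optional _meta files are present in a csvz archive.
import zipfile

OPTIONAL_META_FILES = [
    "_meta/tables.csv",
    "_meta/columns.csv",
    "_meta/relations.csv",
    "_meta/csv.csv",
    "_meta/datatypes.csv",
]

def present_meta_files(path):
    """Return the subset of the optional meta files found in the archive at `path`."""
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    return [name for name in OPTIONAL_META_FILES if name in names]

# e.g. present_meta_files("export.csvz") might return
# ["_meta/tables.csv", "_meta/columns.csv"]
```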

Maybe there should also be a _meta/meta.csv, which describes which of the spec fragments you claim conformance with, e.g.

fragment;conformance;notes
csvz-0;strict;this csvz file claims strict adherence to csvz-0
csvz-meta-tables;strict;yes we have a _meta folder with a tables.csv file in it.

(If they only claimed those two rules were followed then it would be up to the consumer to read the files and determine for themselves how to make sense of them.)
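
to make that concrete, a consumer could read such claims with something like the sketch below; keep in mind _meta/meta.csv is only a "maybe" at this point, and the semicolon-delimited fragment;conformance;notes layout is just the example above.

```python
# Sketch only: _meta/meta.csv is a proposal, not part of the current spec.
# Assumes the semicolon-delimited fragment;conformance;notes layout shown above.
import csv
import io
import zipfile

def read_conformance_claims(path):
    """Return {fragment: conformance} as claimed in _meta/meta.csv, or {} if absent."""
    with zipfile.ZipFile(path) as zf:
        if "_meta/meta.csv" not in zf.namelist():
            return {}
        with zf.open("_meta/meta.csv") as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"), delimiter=";")
            return {row["fragment"]: row["conformance"] for row in reader}

# A consumer could then decide how much to trust, e.g.:
# claims = read_conformance_claims("export.csvz")
# if claims.get("csvz-meta-tables") == "strict":
#     ...expect _meta/tables.csv to be present and well formed...
```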

Coming soon: we'll also likely propose that these paths be added, allowing for per-file meta files.

_meta/tables/*.csv
_meta/columns/*.csv
_meta/relations/*.csv
_meta/csv/*.csv

What if there was a way for tools to reserve a name under _meta/tools/?

For example, if someone reserved "to-sql-server" then they would have reserved the right to define the naming, structure and meanings of all files/folders that go in this location:

_meta/tools/to-sql-server

Scalability problem: what if there are too many tools?

then we'd use software to solve that problem.

Or guids. (Guids are better)

Once they've registered a name (or a guid) they can then publish standards that someone generating a file can claim to have conformed with.

e.g.

csv-tools-to-sql-server--indexes would be a spec fragment defined by whichever repository is the official owner of the to-sql-server tool (thus they get ownership of the csv-tools-to-sql-server--* spec fragments.)

So if I've stored a csvz file that i think the "to-sql-server" tool can read, and i want it to have very specific indexes built when it is converted to sql server, then i'll write the expected files in the csv-tools-to-sql-server--indexes folder.
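
as a sketch of how a consuming tool might look up its reserved corner of the archive (the "to-sql-server" name and folder layout are just the hypothetical example above, and the exact naming is still up for discussion):

```python
# Sketch of the reserved tool-namespace idea; names are the hypothetical
# "to-sql-server" example from this thread.
import zipfile

TOOL_PREFIX = "_meta/tools/to-sql-server/"

def tool_entries(path):
    """Return the archive entries that fall inside this tool's reserved folder."""
    with zipfile.ZipFile(path) as zf:
        return [name for name in zf.namelist() if name.startswith(TOOL_PREFIX)]

# A producer that wants specific indexes built could drop e.g.
# _meta/tools/to-sql-server/indexes.csv in there; only the to-sql-server
# tool (per its own spec fragment) is expected to interpret it.
```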

An alternative is that the concept of "tool" can be replaced with "vendor". (This all reminds me of vendor prefixes in css standards.) i prefer "little" tools and "little" spec fragments to "big" vendors. the littler the better.

MarkPflug commented 4 years ago

So I propose a more minimalistic base specification, with extensions as profiles.

I'm not sure I know what you mean by this. Do you mean more minimalistic than it already is? Or do you mean don't specify any more than is already specified?

Personally, I think if the columns specification listed the expected set of supported types, then the spec would cover the 99% use case: export data out of one database and import it into a different one while preserving the rough schema. Extensions could remain unspecified; as long as the structure of the files covered by the csvz spec was correct, any tool could read it.

Consider an example: Tool A handles several custom profiles. A file created with tool A is then processed by tool B, which doesn't know anything about the custom profiles. Should tool B be responsible for preserving the metadata from those profiles? It would require some specification for how a profile defines what/how its custom metadata needs to be preserved. That sounds hard to specify, let alone implement. Certainly B cannot produce/provide metadata for the unknown profiles, but it could conceivably preserve them.
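
One cheap approach, purely as a sketch and not anything the spec currently says: tool B could copy the _meta entries it does not recognize through byte for byte when it rewrites the archive.

```python
# Sketch: carry unknown _meta entries through untouched when rewriting a csvz.
import zipfile

def copy_unknown_meta(src_path, dst_zip, known_entries):
    """Copy every _meta/* entry not in `known_entries` from src into an open dst archive."""
    with zipfile.ZipFile(src_path) as src:
        for name in src.namelist():
            if name.startswith("_meta/") and name not in known_entries:
                dst_zip.writestr(name, src.read(name))

# Hypothetical usage: after tool B writes its own tables/columns metadata to
# dst_zip, it passes those entry names as known_entries and everything else
# (the profiles it doesn't understand) is preserved verbatim.
```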

Maybe consider a concrete example: let's say I want my csvz tool to carry custom metadata about the indices that existed in my database. How would that be handled by a profile? How would the abstract concept of a profile be specified in this context? How would the concrete "indices" profile be specified?

Not that I disagree with this idea, I just think the beauty of csvz is in the simplicity. I worry that the concept of profiles starts down the road of making something too complex for anyone to bother implementing.

doekman commented 4 years ago

That's quite a reply. Not sure if I'm able to reply to all of it right now.

An implementors forum serves a couple of purposes. Visibility into what is being worked on is an important one: it's nice to know what people are working on, so you know what to expect in the near future. It should also be a place where you can ask questions and discuss them. For example: if a database stores a NULL value, how do you store that in a CSV file and handle it within csvz?

I myself would like a tool to load/save data to/from a database from the command line (and from a makefile). I still have a lot of questions; not sure if I want to do this as a side project. I put my thoughts on the implementors forum page.

I think the core csvz specification would be the part under csvz-0. There are still some additions to make: you need to say things about the _meta folder. And does it need to exist within a zip at all? (thinking of git)

The other parts are "guidance" to the implementors, and can later be used

I just don't want this specification to be a theoretical exercise. I really hope some nice tools will evolve from it. This spec should be a framework for tool implementors, not a todo-list.

You have a lot of ideas about this framework (the _meta/tools folder), but maybe it's better to wait and see whether an implementor needs something, and invite them to discuss those needs here on this forum.

One more thing about the profiles: each profile describes which meta-files and meta-columns it needs. For example, the database-import-export profile and the ms-access-clone profile both use the file _meta/tables.csv. Some columns overlap, but some are specific to a profile (e.g. a control-type column for the ms-access-clone profile).
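
Purely to illustrate the idea (none of these column names are specified anywhere yet), a profile might amount to little more than a declaration of the meta files and columns it needs, which a tool could check against what an archive actually provides:

```python
# Illustrative only: hypothetical profile declarations using the example
# profiles from this comment; the column names are assumptions, not spec.
PROFILES = {
    "database-import-export": {
        "_meta/tables.csv": ["table", "name"],
    },
    "ms-access-clone": {
        "_meta/tables.csv": ["table", "name", "control-type"],  # control-type is profile-specific
    },
}

def missing_requirements(profile_name, available_columns):
    """Compare a profile's declared needs with the columns actually present.

    available_columns: dict mapping meta file name -> list of column names found.
    """
    missing = {}
    for meta_file, needed in PROFILES[profile_name].items():
        have = set(available_columns.get(meta_file, []))
        gap = [col for col in needed if col not in have]
        if gap:
            missing[meta_file] = gap
    return missing
```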

doekman commented 4 years ago

@MarkPflug I didn't read your reply until now. Yeah, I favour simplicity too.

However, the sky is the limit. Say you want to transport avatar images with your csvz file, because they are stored in the database. You could store them as base64 in a csv file, but you could also store them as files like <tablename>/avatar1.jpg next to the csv <tablename>.csv, which contains an avatar column holding the filename avatar1.jpg.

Would you add that to the spec too, or would you put it into a separate profile?
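
Here is a rough sketch of the two options, using a hypothetical users table; the table name, column names and function are all made up for illustration:

```python
# Sketch of the two transport options described above: base64 inside the csv
# column, or sibling image files referenced by filename. Names are illustrative.
import base64
import csv
import io
import zipfile

def write_users_with_avatars(zip_path, rows, inline_base64=True):
    """rows: iterable of (user_id, avatar_bytes) pairs."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(["id", "avatar"])
        for i, (user_id, avatar_bytes) in enumerate(rows, start=1):
            if inline_base64:
                # Option 1: base64 the bytes straight into the csv column.
                writer.writerow([user_id, base64.b64encode(avatar_bytes).decode("ascii")])
            else:
                # Option 2: store the image next to the table and reference it by name.
                filename = f"avatar{i}.jpg"
                zf.writestr(f"users/{filename}", avatar_bytes)
                writer.writerow([user_id, filename])
        zf.writestr("users.csv", out.getvalue())
```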

Anyways, these are just my thoughts.

secretGeek commented 4 years ago

Interesting interesting.

If this was done:

Say you want to transport avatar images with your csvz file, because they are stored in the database. You could store them as base64 in a csv file, but you could also store them as files like <tablename>/avatar1.jpg next to the csv <tablename>.csv, which contains an avatar column holding the filename avatar1.jpg.

... then perhaps that column would be described as having a data type of (I don’t know...) internal file reference... and there would need to be a spec fragment covering what that means (for example, it couldn’t point to a file outside the csvz).

And if you were writing a tool that could read csvz and could write to (for example) SQL Server databases... then you’d decide, for your technology, what’s the most effective way to stick those into the resulting database.

If someone else was writing a “from SQL to csvz” tool... and they grabbed your database and wrote it all to a csvz file... you’d probably hope/expect/not be shocked if those files you stored ended up as file reference types there too. But there’s no real guarantee on that sort of thing. If it succeeds, then I guess you could tick a box that says that pair of tools has the “round tripping” achievement in the csvz tool store.

MarkPflug commented 4 years ago

GitHub Discussions are still coming, right? Enabling Discussions on this repo seems like the easiest solution. Until they arrive, you could just create (and pin) an issue with the title "Discussion"; interested parties could subscribe to that issue.

FWIW, I already have a very rough implementation in .NET (as a library, not a tool) of csvz-0, csvz-meta-tables and csvz-meta-columns (but not relations): https://github.com/MarkPflug/Sylvan.Data.CsvZip

I would imagine avatar images would be carried base64 encoded in a column with type = "binary". That is how my implementation would handle it anyway, assuming it was exported from a binary column in a database.