perseas / Pyrseas

Provides utilities for Postgres database schema versioning.
https://perseas.github.io/
BSD 3-Clause "New" or "Revised" License

datacopy question (and documentation) #198

Open xclayl opened 5 years ago

xclayl commented 5 years ago

I'm struggling to get datacopy to work. Nothing seems to happen. Can you provide an example?

In general, the documentation seems to be targeted to developers of Pyrseas, rather than users of Pyrseas.

Would you be interested in a pull request to move these docs to a Developers sub area and have *.yaml examples for users (tables, foreign keys, columns, sequences, etc) with all the supported attributes?

As a concrete example, this page, https://pyrseas.readthedocs.io/en/latest/column.html, doesn't say that not_null is a boolean, or that "name" isn't a yaml property the way not_null is. There's no way someone could write the yaml based on these docs.

jmafc commented 5 years ago

The API Reference section is indeed mostly for developers, and it is so stated in the opening paragraph. However, there are some sections, e.g., Table, that have sample YAML. The general issue of documentation has been brought up before in #136.

When I started the project, I wrote about it in https://pyrseas.wordpress.com in a more user-friendly manner, and the Overview-Use Cases page links to a couple of those posts, but I agree that it would be better to incorporate that into a tutorial-like section of the documentation. When @rhunwicks and I started discussing what became dbaugment, that also was done on the WP blog; plus I believe Roger suggested using GitHub docs, and much remains there (although it's mostly for developers). The datacopy feature came out of a discussion we had on the PGfoundry mailing list with Josep Martínez Vilà (who IIRC, like you, used SQL Server) around Oct-Nov 2012, but we're no longer using the ML due to lack of interest and PGfoundry problems. Anyway, I do know, because of Josep, that datacopy works, but it's certainly not well explained in the docs.

Before you submit PRs, may I suggest we discuss an outline of the reorganized documentation. Based on what you wrote, the top level would be something like Overview/Intro, User documentation and Developer docs. The current Development, Testing and API Reference sections would go under the latter. The current Installation, Configuration, Known Issues and individual utility pages would be placed under User.

xclayl commented 5 years ago

I like that structure. I'd also add a top-level "Getting Started" page which runs through installing via pip and running yamltodb with a sample yaml file. The single-file/multiple-files option adds complications for the docs, so I'm thinking the majority of the docs should assume multiple-files mode, with every page linking to a page that describes how to convert between the two formats.

xclayl commented 5 years ago

If we move the pages around, we'll probably want to redirect the old URLs to the new ones. This page describes how to do it; from what I can tell, we'd want "Page Redirects", not "Exact Redirects": https://docs.readthedocs.io/en/latest/user-defined-redirects.html?highlight=404 It looks like you log in to ReadTheDocs to make the changes, so I assume you'll do that when it goes live. It might be possible to set them up now, b/c the docs give the impression that redirects only take effect on 404s.

xclayl commented 5 years ago

I've got a start on it here: pyrseasclay dot readthedocs dot io (I don't want it indexed by google)

I'd like to rename the index.rst page to "Overview", and rename the existing overview page to "Features".

There's more I'd like to change, but I don't want to jump too far ahead. What do you think so far?

jmafc commented 5 years ago

IIRC index.rst is so-named because it becomes the home page of the documentation, i.e., so that you can visit /latest/ or /r0.7/ without having to add index.html. However, we can consider the index.rst page the "Overview" page without renaming it. And then overview.rst can be renamed features.rst.

I like the idea of a "Getting Started" and I think it could replace or be based on the current install.rst. Side note: I'd remove the "For development" part from the Summary of that page, but still include GitHub installation instructions for users (somewhere), since sometimes they have problems doing that when trying out a fix (see issue #195, last comment).

I'd probably put the Known Issues after the User Docs or as one of the closing sections in the latter.

I'm not too concerned about the redirects. We can probably do this work in a docreorg branch that will not get published on readthedocs until it gets merged into master. Shortly afterward I would release r0.9 (with support for PG11 and most likely dropping 9.3) which would become the new stable and the old stable would be retained as r0.8.

xclayl commented 5 years ago

Looks like ReadTheDocs doesn't allow the index page to reference itself, so I'm thinking about duplicating some of the content into the Overview/Features page. Not completely sure about this though.

Getting Started is done. I'll move/remove the install page at some point. I'll include the docs for running pyrseas from git, but somewhere in an "advanced" section. I'm thinking it's not common to do, so let's not overload users with details they don't need to worry about.

Agree about Known Issues.

Sounds good about redirects. Definitely easier.

jmafc commented 5 years ago

I've created the docsreorg branch and taken a quick look at the gettingstarted.rst. Looks good although I would remove the Windows command line prompts. If you want to submit a PR to the docsreorg branch, go ahead.

xclayl commented 5 years ago

Do you prefer multiple PRs? I feel there's more to do.

I think a command prompt makes it clearer what users should do. Would a Linux prompt be better, for example:

  user@main:~$ yamltodb --version

or should we make it very simple with just a dollar sign?

  $ yamltodb --version

xclayl commented 5 years ago

Looks like I've updated the PR unintentionally. I'm not a regular PR user, so this is just due to ignorance.

I'd still like to add a Reference section under User Docs, that lists every table, sequence, type, extension, and their attributes.

jmafc commented 5 years ago

I prefer a plain dollar sign prompt because it's used by most Unix and Linux shells (Cygwin too IIRC), and also by VMS :-).

jmafc commented 5 years ago

I think the Getting Started needs to go under User, and Known Issues too. In outline, it would look something like this:

  Overview
  For Users
    Getting Started
    Facilities
    Command Line Reference (dbtoyaml, yamltodb, dbaugment)
    Known Issues
  For Developers
    Development
    Testing
    API Reference

I'm not entirely happy with "For Users". I also considered "How to Use", "User Instructions", "User Information", but none of them struck a chord either. The Facilities section can be created from the "Description" subsections of dbtoyaml.rst, yamltodb.rst and dbaugment.rst, i.e., it's intermediate between the Getting Started and the Command Line Reference, but hopefully avoiding much repetition.

I would also change the Getting Started to introduce dbtoyaml first. Running it against an empty database will output the standard public schema comment and the plpgsql extension, so the user can then either create a table with SQL or edit the YAML output and run yamltodb.
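For reference, against a freshly created (empty) database the output would look something like this (from memory, so the details may vary slightly by PG and Pyrseas version):

  extension plpgsql:
    description: PL/pgSQL procedural language
    schema: pg_catalog
    version: '1.0'
  schema public:
    description: standard public schema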

I'm not sure what you ultimately want to achieve with the Schema Reference section, so please elaborate. I think that instead of repeating much of the PG CREATE statements documentation, it may be preferable to include the sample YAMLs that are already in the API ref section, except that instead of using a JSON/Python dict format, it would be formatted like standard indented YAML and either annotated or described in subsequent paragraphs.

xclayl commented 5 years ago

Agreed, plain $ is better -- for me it removes the text clutter.

I haven't completely decided how to document the yaml, but it will definitely have lots of yaml examples. Some properties, like identity (on a column), probably need some text describing the two allowed strings. How about this: a datatype for each attribute, plus example yaml with the equivalent CREATE TABLE (or SEQUENCE, etc.) for that yaml.

identity (string) 
  allowed values: "always", "by default"

example yaml:
  table mytable:
    columns:
      - mytable_id:      {not_null: true, type: integer, identity: by default}

example generated SQL:
CREATE TABLE myschema.mytable (
  mytable_id integer NOT NULL GENERATED BY DEFAULT AS IDENTITY
);

There's a good chance I'm going to get things wrong with all the possible yaml attributes. Unfortunately I don't know Python, so I'm doing my best to read through it.

Facilities and Getting Started have some overlap in answering, "what's this all about?". I want to point out that given a yaml file, the yamltodb program will alter the database.

Maybe I could change the description sections to be more explicit (show the yaml file, run yamltodb, and then show the output), then make Getting Started more of a Tutorial area (or How To): 1) install components, 2) reverse engineer the db, 3) add a column and deploy.

dbaugment is an example of where I don't know what a tool does. Does it change the yaml files? Does it touch the database directly? I think an example walk-through would clear this up. Similarly, a walk-through with yamltodb (showing the yaml and the command you run) would make things much clearer.

jmafc commented 5 years ago

I will respond later to the above, but I have a quick question: I'm wondering to what extent you have used, or are using, dbtoyaml. The typical way I use Pyrseas in my own work is that I'll CREATE or ALTER (i.e., with SQL) a given table, or sometimes other objects, then run dbtoyaml to generate a new YAML file. Then I run yamltodb to verify the change (in this case, yamltodb should generally produce no output if run on the same database). If modifying a view or function, and sometimes to make simple additions to a table, I may edit the existing YAML directly. Based on your emphasis so far on creating or editing YAML files, it seems you're relying more (or exclusively?) on yamltodb. I don't mean that as a criticism; you can use the tool the way you're more comfortable. It's just that dbtoyaml was created because of what I perceived as a shortcoming in Andromeda, i.e., if you wanted to create some tables, you had to start by editing YAML files, whereas most DBAs would be more familiar with editing SQL scripts.

xclayl commented 5 years ago

It’s all about the yaml ;) I can right-click on a yaml file (a table, for example) and see the git history of all the changes I’ve made by hand from the beginning.

I think this explains things. I’ve struggled to add new things to my yaml, like triggers, because I couldn’t see how to write them. I feel like I’m reverse engineering Pyrseas by using dbtoyaml to find out what the yaml should be.

It’s a way of working that I don’t see anywhere else. I don’t know why, because I think everyone should work this way.

So for the documentation, I’d love to promote this way of working, but I don’t know how you feel about that. Maybe at a minimum we have pages for the different ways we envision users using Pyrseas.

xclayl commented 5 years ago

Microsoft has a product similar to yours, SSDT, and they also don't have pre & post SQL scripts. They went down a route they called the refactor log, which is where I think you might be headed; however, I'd discourage it, because it can't handle everything very well (https://stackoverflow.com/questions/23768919/with-ssdt-how-do-i-create-a-column-with-a-unique-constraint), and it's in a language custom to SSDT. Pre & post SQL scripts are easy to understand and can handle any scenario.

jmafc commented 5 years ago

Regarding overlap between Getting Started and Facilities: I see the former as a quick start guide/how-to (maybe it can be called Quick Start). For example, there's no need for PG install instructions, except for a short "see here". Why? Because Pyrseas only runs under Postgres, so I would expect people who find it to already have some version installed. Python may not be installed, so a brief instruction on installing Py 3 should suffice. Then create a database, create a table with one or two columns in SQL, run dbtoyaml saving the output, edit the YAML to do something simple, like adding a NOT NULL or making a PRIMARY KEY, and then run yamltodb to see the generated output (perhaps with a comment that adding -u will actually update the database). Done.
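In command form, that walk-through would be something like this (the database and table names are just examples):

  $ createdb qstart
  $ psql -c "CREATE TABLE t1 (c1 integer NOT NULL, c2 text)" qstart
  $ dbtoyaml qstart > qstart.yaml
  (edit qstart.yaml, e.g., to make c1 the PRIMARY KEY)
  $ yamltodb qstart qstart.yaml      # shows the SQL that would be run
  $ yamltodb -u qstart qstart.yaml   # same, but also updates the database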

Facilities OTOH is a much more detailed intro, discussing each utility and how they can be used (and perhaps including the more detailed Installation steps first). That's why I suggested taking the descriptive sections of the Command Reference pages. If you prefer, it can be titled Features, but maybe we should put the text together and then come up with a suitable title. This is going to be the meat of the user manual, the rest being more like a reference manual.

jmafc commented 5 years ago

Regarding dbaugment: it's a way to add "standard" objects to an existing database. Suppose you have several tables and you want to record when a row was added, who inserted it, and when it was last modified. Rather than adding those new columns to each YAML table spec, you specify them in a separate file that can be merged with the existing database to generate YAML that is then fed into yamltodb to actually modify it. This can also include trigger functions that will consistently update the "last modified" column on INSERT and UPDATE. It includes some pre-written column specs and functions, so you can choose from them if desired. It was also supposed to allow for denormalization (controlled redundancy), e.g., to add calculated columns, and "history" tables (IIRC), but we never finalized that.

jmafc commented 5 years ago

Regarding promoting yamltodb to the detriment of dbtoyaml: aside from what I mentioned earlier (i.e., the lack of something like dbtoyaml in Andromeda), I think using yamltodb exclusively has other problems. First, it involves knowing and writing YAML instead of SQL DDL. Granted, YAML is easier than XML (as used in Liquibase and other products), but it requires knowledge of correct indentation and of the names Pyrseas chose for PG DDL attributes (they mostly follow the SQL names in lowercase, but not always), and, as you say, it means having to reverse engineer stuff, whereas dbtoyaml gives it to you, nicely indented, in a consistent order, and knowing exactly what yamltodb will expect (because both utilities share the same underlying internal structures). If you handcraft YAML specs, you're more likely to make formatting or naming mistakes, or you'll have to painfully refer to documentation that to a great extent duplicates the PG SQL docs, so who's going to maintain that? Furthermore, some YAML attributes would have to be documented as being usable only on certain PG releases.

I'm not sure if you're aware, but the API reference sections are generated mostly from the actual Python code and its comments. There's no equivalent mechanism to produce detailed YAML spec information.

I think YAML examples are good, and if you want to add more, there's a (near) perfect place for you to look: the unit tests under tests/dbobject. We also have functional tests that, for example, show the progression of creating and altering some simple tables (even using datacopy!), as well as the autodoc and pagila databases. If you run the pagila-schema.sql script against an empty database and then run dbtoyaml, you'll get a pretty comprehensive YAML example.

xclayl commented 5 years ago

I see why you want to talk about dbtoyaml first - it's what you envision users doing.

dbaugment: "you specify them in a separate file that can be merged with the existing database to generate YAML that is then fed into yamltodb to actually modify it" It sounds like you use it in place of dbtoyaml. It works the same, but takes another input file with the augment information.

You make some valid points about maintaining the documentation -- it'll be extra work to keep it up to date. I had a look at JSON schemas as an alternative to documentation, but intellisense in vscode isn't quite there with yaml; it works great with JSON files (.json). So to get this to work well, Pyrseas would have to read .json files in --multiple-files mode. I'm not sure what you think about handling json schemas and *.json files.

jmafc commented 5 years ago

dbtoyaml is akin to pg_dump -s (the former was in fact modeled on the latter). dbaugment is not used in place of dbtoyaml; dbaugment is like an editor script over the output of dbtoyaml, to automatically add stuff (hence "augment") that isn't there. There is a product named apgdiff (see my Schema VC page) which compares the output of two pg_dump -s scripts and gives you an SQL script to bring the two in sync, i.e., it does more or less what yamltodb does (but because parsing SQL DDL isn't easy, apgdiff had limited coverage of PG object types--although I see on GitHub they've been adding to it). With apgdiff, in order to do what dbaugment does, you'd have to edit the second pg_dump -s output by hand or create a sophisticated editor script to do the work. For example, if you have 30 tables and you want to add a modified timestamp column to 20 of them, you'd have to search for each CREATE TABLE statement and insert, before the closing parenthesis, modified_timestamp timestamp with time zone. With dbaugment all you need to do is create or edit a configuration file listing the tables and desired column and then run dbaugment. The output can then be fed to yamltodb to actually add the columns (and any functions and triggers necessary).
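From memory, the configuration file for that example would look roughly like this, listing each table to be augmented (treat the exact keys as approximate; the dbaugment docs are authoritative):

  schema public:
    table t1:
      audit_columns: default
    table t2:
      audit_columns: default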

YAML is compatible with JSON, and although yamltodb -m looks for files ending in .yaml, it can read JSON syntax. However, I would never want to downgrade the input format to JSON only, because YAML is so much easier to read and write: the indentation, no squiggly braces or commas, and much less quoting. Plus, as I said above, the YAML output/input of dbtoyaml/yamltodb follows the pattern of pg_dump -s, so I remain unconvinced that the Pyrseas user documentation has to go into much detail.

As I see it, the user YAML reference documentation should (a) describe a particular PG object, say table, (b) provide a link to the CREATE statement for the given object in the PG documentation, (c) show an example of what the YAML output/input looks like, e.g., as shown here but formatted as indented YAML, and (d) if necessary, provide further information on where the YAML differs from the DDL syntax, e.g., UNIQUE constraints are grouped under unique_constraints and cannot be specified at the column level (with a cross-reference to a section under Constraints - Unique that shows a more detailed YAML example of such a constraint).
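For (d), for instance, a table with a unique column would look something like this in YAML (names invented for the example, and the attribute set is from memory):

  table t1:
    columns:
    - c1: {type: integer, not_null: true}
    - c2: {type: text}
    unique_constraints:
      t1_c2_key:
        columns: [c2]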

xclayl commented 5 years ago

I prefer walk-through documentation (create this file, create that file, run dbaugment, see that you have extra columns/triggers/etc. in the newly created file) more than descriptions. It's more concrete, I guess. But you could have both.

Yeah, I like writing/reading yaml more than json as well. Fewer symbols make it easier to digest.

Documenting the yaml isn't interesting to someone who uses dbtoyaml rather than editing yaml by hand. If you know what you created (a table for instance), then the generated yaml is familiar.

However, I still want to edit the yaml by hand. If I had to choose, I'd prefer a json schema that gives me intellisense in vscode for my yaml files, rather than documentation. I know what a foreign key is; I just want to know how to write it in yaml. Ctrl-space -> oh, it's called "foreign_keys".

What if I were to create a json schema file, would you include it in the project? In particular, would you update the tests to validate against the schema whenever they output yaml? In practical terms, it'd mean having a separate dependency to handle the schema: https://stackoverflow.com/questions/3262569/validating-a-yaml-document-in-python

xclayl commented 5 years ago

I've been working on a JSON schema for the yaml (attached).

pyrseas.schema.schema.json.txt

pyrseas.db.schema.json.txt

It makes it much easier to write the yaml. If you want to have a play:

  1. install Visual Studio Code https://code.visualstudio.com/download, available on windows, mac & linux.
  2. create an empty folder and open Visual Studio Code to the folder
  3. copy the attached two files to the folder and rename them without the *.txt extension (txt was required by github)
  4. create a folder called .vscode in the folder. In there create a file called settings.json with the following contents:
    {
    "yaml.schemas": {
        "./pyrseas.db.schema.json": [
            "schema.*.yaml",   
            "extension.yaml"
        ],
        "./pyrseas.schema.schema.json": [
            "table.*.yaml",   
            "function.*.yaml",   
            "sequence.*.yaml"
        ],
        "http://json-schema.org/draft-07/schema#": [
            "pyrseas.db.schema.json",
            "pyrseas.schema.schema.json"
        ]
    }
    }
  5. install the Red Hat YAML extension for Visual Studio Code.

To try it out, create a file called table.test.yaml and hit ctrl-space whenever you want a suggestion.

(screenshot: VS Code showing yaml completion suggestions)

The attached schema isn't exhaustive, but for a play, it'll do most things you'll be interested in. It covers everything in my personal project and the pagila schema in the tests.

jmafc commented 5 years ago

I'm sorry, Clay, but let's just say I'm too set in my ways. I do have VS on a Windows laptop but it lies mostly dormant. So I think you'll need to explain to me how you envision those two files being distributed and used with the existing Pyrseas code, and how this will affect the documentation. From a distribution standpoint, I presume they would fit into what Python calls package_data (see setup.py), i.e., files that get distributed but are not directly part of the running/executable software. From an end-user perspective, I guess they would be used by people who want to create/edit YAML for input to yamltodb without using dbtoyaml, and who are willing to install something like VS Code to assist with that. Side question: could these "schema" files also be used with Atom or Sublime Text, or with plugins for Emacs or Vim? It seems you also want someone (who?) to write some Python as a validator program that takes a YAML file, whether fully hand-crafted or created with the assistance of the "schema" files, and validates it prior to submission to yamltodb.

As far as documentation, it looks like you would not have what is now the spec-reference section and instead would discuss how to use the JSON "schema" files. I still think we need examples in the end-user section, formatted in YAML, because even if you use dbtoyaml someone may want to understand the details (perhaps to build onto it, as some users have done).

Lastly is the issue of tests. Currently, unit tests use Python dicts, which are compatible with JSON and therefore with YAML. The only tests that generate YAML or take YAML as input are the functional ones, and they all use the same approach: create/alter a "source" database; run pg_dump -s and dbtoyaml against it; use the YAML from the previous step to run yamltodb against a different ("target") database; run pg_dump and dbtoyaml against the second database and compare the outputs to the previous runs; then repeat the process from the third step with an empty YAML to ensure we drop objects in the correct order. I'm not sure where it would make sense to validate against the JSON schema files (but the validation code would have to be written first).

xclayl commented 5 years ago

Just to avoid a potential headache: VS Code is completely different from Visual Studio. I think VS Code is written in javascript, so it's probably totally different from what you have installed.

The idea with the schema files would be to publish them to a website, so there would be no need to distribute them to users. Hopefully there is a free site for schemas or for static file web hosting; worst case would be Amazon S3 hosting from my account.

In theory other text editors could implement this, but with a quick google search I didn't see anything. It seems that JSON files are getting json-schema support first (json-schema is still very young - just a draft).

"It seems you also want someone (who?) to write some Python"

It's a good point about Python coding. I know I won't be able to do this, having had a look around the code. So what I'll ask is whether you could update the tests to validate any generated yaml (the purpose being to validate the schema, rather than the yaml). If you don't have time (I know you've already spent lots of time on this project), then I'd just publish the schemas and add docs for people who want to configure this, and just for VS Code to start. If someone wants to figure out another IDE, then I think it's reasonable that they help with the docs. The downside is that the code and schema could get out of sync; worst case, there will be a GitHub issue to fix it, which wouldn't take me long to do. The schema wouldn't (and I think shouldn't) affect the running of the app.

So no need for spec-reference section. :) But I agree samples are good.

jmafc commented 5 years ago

If you're going to the trouble of creating the files, we might as well distribute them with the Pyrseas package. It's not a big deal, like I said, they will be marked as package_data and deposited in a subdirectory of the install directory. But we could, in addition, host them on https://perseas.github.io/.

Did a quick look for an Atom package and found json-schema among others (there's also YAML Atom-IDE support and others). Sublime Text has Schema Validator, Vim has vison. For Emacs there's flymake-json. I may install the latter, but no promises.

As I wrote before, none of the Pyrseas unit tests currently generates YAML, only Python dicts; the functional tests use dbtoyaml to generate YAML. If the code to validate YAML were written, perhaps the test_autodoc and test_pagila tests could be modified to run the validator against the dbtoyaml-produced file.
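The validator itself would be short. A rough sketch, assuming the jsonschema package were added as a test dependency (file paths and names here are illustrative only):

  import json
  import yaml
  from jsonschema import validate, ValidationError

  def validate_spec(yaml_path, schema_path):
      """Check a dbtoyaml-produced file against a JSON schema file."""
      with open(yaml_path) as f:
          spec = yaml.safe_load(f)    # YAML loads into plain dicts/lists
      with open(schema_path) as f:
          schema = json.load(f)
      try:
          validate(instance=spec, schema=schema)
          return True
      except ValidationError as err:
          print("Validation failed: %s" % err.message)
          return False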

xclayl commented 5 years ago

It'll be easier to describe the file path using a URL, so that it is the same for everybody. Also, if the settings file is committed to source control, a local path to the schema may not be the same for everyone. Lastly, some tools allow the path to the schema to be given in the yaml file itself (a $schema property), which ideally would be a URL.
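For illustration, a spec file carrying such a reference might start like this (the URL is made up, and whether a given editor honors $schema in a yaml file varies by tool):

  # hypothetical hosted location for the schema
  $schema: https://perseas.github.io/pyrseas.schema.schema.json
  table mytable:
    columns:
    - mytable_id: {type: integer, not_null: true}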

OK, let's skip modifying the tests.

jmafc commented 5 years ago

I'm not sure to what "file path" you're referring. Is that a path to a validation file, analogous to what XML uses for validation?

xclayl commented 5 years ago

Yes that’s what I mean.

xclayl commented 5 years ago

I've pushed up the complete schema for the yaml files in my pull request.

In terms of hosting, you may want to consider http://schemastore.org/json/, which will host the schema files for us. From what I can tell, you do a pull request to a github repo to add/update a json schema.

jmafc commented 5 years ago

Although I merged your latest pull request, I haven't yet had the time to look at it, but I just ran (again) into something that makes me (again) doubt the advisability of simply editing YAML files and not using dbtoyaml. Say you have a char column to which you want to add a CHECK constraint, something like CHECK (col IN ('abc', 'def', 'ghi')). How would you (or any "intellisense") know that (currently) the correct way to write the expression attribute is (col = ANY (ARRAY['abc'::bpchar, 'def'::bpchar, 'ghi'::bpchar]))? Another example that comes to mind is a view that you write using EXTRACT(YEAR FROM some_date_col), which PG instead insists on storing and subsequently displaying as date_part('year'::text, x.some_date_col).
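In YAML spec terms, that means the only form that won't generate a spurious difference is the stored one, i.e., something like this (names invented, attribute set from memory):

  check_constraints:
    t1_col_check:
      columns: [col]
      expression: (col = ANY (ARRAY['abc'::bpchar, 'def'::bpchar, 'ghi'::bpchar]))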

jmafc commented 5 years ago

Here are other, much simpler cases. If you have a CHAR(n) column and a VARCHAR(n) column and want to specify the default for both as (in SQL) '', in the YAML you have to use '''''::bpchar' and '''''::character varying', respectively. If you want to specify 0 as the default for an integer, in YAML it has to be '0'. A default of standard SQL CURRENT_TIMESTAMP for a timestamp column? Use now() instead. And CURRENT_DATE for a date? Use ('now'::text)::date. Granted, yamltodb could accept the more standard representations and adjust them to what PG is expected to store in its catalogs, so that it wouldn't generate spurious differences, but that would be extra work (and then someone would ask dbtoyaml to output the more standard representations as well).
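Put together, a column list using those defaults would be written something like this (column names invented for the example):

  columns:
  - code: {type: character(3), default: '''''::bpchar'}
  - label: {type: character varying(10), default: '''''::character varying'}
  - qty: {type: integer, default: '0'}
  - created: {type: timestamp with time zone, default: now()}
  - as_of: {type: date, default: ('now'::text)::date}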

xclayl commented 5 years ago

I had a similar situation happen to me last week. I think I had to write a check constraint with something non-obvious like ‘-1’::integer, to your point. But it’s not a horrible situation (and similar things happen with SQL Server and SSDT), so for a few days the deployment would drop and recreate the check constraints. Not ideal, but it works. Eventually I got tired of this and found the pg_constraint catalog, which showed the string Postgres expected, and now it doesn’t drop and recreate anymore.
One option is to show the existing constraint expression if it is being replaced/overwritten.