nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 128 forks source link

augur export command line syntax #308

Closed jameshadfield closed 5 years ago

jameshadfield commented 5 years ago

This issue follows on from PR #305, specifically these comments:

@tsibley

I would encourage the commands to be augur export v1 and augur export v2 and explicitly not support something like augur export default or augur export latest. The reason is that if builds always have to explicitly pick a version, they won't break when the default/latest changes and can update to the new version on their own terms/timeline.

@trvrb

I'd like to not break existing builds if possible. I don't quite see why we can't just have 'augur export -- v1' and 'augur export --v2' and have 'augur export' default to 'augur export --v2'. If augur is upgraded, an existing build would go to producing v2 JSONs but 'auspice view' should transparently work with this.

I think of this like docker where the default is latest, but you can pin to a version if desired. I'm not sure I like forcing a version. This will just result in people never upgrading to v2.

jameshadfield commented 5 years ago

I would like "v2" (i.e. "unified") to become the default, while maintaining backwards compatibility and supporting meta+tree jsons for a while at least.

Question: what happens if, for instance, next month we add a "logo" field to the JSON. The JSON version then gets pushed to v2.1? How should this affect how augur export is called?

tsibley commented 5 years ago

I'd like to not break existing builds if possible.

I agree. It would be really nice to keep back compat. We should aim for that in whatever we do!

If augur is upgraded, an existing build would go to producing v2 JSONs but 'auspice view' should transparently work with this.

I'm not sure I understand. This seems like it would only be true if all JSON versions are backwards compat (i.e. only adding new optional fields, never renaming fields, no new required fields, the same files, etc), which if that was the case, we wouldn't be having this conversation. What's your thinking on how this would work?

This also assumes Auspice is upgraded when Augur is (to support any new JSON version), which doesn't seem like a certainty to me.

I think of this like docker where the default is latest, but you can pin to a version if desired.

Using the "latest" tag for container images in a build/deployment you want to be robust is pretty widely considered bad practice. Instead, people explicitly declare the version/tag they want. The reasons are the same here: with "latest" you often hit an upgrade unexpectedly in ways that break your work.

This "surprise upgrade" problem is one thing that I heard expressed as a common source of frustration with augur/auspice at the package level, and it informed how I built the Nextstrain CLI, for example. I think we should avoid that same issue at the command/data format level.

There's a lot of prior art here. This kind of versioning issue is not new, and the field's simplest solutions always tend towards explicit versioning.

This will just result in people never upgrading to v2.

I don't think this is true. People will upgrade when they a) want the cool new feature that's only available in v2 or v3 or v4, etc or b) when we deprecate and remove v1 after a few release cycles. I think supporting a few versions in a rolling window will let people's builds keep working while still giving them plenty of time (and warning) to upgrade.

I also think people will get more confidence in upgrading Augur as a whole (and thus do it more frequently) if their use of it doesn't break when they do. I see auto-upgrading their JSON version as a potential source of breakage that would, once bitten, cause them to be twice shy about upgrading Augur at all.

trvrb commented 5 years ago

Okay. Thank you for the thoughts @tsibley. This is very clarifying and you make really good points (and are generally more knowledgable about this sort of stuff than I am).

How do we handle the following (I imagine common) scenario:

I had assumed above that the best way to get a user to working system would be to do as I was alluding to above and have augur export produce v2 JSONs by default and everything will immediately work when a user upgrades to latest versions. Note that a user could still explicitly specify augur export --schema v1 if they wanted to keep using auspice 1.0 in a stable fashion.

Using the "latest" tag for container images in a build/deployment you want to be robust is pretty widely considered bad practice. Instead, people explicitly declare the version/tag they want. The reasons are the same here: with "latest" you often hit an upgrade unexpectedly in ways that break your work.

I had thought of this like so: augur v5.0 produces --schema v1 by default, while augur v6.0 produces --schema v2 by default. If people are using semantic versioning their builds shouldn't break.

This seems like it would only be true if all JSON versions are backwards compat (i.e. only adding new optional fields, never renaming fields, no new required fields, the same files, etc), which if that was the case, we wouldn't be having this conversation. What's your thinking on how this would work?

I had assumed (but may definitely have been wrong) that auspice server would do a schema migration before handing auspice v2 data structure. This (https://github.com/nextstrain/auspice/blob/master/cli/server/getDataset.js#L128) had made me think the current server was doing this, but I see now that's it's combining JSONs but not restructuring them.

I believe that we need something like this if nextstrain.org/community will work. We can't expect that everyone will upgrade existing JSONs and if we are going to pull these JSONs into auspice 2.0 served from nextstrain.org, we'll need to do a migration.

jameshadfield commented 5 years ago

I had assumed (but may definitely have been wrong) that auspice server would do a schema migration before handing auspice v2 data structure. This (https://github.com/nextstrain/auspice/blob/master/cli/server/getDataset.js#L128) had made me think the current server was doing this, but I see now that's it's combining JSONs but not restructuring them.

v1 -> v2 JSON conversion is happening with auspice v2 -- see https://github.com/nextstrain/auspice/blob/v2/cli/server/convertJsonSchemas.js#L269. Auspice v1 (i.e. current master) never actually used v2 JSONs 😱

trvrb commented 5 years ago

Ah! Thank you for clarifying this James.

tsibley commented 5 years ago

Yep, I did assume we would keep Auspice support for older schema versions (via schema upgrades/migrations/mappings). At the least, this would need to stay in lock-step with what the latest Augur supports.

I had thought of this like so: augur v5.0 produces --schema v1 by default, while augur v6.0 produces --schema v2 by default. If people are using semantic versioning their builds shouldn't break.

I agree in principle that a default schema version could be bumped alongside an appropriate package version bump, but I think in practice requiring people to always provide an explicit --schema vX (or whatever the syntax is) ends up with friendlier and less surprising behaviour.

We do have a one-time transition to this schema versioning system for the current augur export. I think an important consideration is that it's a one-time cost to get us to a system that won't extract that cost every time. Here's what I see, which I think is (or at least complements) what you proposed:

trvrb commented 5 years ago

Okay. This makes sense @tsibley. I'm on board for your described plan with explicit schema versioning.

rneher commented 5 years ago

I think this is a sensible plan. I don't like the sub-subcommand idea though (augur export v1). All the user wants to do is export and specifying a format for export makes sense to me. In this light, I would prefer --format or --output-format over --schema because I feel that schema has a much more general meaning as the specific one we use here and "file format" is a notion most people a comfortable with.

emmahodcroft commented 5 years ago

This is perhaps a bit of a side issue and one I know I've mentioned before but I feel strongly about it and would like to say again: whatever we decide to do, can we put in place checks whenever possible to look for 'old' commands and tell the user exactly what they need to change to get their build working (and, if applicable, what feature(s) won't work anymore, etc).

Speaking from my own past behaviour, while we we may 'gently depreciate and then remove' something, a lot of people will never see this. They will update software every year or two and it will suddenly not work, and by that point the breaking change documentation is buried so deep on a blog or helpfile it's useless. This used to happen to me, and the answer was immediate roll-back, and try to set aside an entire day sometime to try and figure out what I had to do to get my runs working. I hated every single minute of it. During my last post-doc, this became such an issue that Manon, Gonzalo, and I kept an old install file for one piece of software like sacred gold to install on any new servers/computers/etc because it just worked with all our existing scripts, and we'd never been able to figure out what the breaking change was (not worth our time!!).

I realise putting this in will take some extra time on our part, to recognise old commands and explain them to users (including perhaps 'you will need to also upgrade auspice, instructions for this can be found here'), but it would have absolutely revolutionised how I treated upgrades and made my life a thousand times better if other programs had done this! And, once people realise that even breaking changes will be accompanied by useful info (in the error, not on a webpage somewhere) on how to get going again, they won't be afraid to upgrade!

kairstenfay commented 5 years ago

This is perhaps a bit of a side issue and one I know I've mentioned before but I feel strongly about it and would like to say again: whatever we decide to do, can we put in place checks whenever possible to look for 'old' commands and tell the user exactly what they need to change to get their build working (and, if applicable, what feature(s) won't work anymore, etc).

Speaking from my own past behaviour, while we we may 'gently depreciate and then remove' something, a lot of people will never see this. They will update software every year or two and it will suddenly not work, and by that point the breaking change documentation is buried so deep on a blog or helpfile it's useless. This used to happen to me, and the answer was immediate roll-back, and try to set aside an entire day sometime to try and figure out what I had to do to get my runs working. I hated every single minute of it. During my last post-doc, this became such an issue that Manon, Gonzalo, and I kept an old install file for one piece of software like sacred gold to install on any new servers/computers/etc because it just worked with all our existing scripts, and we'd never been able to figure out what the breaking change was (not worth our time!!).

I realise putting this in will take some extra time on our part, to recognise old commands and explain them to users (including perhaps 'you will need to also upgrade auspice, instructions for this can be found here'), but it would have absolutely revolutionised how I treated upgrades and made my life a thousand times better if other programs had done this! And, once people realise that even breaking changes will be accompanied by useful info (in the error, not on a webpage somewhere) on how to get going again, they won't be afraid to upgrade!

This is an excellent point. The better we can make our error messages (including links to external documentation), the happier our users will be.

trvrb commented 5 years ago

Very good point @emmahodcroft.

@rneher: I also prefer not to make this a subcommand. However, --format and --output-format make me think the options will be JSON vs YML etc... I think of file endings with the word format. If you'd like to avoid the word 'schema', we could consider something like --version v2. But really this is a weak preference. I'd be pretty much equally happy with --schema, --format and --version.

kairstenfay commented 5 years ago

One of the primary reasons @joverlee521 and I chose to use export v1/v2 subcommands in development over a named argument like --schema (etc.) is because it decluttered the --help message.

With subcommands, you don't get a help message containing v1-specific arguments if you're using v2, and vice-versa.

tsibley commented 5 years ago

The isolation of options/arguments for v1 vs. v2 is, I think, a huge usability win for export v1/v2 subcommands.

@rneher @trvrb Why do you dislike the subcommand approach?

trvrb commented 5 years ago

I see. This makes sense. My opinion was based purely on semantics. I think of subcommands as doing different things, while v1 vs v2 felt like minor distinction. But the upgrade flipping around other command line arguments does add obvious confusion.

jameshadfield commented 5 years ago

re: sub-sub commands. I agree in principle, but the decluttering of arguments makes such a huge difference. I don't know a different solution here 🤷‍♂


@emmahodcroft completely on board with this. Right now, on branch v6 running an "old" command results in:

augur export: error: the following arguments are required: Augur export now needs you to define the JSON version you want, e.g. augur export v2.

But we can add much more here, including URLs to the docs.


Another tangent: I worry that all these version names are too confusing -- e.g. augur v6, auspice v2, JSONs v2 all mean different things and a lot to keep track of.

trvrb commented 5 years ago

@jameshadfield: Can we keep JSON version tied to Auspice version? Would this be smart? This is pretty effectively semantic versioning.

jameshadfield commented 5 years ago

Can we keep JSON version tied to Auspice version? Would this be smart? This is pretty effectively semantic versioning.

Yes... we could. As long as we're happy with situations (e.g.) where auspice v3 doesn't change the JSONs at all but we call them v3 jsons anyway (i.e. v2 jsons are identical to v3 jsons). Not sure if this is a great idea but it does reduce cognitive load and is easy to explain.

trvrb commented 5 years ago

I think this makes a decent amount of sense, ie "Use v3 JSONs if you're using auspice >3.0". Easy to remember and easy to know what change export to / easy to know what to upgrade auspice to.

The biggest issue I'd possibly see is having some new fancy feature that would make you want to call auspice 3.0, but where we don't need to upgrade to v3 JSONs. In this case, I'd be okay with a feature release (to 2.1 or whatever) to keep with semantic versioning.

rneher commented 5 years ago

no strong feelings against subcommands. my thought is that a user wants to export the analysis with mostly identical inputs which are processed and assembled in a slightly different way and exported in a format the user doesn't actually care about since the only requirement is being compatible with the auspice version used. Having different commands for the same thing seems weird to me. but if it is cleaner to implement that way, I have no real objections.

emmahodcroft commented 5 years ago

@jameshadfield Great! Yes, I just found this testing out the v6 branch. It would be good if we can include a link to some page 'What version should I be using?' where we can help people figure this out in some detail (and explicitly draw the line between export version and what version of Auspice it will work with).

jameshadfield commented 5 years ago

What are people's thoughts on the proposal to have the JSON version synonymous with the auspice (major) version? Certainly it's easy to explain in the docs and makes sense in that you are "exporting data for auspice"...

Breaking changes to the JSON will require a major auspice version bump, so it's good in that direction.

The flipside is that major changes to auspice may be unrelated to the JSON, and therefore we'll be in a situation one day where , e.g., auspice export v2 is identical to auspice export v3. Are people ok with this?

tsibley commented 5 years ago

@rneher Good point that the user likely doesn't care about the actual format. I agree! If we could guarantee a stable interface for generating the different versions of Auspice input formats and thus use a single command, that'd be ideal. I don't think we can, though, as evidenced by the divergence between v1 and v2 already. Maybe they could be reconciled more than I think?

@jameshadfield Re: keeping versions in lockstep with Auspice, I think it would simplify things on the whole, although introduce other complications. For example, since the idea is that Auspice would keep supporting some older versions of the JSON, it makes it simpler to explain that Auspice v2 JSONs are supported by Auspice v3, but not vice versa. This is complicated by and misleading if v2 and v3 JSONs are identical except in version number. It seems like we'd sometimes end up asking people to change a 3 to a 2 to make a JSON work with their Auspice version. :-/

tsibley commented 5 years ago

Other than that, I don't have much of an issue with v2 and v3 JSONs being potentially identical.

trvrb commented 5 years ago

I'm in favor of tying JSON version to auspice version. I'd be completely alright with going from auspice 2.0 to auspice 2.1 even if there are major new features as long as auspice 2.1 still uses v2 JSONs. This is the essence of semantic versioning (where versions are based on compatibility and not marketing). Ie I think the only time we'd need to bump auspice to a new major release is if JSON spec updates.

@jameshadfield: Can you envision at change to auspice that would require a bump to 3.0 regardless of JSON spec? I suspect there are cases I'm missing.

tsibley commented 5 years ago

One such case would be the extension API/config/auspice build interface changing.

jameshadfield commented 5 years ago

Can you envision at change to auspice that would require a bump to 3.0 regardless of JSON spec? I suspect there are cases I'm missing.

There are a few here -- extension (build config) API // client-server API // CLI interface // available dataset JSON format // narratives format

If we could guarantee a stable interface for generating the different versions of Auspice input formats and thus use a single command, that'd be ideal.

augur export default would be ~the~ one way to do this (I know people don't like this, but it'd work...)

tsibley commented 5 years ago

augur export default would be ~the~ one way to do this (I know people don't like this, but it'd work...)

I don't see how it would work. Can you explain more?

trvrb commented 5 years ago

After cogitating on this for a while, I have the following preferences:

  1. I do think it will be cleaner to pin auspice JSON schema version to auspice version. This seems to fit with normal application behavior. My analogy would be WordPerfect. WordPerfect 9.0 saves WordPerfect 9 documents, but it can transparently open WordPerfect 6, 7 and 8 documents. If WordPerfect is upgraded to 10.0 it will save WordPerfect 10 documents even if not strictly required to bump document version.   This preference is a moderate one, but I'd be very okay splitting JSON version and auspice version as well.

  2. I don't see a cleaner way to do the command line syntax than using subcommands, eg auger export v1, augur export v2, etc... because of the help issue.

  3. I understand @tsibley's desire for never surprising user (https://github.com/nextstrain/augur/issues/308#issuecomment-507852789). But at this point I would prefer allowing augur export default as an option that would say use augur export v1 in augur 5.0, but start using augur export v2 in augur 6.0. If someone has a snakefile that specifies augur export default and then they upgrade augur to 6.0 and this breaks the build due to changes in command line syntax, then the resulting error should be obviously interpretable. Ie if --title doesn't exist in augur export v2, but did exist in augur export v1 then argparse will throw the appropriate error about --title being invalid.   However, I would keep still explicitly call augur export v2 in our Nextstrain builds.   Allowing augur export default is a slight preference.

tsibley commented 5 years ago

:+1: in general to @trvrb's prefs. I still have reservations about augur export default (and don't understand why it would be desired to use it), but if it's to exist, then I think augur export latest might be more descriptive.

jameshadfield commented 5 years ago

Assigning @trvrb & @tsibley to this issue. Thanks in advance.

emmahodcroft commented 5 years ago

Hypothetical:

I am using augur export latest. When I start, this is 'secretly' augur export v2. I use augur regularly but in a basic way, so I don't pull/update for a year.

Maybe I move to a new computer and so reinstall augur (which is now the latest one), or update without really thinking. augur export latest is now 'secretly' augur export v4.

When I try to run my build with snakemake and some new data, I get a ream of errors to fix. They're clear, but I don't have time to address them right now - I have a presentation and I just need it to work! I'm advised to use an earlier version of augur export.

Since I am running in snakemake, when it started running, it deleted my old auspice/newvirus.json file - with the version in it.

How do I find out what version of augur export I was running/outputting before, so I can switch to using the explicit command to get it working now?

I by no means think this is insurmountable but I've been in situations like this (though admittedly, worse, because there's usually NO version info), and it's so frustrating - so we should have a plan!

(Perhaps an alternative, if we always require explicit version (no augur export latest) - after pulling/updating, it detects you're using augur export v2 and prints: "Augur export has been updated to version 4! To make the most of our new features, we recommend you switch to using augur export v4. It's easy to make the changes to get running in v4. First follow our guide to move from v2 to v3 here: xxx, then move from v3 to v4 here: xxx. To suppress this message until the next release, use the --suppressv4 argument.")

For me the barrier to updating stuff is always that a) it's supposed to be easy and it never is, b) there's no good guide on what I need to change, c) it always happens exactly when I don't have time for it. Whichever way we choose, it should get that barrier really low!

We should probably also, either way, say something like "By this way, if you haven't already, you'll need to upgrade Auspice to view this new output format." I'm not sure that'll be obvious to people.

trvrb commented 5 years ago

I'm sure this is exactly what Tom was thinking about with his recommendation, but thank you Emma for spelling this out. This is very clear and persuasive. I'm fine with not allowing 'augur export latest' and requiring specific versions instead.

tsibley commented 5 years ago

It was indeed. :-) Thank you, Emma!

emmahodcroft commented 5 years ago

Are we happy with this now, or are there other things in this issue we need to discuss before closing? :)