jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/

enhancement request: import and module #425

Closed pkoppstein closed 10 years ago

pkoppstein commented 10 years ago

Since jq's version number is greater than 1, jq urgently needs some kind of module system that will help avoid naming collisions. The recent addition of the env filter highlights the need. Had jq modules been available, such a system-dependent function could well have gone in a jq-provided "System" module, or users could have protected their own function named env by putting it in a module of their own.

PRIMARY GOALS

[1] Provide a mechanism for avoiding namespace collisions.

This includes the ability to avoid namespace collisions not only of jq functions but also of named collections of functions (modules) themselves.

[2] Support the packaging of related functions.

There has been some discussion about "libraries" or "packages" of functions, e.g. for Unicode support. Providing these libraries as modules would facilitate their description, versioning, dependency management, etc.

[3] Support the definition of evaluation contexts.

Proposed additions to jq such as eval would benefit from module support, e.g. jq 'eval(STRING, MODULE)' would very roughly be like:

jq -f <(cat MODULE_CONTENTS ; echo STRING)

That is, jq would compile STRING in the context of MODULE, and then filter the input accordingly.

PROPOSED SYNTAX

Summary

IMPORT::

import RESOURCE [as IDENTIFIER];

MODULE::

module IDENTIFIER
  JSON_OBJECT?
  IMPORT*
  DEFINITION*
end

This proposal introduces two new reserved words:

import
module

However, if for some reason "as" cannot be used as a keyword here, then "alias" would be recommended.

Invocation of a function defined in a module

Invoking a function, f, defined in a module, M (or in a module aliased as M):

M::f

Rationale: i) Using . or even ":" as the separator raises too many issues. (*)

ii) Commandeering a special character such as & to serve as a prefix sigil (as in &M.f) would be wasteful and probably more confusing than helpful.

Module Definition

module MODULENAME
  JSON_OBJECT?
  IMPORT*
  DEFINITION*
end

where:

JSON_OBJECT if specified is a JSON object that can be used for giving details about the module (version, author, etc);

IMPORT ::= import RESOURCE [as ALIAS];

DEFINITION is a jq function definition;

RESOURCE is a JSON entity specifying a file or URL; for example, the string "http://modules.jq.org/unicode.jq"; the referenced file or URL should be a valid jq program.

The "import" directive allows one module to "include" another. The function definitions so included become available both within MODULENAME and wherever MODULENAME functions are available. In all cases, however, the MODULE::FUNCTION syntax must be used. There is no nesting of modules.

Rationale: The proposed syntax for module definitions allows existing function definitions to be "copy-and-pasted" into a module, and yet is flexible enough to support other functionality. The 'end' keyword already exists and is appropriate since each IMPORT and each DEFINITION is terminated with a semicolon.

Example:

module MyModule def id(x): x; end

Loading a Module

For the initial implementation, it would be sufficient simply to allow the '-f FILESPEC' option to be specified multiple times. For example:

jq -f MYMODULE.jq -f MYPROGRAM.jq

would be equivalent to jq -f <(cat MYMODULE.jq MYPROGRAM.jq)

The major enhancement would be to support the "import" directive generally, i.e.

import RESOURCE as NAME;

This allows a module to be imported as though it were named differently.

Module Description

The JSON object included in the definition of one module can be used to include a description of the module, its version number, etc.

Relative Paths

The I/O enhancements for jq that are underway may obviate the need for additional options, but if not, one possibility would be for jq to support the concept of module paths. These could, for example, be specified using a "--path" option.

Proposed Simplifications

  1. Module names must begin with a capital letter (ASCII).
  2. If a function, f, is defined within a module, M, then except within M, all invocations of f must use the form M::f.
  3. One module can import another, but modules cannot be nested.
  4. If a RESOURCE defines more than one module, then the "import RESOURCE as NAME;" form cannot be used.

Additional Features

Needless to say, there are many other possible enhancements beyond the above skeletal proposal, but hopefully whichever enhancements are adopted can be built on the foundations of the basic module system described above.

Footnote:

(*) If M is both a user-defined module and a user-defined function, then M.f would at best be ambiguous; at worst one would shadow the other, defeating one of the goals for having a module system in the first place.

As for using ":" as the module/function separator, it has been observed that expressions such as {"a": M:a} may be more difficult to read than {"a": M::a}.

nicowilliams commented 10 years ago

The need is much less severe than you think because each def replaces previous ones for the purposes of binding subsequent defs. Therefore the addition of new defs to builtin.c does not cause backwards compatibility issues.

nicowilliams commented 10 years ago

Proof:

$ echo $SHELL
...
$ jq -n env.SHELL
<same as above>
$ echo 'def env: {"SHELL":"ha! builtin env overridden as expected"};' >> ~/.jq
$ jq -n -r env.SHELL
ha! builtin env overridden as expected
$

:)

pkoppstein commented 10 years ago

@nicowilliams -- Yes, I was aware of the overwriting feature, and yes, I realize that if a user had a library file that defined env, then that user's old programs need not change, but if the user now wants to use the new builtin env in a jq program that requires the library, he or she will have to change the library. Whether you call that a "backwards compatibility issue" or not, it is an issue -- I would say a major issue with respect to software with a version number greater than or equal to 1.0.

nicowilliams commented 10 years ago

It's a minor issue, really. But we agree that there should be a bit more syntax for including libraries. Right now all we have is ~/.jq, and that's a bit lame -- it's not remotely the right approach for production applications, for example. What the result ends up looking like is still up in the air, and I won't tackle it until the I/O and other higher-priority work is done. But if you send me a PR then I'd have to look at it sooner than later :)

nicowilliams commented 10 years ago

See also #112.

nicowilliams commented 10 years ago

@wtlangford I'm thinking we need a pseudo-opcode by which the parser can encode the desire to import some library. Then in jq_compile_libs_args() (and jq_compile_args()) we'd extract these imports, find and parse the given library, rename defs in the parsed result, then block_bind_referenced() the result to the body of whatever wanted the import (which might be a library).

In compile() we'd then ignore this new pseudo-opcode.

There are more details (e.g., private vs. public symbols). But I think that's a decent sketch.

wtlangford commented 10 years ago

@nicowilliams Sounds fine to me, but what do you mean by rename defs?

nicowilliams commented 10 years ago

@wtlangford I mean adding a prefix, for namespace management purposes, import foo as bar.

nicowilliams commented 10 years ago

@wtlangford This change allows for the use of '::' in identifiers:

diff --git a/lexer.l b/lexer.l
index b51ab1f..3a74ef4 100644
--- a/lexer.l
+++ b/lexer.l
@@ -114,6 +114,7 @@ struct lexer_param;

 [a-zA-Z_][a-zA-Z_0-9]*  { yylval->literal = jv_string(yytext); return IDENT;}
+[a-zA-Z_][a-zA-Z_0-9]*::[a-zA-Z_][a-zA-Z_0-9]*  { yylval->literal = jv_string(yytext); return IDENT;}
 \.[a-zA-Z_][a-zA-Z_0-9]*  { yylval->literal = jv_string(yytext+1); return FIELD;}

 [ \n\t]+  {}

A relatively simple change to parser.y will parse "import" declarations. Then we need to code up a function to generate the block representation of defs, and modify jq_parse*() to check for declared imports, load each library, rename its symbols as appropriate, then block_bind_referenced() each loaded dependency to the result of the parse of the program/library.

Something like that.

I think this syntax will do:

import "foo";
import "foo" as "bar";
import "foo" search "@ORIGIN/../lib";
import "foo" as "bar" search "@ORIGIN/../lib";

The symbols of the library to be imported would be prefixed with "%s::", where the string is either the name of the library or the alias given in the import.
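
The renaming step described above can be sketched in Python. This is only an illustration of the prefixing rule, not jq's implementation: the dict of definitions and the function name import_library are hypothetical stand-ins for the parsed block representation.

```python
def import_library(defs, name, alias=None):
    """Return a copy of `defs` with every symbol prefixed by "<prefix>::".

    `defs` is a hypothetical stand-in for a parsed library: a dict mapping
    function names to their bodies. The prefix is the alias when one is
    given, otherwise the library name.
    """
    prefix = alias if alias is not None else name
    return {f"{prefix}::{symbol}": body for symbol, body in defs.items()}


# Hypothetical library "foo" defining two functions.
foo_defs = {"id": "def id(x): x;", "twice": "def twice(f): f | f;"}

plain = import_library(foo_defs, "foo")            # import "foo";
aliased = import_library(foo_defs, "foo", "bar")   # import "foo" as "bar";

print(sorted(plain))    # ['foo::id', 'foo::twice']
print(sorted(aliased))  # ['bar::id', 'bar::twice']
```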

This looks pretty easy to pull off with the library system machinery in place.

wtlangford commented 10 years ago

This seems fine to me. At first glance, the only issue I see is circular dependencies. The nice thing about the -l library system is that circular dependencies actually work fine, you just need a -llib1 -llib2 -llib1. With this module system, we'll have to keep track of what modules/libraries have already been loaded.

wtlangford commented 10 years ago

We should hammer out some semantics/rules for the search path portion of this, though.

nicowilliams commented 10 years ago

I want ELF linker-style $ORIGIN semantics for sure. The alternative is to not have relocatable modules and programs.

EDIT: $ORIGIN, not @ORIGIN.

nicowilliams commented 10 years ago

Regarding circularity, there are two problems: infinite recursion, and code duplication. Since jq programs and libraries have no global state, code duplication is merely suboptimal, and I can live with that for now. The minimum we must do is notice circular dependencies, and maybe not even: we can always document that they are not supported.
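
A minimal Python sketch of how a loader might handle both problems, assuming a hypothetical dependency table (module name to list of imported names) in place of real parsing. It loads each module once and reports a cycle instead of recursing forever; it is not how jq does it.

```python
def load(name, imports, loaded=None, in_progress=None):
    """Return the module names in load order, loading each module once.

    `imports` is a hypothetical dependency table: module name -> list of
    the modules it imports. Raises ValueError on a circular import.
    """
    if loaded is None:
        loaded = []
    if in_progress is None:
        in_progress = []
    if name in loaded:
        return loaded  # already loaded: avoids code duplication
    if name in in_progress:
        cycle = " -> ".join(in_progress[in_progress.index(name):] + [name])
        raise ValueError(f"circular import: {cycle}")
    in_progress.append(name)
    for dep in imports.get(name, []):
        load(dep, imports, loaded, in_progress)
    in_progress.pop()
    loaded.append(name)
    return loaded


# "main" imports "a" and "b"; "a" also imports "b".
print(load("main", {"main": ["a", "b"], "a": ["b"], "b": []}))
# ['b', 'a', 'main']
```

With a table such as {"a": ["b"], "b": ["a"]}, the call load("a", ...) raises ValueError naming the cycle a -> b -> a.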

nicowilliams commented 10 years ago

@wtlangford To expand on $ORIGIN, any search path element in an import statement that starts with $ORIGIN/ should have "$ORIGIN/" replaced with the path to the directory containing the library where the import statement was found. So if a library is found in /opt/foo/bar/lib/foo.jq and the search path contains $ORIGIN/ then the path to be searched will be /opt/foo/bar/lib/.

This allows for relocation: if you relocate this package to /opt/foobar/ so that the lib path were now /opt/foobar/lib, the search for foo.jq's dependencies from the same package will still succeed.

Declared search paths should be searched before the system or JQ_LIBRARY_PATH directories. If we have this from day one then perhaps JQ_LIBRARY_PATH will not be abused. JQ_LIBRARY_PATH should only be used for running a jq executable outside its originally-intended install location.
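
The $ORIGIN rule and the search order can be sketched in Python as follows. The function name search_dirs and its parameters are hypothetical, POSIX-style paths are assumed, and the relative order of JQ_LIBRARY_PATH and the system directories is an assumption, since the comment only says that declared paths come first.

```python
import posixpath


def search_dirs(declared, importer_path, env_path="", system_dirs=()):
    """Return the directories to search, in order, for one import.

    declared      -- search path elements written in the import statement
    importer_path -- path of the file containing the import statement
    env_path      -- value of JQ_LIBRARY_PATH (colon-separated, assumed)
    system_dirs   -- system library directories (hypothetical parameter)
    """
    origin = posixpath.dirname(importer_path)
    dirs = []
    for entry in declared:
        # Replace a leading $ORIGIN with the importing file's directory.
        if entry == "$ORIGIN" or entry.startswith("$ORIGIN/"):
            entry = origin + entry[len("$ORIGIN"):]
        dirs.append(posixpath.normpath(entry))
    # Declared paths first, then JQ_LIBRARY_PATH, then system directories.
    dirs.extend(d for d in env_path.split(":") if d)
    dirs.extend(system_dirs)
    return dirs


print(search_dirs(["$ORIGIN/"], "/opt/foo/bar/lib/foo.jq"))
# ['/opt/foo/bar/lib']
print(search_dirs(["$ORIGIN/../lib"], "/opt/foobar/bin/prog.jq",
                  env_path="/usr/local/lib/jq"))
# ['/opt/foobar/lib', '/usr/local/lib/jq']
```

Because the declared entry is resolved relative to the importing file, moving the whole package to another prefix changes the result consistently, which is the relocation property described above.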

pkoppstein commented 10 years ago

One question is whether the import statement should allow the specification of version constraints, or whether such constraints belong (exclusively) in the module metadata (JSON_OBJECT).

Another question is whether every module must give its version number. If so, then presumably JSON_OBJECT would be required, which is undesirable. Thus, the package management system will have to be able to manage unversioned modules.

ASSUMING that all the metadata about versions is going to be placed in JSON_OBJECT, I would propose the following specification:

JSON_OBJECT is the repository for the module's metadata.
If JSON_OBJECT is given, then the following keys have special significance,
 and if given should have values as specified here:

"version": SEMANTIC_VERSION

"requires": ARRAY OF {"module": STRING, "version": SEMANTIC_VERSION_RANGE}

where:

SEMANTIC_VERSION is a string following the semantic versioning scheme;
SEMANTIC_VERSION_RANGE is either a SEMANTIC_VERSION (the minimum acceptable
version) or a string consisting of two tokens that together specify a range of acceptable
versions (see  http://julia.readthedocs.org/en/latest/manual/packages/#requirements)

Example:

{"version":  "1.2.3", "requires": [ { "module": "Statistics", "version": "0.1 0.2-"} ] }

[The above has been edited to indicate that JSON_OBJECT is optional, and that its special keys are also optional.]

nicowilliams commented 10 years ago

One step at a time. The only urgent decisions are: a) must modules declare a version, b) must dependents declare a minimum version for each dependency.

I'm inclined to integrate @wtlangford code as-is and revisit versioning later.

I'm also inclined to make versioning optional: jq is a friendly language with very little ceremony. Versioning is a best practice; making it required is not required.

But we do need versioning. One problem that comes up is: how to represent versions. We only need jq to enforce a minimum, and major version boundaries too.

My preferred version representation would be: as a number, with the integer portion representing a major version number and the fraction representing a minor version number. But the major number could just be made part of the module name, which makes sense if it represents a backwards incompatible change vis-a-vis the previous major version. Micro versions shouldn't be numbers, but numbers or strings (e.g., hash values, git commit hashes, ...).
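
A rough Python sketch of the check this representation would allow, assuming the version is a plain number whose integer part is the major version. The function name satisfies is hypothetical, and micro versions are left out.

```python
import math


def satisfies(available, minimum):
    """True if `available` meets `minimum` without crossing a major version."""
    same_major = math.floor(available) == math.floor(minimum)
    return same_major and available >= minimum


print(satisfies(1.4, 1.2))  # True: same major, minor is new enough
print(satisfies(1.1, 1.2))  # False: below the required minimum
print(satisfies(2.0, 1.2))  # False: different major version
```

One consequence of a purely numeric representation is that 1.10 and 1.1 are the same value, so minor versions would need a fixed number of digits to stay ordered.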

pkoppstein commented 10 years ago

@nicowilliams wrote:

jq is a friendly language with very little ceremony.

Agreed, so I've revised the description of JSON_OBJECT to make everything optional, but I believe that in the interests of simplicity for the jq user, "registered packages" would be required to provide this kind of information.

The question remains, however, whether we want the "required version" information to be part of the metadata (JSON_OBJECT) or part of the import statement. It seems to me there are pros and cons either way.

Another question which I don't think has been addressed yet is whether (in the interests of minimal ceremony for jq users) the JSON_OBJECT should also be the locus of any information that may be required to ensure dependencies can be located without user intervention. The goal I have in mind is that the jq user should be able to add (using Pkg::add/1) and then import any registered module without having to specify anything about where that module or its dependencies are located.

nicowilliams commented 10 years ago

@pkoppstein For trusted (i.e., locally-found) modules there's nothing wrong with using $ORIGIN-based search paths from the module. For modules downloaded from the 'Net... well, we'll figure that out when we get there (ideally we'd have named repos and modules would be searched for in selected repos; no URIs in sight, but URNs yes, and if you want to use modules not in any repo then you'll have to make a local directory of said modules).

Adding syntax is always possible, and clearly we'll have to when we add versioning. I'll look at that as soon as we're done with the main part of the module system. If the metadata we need for the linker/loader is quite limited then I don't mind, and maybe prefer, having it not be an object (which we can always add later). At the moment I'm thinking that the only thing we really need version-wise is a minor version number (refreshing simplicity!).

nicowilliams commented 10 years ago

BTW, I don't relish the thought of adding the bloat of HTTP and TLS libraries to jq just to have a pkg system builtin. I realize that it would be oh so convenient. I'm tempted instead to rely on spawning a curl(1) process. I want to draw the line at regexp (which we now have, thanks to @wtlangford!) and maybe rudimentary Unicode support (Oniguruma has some, but it's not exported, and it lacks normalization code). After that no new external dependencies; everything else in modules. We'll need a C-coded module system using dlopen(), LoadLibraryEx() (more on that some other day).

pkoppstein commented 10 years ago

@nicowilliams wrote:

I'm tempted instead to rely on spawning a curl(1) process.

Great minds! This is the code from Julia:

function curl(url::String, opts::Cmd=``)
    success(`curl --version`) || error("using the GitHub API requires having `curl` installed")
    out, proc = open(`curl -i -s -S $opts $url`,"r")
    head = readline(out)
    status = int(split(head,r"\s+",3)[2])
    for line in eachline(out)
        ismatch(r"^\s*$",line) || continue
        wait(proc); return status, readall(out)
    end
    error("strangely formatted HTTP response")
end

(Specifically: base/pkg/github.jl)

I mention this for several reasons beyond the obvious. First, I hope you'll take the time to become more familiar with Julia -- it represents the combined effort of some great 21st century minds. Second, much of Julia is written in Julia, and I expect that with a few more primitives (notably system), jq's package manager could also be written primarily in jq. Third, Julia is MIT-licensed, so with the right incantations, we should be able to borrow freely.

nicowilliams commented 10 years ago

We now have import (still gotta document it). I'm thinking of adding syntax allowing modules to start with a module declaration:

module NAME version NUMBER;

Later we would add a metadata object option. "Later" because I have no use for such metadata now, but will eventually. The object would store arbitrary constant metadata. Might as well add a const def sort of thing as well: in the jq language what appear to be JSON object/array value literals are really code for constructing them, since they needn't be constant literals, but a constant literal could be useful. Also potentially useful would be a data load directive (imagine writing Unicode handling code in jq, thus needing to load large-but-constant Unicode tables).

pkoppstein commented 10 years ago

@nicowilliams wrote:

module NAME version NUMBER;

Excellent, but in accordance with your previous observations about friendliness, I assume you mean:

module NAME [version VERSION];

Also, as a supporter of semantic versioning, and in the spirit of "convention over configuration", I'd recommend that VERSION be required to conform to the semver syntax. It will simplify things down the road.

If I understand http://semver.org/ correctly, the syntax of a valid semantic version number can be summarized as follows:

VERSION == NORMAL or VARIANT

NORMAL == NUMBER "." NUMBER "." NUMBER

NUMBER == 0 or [1-9][0-9]*

VARIANT == NORMAL "-" IDS

IDS == ID or ID "." IDS

ID == 0 or [A-Za-z1-9][A-Za-z0-9-]*

Examples: 1.2.3 1.2.3-alpha 1.0.0-0.3.7 1.0.0-x.7.z.92

However, I could see allowing NUMBER and NUMBER "." NUMBER as well.
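
For illustration, the grammar summarized above can be turned into a regular expression in Python. This is a sketch of that summary only, not a full semver validator (build metadata after "+" is not handled), and the name is_valid_version is hypothetical. The ID pattern here also admits a lone 0, which the example 1.0.0-0.3.7 requires.

```python
import re

NUMBER = r"(?:0|[1-9][0-9]*)"            # no leading zeros
ID = r"(?:0|[A-Za-z1-9][A-Za-z0-9-]*)"   # pre-release identifier
NORMAL = rf"{NUMBER}\.{NUMBER}\.{NUMBER}"
IDS = rf"{ID}(?:\.{ID})*"
VERSION = re.compile(rf"{NORMAL}(?:-{IDS})?")  # NORMAL or VARIANT


def is_valid_version(text):
    """True if `text` matches the VERSION grammar sketched above."""
    return VERSION.fullmatch(text) is not None


for v in ["1.2.3", "1.2.3-alpha", "1.0.0-0.3.7", "1.0.0-x.7.z.92"]:
    print(v, is_valid_version(v))   # each prints True

print(is_valid_version("01.2.3"))   # False: leading zero
print(is_valid_version("1.2"))      # False: NORMAL needs three numbers
```

Accepting the shorter NUMBER and NUMBER "." NUMBER forms would amount to making the second and third components of NORMAL optional.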

wtlangford commented 10 years ago

@nicowilliams wrote:

in the jq language what appear to be JSON object/array value literals are really code for constructing them, since they needn't be constant literals

This intrigues me, as saying "ABC" in jq actually creates a function (named @lambda) that adds "" and "ABC" and returns it. Was there a reason for this behavior? I imagine it has something to do with backtracking and the creation of closures, but I cannot for the life of me figure out what.

nicowilliams commented 10 years ago

@wtlangford I have noticed this too. This has to do with the way string interpolation/formatting works. It should be possible to optimize this away in gen_binop() though. Might as well add some compiler constant folding functionality while I'm at it.