jqlang / jq

Command-line JSON processor
https://jqlang.github.io/jq/
Other
30.51k stars 1.58k forks source link

Library System #472

Closed wtlangford closed 10 years ago

wtlangford commented 10 years ago

Similar to https://github.com/stedolan/jq/issues/112#issuecomment-31241424, simply looking to hammer out a spec of some sort for documentation's sake.

As for the semantics of -l, they are up for discussion. It has been discussed that something like -l/path/to/file.jq should be treated as explicit paths and -lfile should be treated as something to be searched for in the search paths. I agree in this respect.

As for the order of the search paths, I think it should be something like this:

  1. Paths passed in with -L
  2. JQ_LIBRARY_PATH
  3. System locations (we don't currently use this, but might later.)

Library files are expected to not contain a main expression, only function definitions. If one is found, then an error should be thrown and jq should exit.

Thoughts?

nicowilliams commented 10 years ago

@wtlangford I agree with all of that.

Note that 8780bc0 fixed #138, so that distinguishing libraries from programs is now (EDIT: now, not not) reliable and easy.

wtlangford commented 10 years ago

Hopefully that should read "now" and not "not". But yes, that seems good. I'm currently calling through to jq_parse_library() for each one, as it seems to properly load things into a block and test that the code is, in fact, a library, before binding it.

Also worth considering is how we want -l and -L interacting with stackable options. We can't look to gcc for inspiration, as gcc doesn't stack its options. I don't think these should be allowed to stack.

More semantics for -l and -L: is the space required or optional? gcc makes it optional, at the cost of not allowing any other flags to start with l. If the space is optional, special handling will be required for these options. If we make it required, they plug nicely into our current system.

Lastly, for -lmath, should jq look for libmath.jq or math.jq?

pkoppstein commented 10 years ago

@wtlangford wrote:

As for the order of the search paths, I think it should be something like this:

  • Paths passed in with -L
  • JQ_LIBRARY_PATH
  • System locations (we don't currently use this, but might later.)

I would suggest that to avoid complexity (especially from a user's point of view), we avoid JQ_LIBRARY_PATH and system locations. Also, I'd like to continue plugging for a view of libraries and modules as resources that may be available in text files or via URLs, etc.

wtlangford commented 10 years ago

@pkoppstein The nice thing about JQ_LIBRARY_PATH and system locations is that if the user doesn't know anything about them, not much is likely to happen. JQ_LIBRARY_PATH won't exist and we just won't look there. System locations we don't even use right now, and in fact I'm not actually writing any code for them.

As far as resources via URLs, my only concern would be dependency creep, though URL-loaded resources could be handled by dlopen()ing cURL if necessary or just autotools configure macros enabling/disabling cURL at configure/build time.

Not sure what you mean by text files. Could you clarify?

wtlangford commented 10 years ago

Looking at it now, it is actually fairly simple to support both -lmath and -l math. I think I'll just do this. It is likely expected anyways, since we're borrowing gcc's syntax for this.

pkoppstein commented 10 years ago

@wtlangford asked:

Not sure what you mean by text files

I just meant ordinary files of the kind expected by the -f option. I was just anticipating that in future, "libraries" might be provided in other formats.

nicowilliams commented 10 years ago

@wtlangford Yeah, "now", not "not".

As for -lfoo vs -l foo, it could be either, even both. Whatever is easiest for you, since you're doing this work. However, I do like the familiarity of -lfoo. Obviously options with an argument can't stack, so that's not a concern.

Let's stay away from URIs for now but leave room for them: if the lib name/path starts with a '/' it's a path, else if there's a ':' in the lib name/path then it's a URI (and not supported for now), in all other cases it's a file searched for in the search path. In the module system each dependent will get to declare relative search paths -- I abhor *_PATH env vars, so I want a reasonable way to avoid them altogether, but it's OK to have *_PATH env vars given that. Since we have to build the library system first, then the module system, the env var nastiness gets in the door first; c'est la vie.

nicowilliams commented 10 years ago

@wtlangford Yes, in the future I'd also like to have shared objects / DLLs to support the addition of C-coded builtins. First things first though.

nicowilliams commented 10 years ago

Pure jq libraries should be in files whose names end in .jq.

wtlangford commented 10 years ago

@nicowilliams Fair warning- once the *_PATH variable exists, it is unpleasant trying to remove it. (People begin to depend on it.) If we really don't want it to exist, we should consider never having it.

@pkoppstein My focus on the library system developed from when I was working on adding the ability to accept multiple -f files. I was expecting that all of them (or all but the last) acted like libraries, which makes sense, since we can only have one main expression anyways. It seemed that allowing -f multiple times could give it different meanings in different contexts and that seemed...nasty. So, I pushed for the library system to handle multiple -f instead.

@nicowilliams Is there an alternative to pure jq libraries? Or is that opposed to the future c-coded builtins? Also, I'm currently using the following logic for identifying paths instead of names: if libname starts with ., /, or ~, it is a path. Perhaps I should also check to see if it contains /, in case someone uses relative paths...

pkoppstein commented 10 years ago

@nicowilliams wrote:

if the lib name/path starts with a '/' it's a path, else if there's a ':' in the lib name/path then it's a URI

Can we refine that a little to conform more closely with the URI specification? A URI starts with "scheme" followed immediately by ":", so I'd suggest:

If the string begins with "scheme" followed by ":" then it's a URI, otherwise it's a path or relative path.

My understanding is that "scheme" is defined as [a-zA-Z][a-zA-Z0-9+.-]*

once the *_PATH variable exists, it is unpleasant trying to remove it.

Yes, in my view that's yet another reason to postpone supporting JQ_LIBRARY_PATH until it is shown (proven?) to be essential.

wtlangford commented 10 years ago

@nicowilliams @pkoppstein I've pushed a branch with library support to my fork. http://github.com/wtlangford/jq library-system Take a look and play with it, tell me how you feel about it. It currently supports JQ_LIBRARY_PATH, which we may agree to just remove anyways. Of note from my testing is that the order of -larguments closely mimics the way it works with gcc. If A depends on B, you must write jq -lA -lB, as jq -lB -lA does not work.

nicowilliams commented 10 years ago

@pkoppstein Sure, but in practice what I proposed works: just document the scheme thing and if anyone uses -lfoo/bar:baz they'll get what they deserve some day.

@wtlangford My objection to *_PATH is mostly that programs and libraries should declare the locations of their dependencies in a way that is relative to: the path of the program/library, to directories in the built-in search path / system locations. Given that I'm OK with having *_PATH for special cases.

The problem is that without a way to build relocatable, er, objects, people resort to writing wrapper scripts with enormous search paths stored in env vars -- that's not acceptable, and the way to avoid it is to have a way to do what ELF does with @ORIGIN in rpaths. But I'm willing to go one step at a time and hold off shipping 1.5 until we have both, the library system and the module system.

EDIT: Close italics.

nicowilliams commented 10 years ago

@wtlangford Keep JQ_LIBRARY_PATH for now. Let's make sure that we have syntax for declaring rpaths later, and better: imports, exports, and namespace prefixes.

(Speaking of which, using '::' to separate module name from function name will be tricky, since ':' is parsed as a token by the lexer. Sure, we could make it work, it'll just be annoying :(

pkoppstein commented 10 years ago

@wtlangford wrote:

I've pushed a branch with library support ...

Yay! My first little test worked. On a Mac. This obsoletes the "multiple -f" enhancement request, right?

nicowilliams commented 10 years ago

@pkoppstein Yes, that one goes away.

wtlangford commented 10 years ago

I'm about to push some edge-case handling. Also, glad to hear it worked on a Mac. Sometimes stat() isn't as portable as it should be.

wtlangford commented 10 years ago

Just pushed f57a14087f9adf7279a34fc5cba96a09376062a5, which has some fixes for when $HOME isn't set, among other things.

nicowilliams commented 10 years ago

The link to your branch (to make it easy to navigate to): https://github.com/wtlangford/jq/commits/library-system

nicowilliams commented 10 years ago

@wtlangford BTW, when this is done I think I'll add support for builtin libraries, so that a raft of contributed utils that maybe shouldn't be in builtin.c today can be added to a -lutil library. With builtin libraries we still get a standalone jq program.

nicowilliams commented 10 years ago

@wtlangford FYI there's a bug in master regarding ~/.jq:

$ cat > ~/.jq <<EOF
def f: "foo";
def g: "bar";
def fg: foo+bar;
EOF
$ ./jq -n fg
jq: locfile.c:37: locfile_get_line: Assertion `pos < l->length' failed.
$

This also affects your branch, though it's not your fault. I've a feeling that we need to refactor the locfile bits a bit so that the blocks reference the correct sources and we free the locfiles at the end.

nicowilliams commented 10 years ago

Ah.., we assert while trying to complain that f/0 is unknown. But f is defined in the same file. Right, but that's not how builtin.c handles each jq-coded builtin -- builtins_bind() compiles each builtin separately, rather than the whole library of them at once. So there's several problems here...

nicowilliams commented 10 years ago

I think the thing to do is to take the blocks output by jq_parse_library() and merge them into the blocks output by jq_parse(), and we need to extend the locfile infrastructure so that a single object has all the source and location information for all the libraries and the main program. Then we cna stop parsing builtins one-by-one.

wtlangford commented 10 years ago

Perhaps not unsurprisingly, in my branch:

#cat > ~/.jq <<EOF_
def f: "foo";
def g: "bar";
def fg: f+g;
EOF_
$ ./jq -l~/.jq -n fg

works fine. This makes sense, given that the second time it parses fg/0 it actually finds f/0 and g/0, since they showed up from the first time ~/.jq was parsed.

But I agree that this needs to be reworked. If I had a better understanding of the compiler and friends, I might swing at it at bit.

As for the branch itself, do we see any issues? If not, I'll update the docs. Also, a note about naming conventions for jq libs. Should -lutil look for util.jq or libutil.jq?

pkoppstein commented 10 years ago

@wtlangford asked:

Should -lutil look for util.jq or libutil.jq?

These are (at least for now) just ordinary text files, so I think it would be confusing in several respects to adopt the "lib" convention, which really belongs to another universe, and maybe even another century.

In fact, I was thinking that before we go too far down this well-trodden path, it would be good to see how all this fits in with what I believe should be the primary mechanism for using jq libraries -- the "import" or "require" directive (in the manner of ruby or python). It would, at the minimum, allow for a chain of dependencies to be resolved automatically and efficiently. If "require" were properly supported, would there be any compelling need for the "-l" command-line option as currently envisioned?

nicowilliams commented 10 years ago

@wtlangford

That only works by accident because you're loading ~/.jq twice! Try ./jq -n fg with the same ~/.jq and watch it die.

wtlangford commented 10 years ago

@nicowilliams right. That was actually my point. The second time it parses ~/.jq, it looks at the previous parsing to find the things it wouldn't find the first time (since it was in the same locfile). I was simply sharing an anecdote, there.

nicowilliams commented 10 years ago

@wtlangford Sorry, I hadn't fully read your comment (shame on me).

Your branch looks fine to me now.

As for the bug here... I think there's two things to do:

nicowilliams commented 10 years ago

Ah, yes, definitely use block_bind(), because block_bind_referenced() frees the binder as it goes, but we may need the same library for binding multiple libraries.

Aha, later when we add a way to express dependencies we'll need to find the dependency there, parse it if we haven't already, and bind it to the dependent, and we may need to do this many times, so we might as well keep all the loaded libraries' parsed forms around for binding to each other and to the main program.

Of course, that still doesn't explain why the fg in the library didn't bind to the f...

nicowilliams commented 10 years ago

I think i just figured it out, but won't get to it until tomorrow. We should apply block_bind_referenced() to link libraries to main programs, but we need to bind libraries to themselves too, which that function can't quite do. What that function does is throw away the blocks for any unreferenced defs, see...

nicowilliams commented 10 years ago

@wtlangford FYI, I fixed that problem with 6b6e3f4 (see #479).

nicowilliams commented 10 years ago

@wtlangford FYI, I've got a commit styling yours a bit in https://github.com/nicowilliams/jq/tree/libsys .

wtlangford commented 10 years ago

@nicowilliams I like it. Good catch on the Windows path separator. I hadn't yet tested it in a windows environment.

wtlangford commented 10 years ago

Shall I merge your style commit into my branch and do a bit more testing now that #479 is fixed?

wtlangford commented 10 years ago

Rather, after doing those things, do we feel this is ready for a PR?

nicowilliams commented 10 years ago

Merge my commit, rebase and squash, play with it, and send me a PR.

If you like you could do two more things: a) update the docs, b) add a simple test to tests/run (just like I just did for #479). If you want to leave (b) for me, that's fine.

nicowilliams commented 10 years ago

@wtlangford Question: supposing we could quickly turn around a module system, would we still want the -l option?

wtlangford commented 10 years ago

I was actually considering that, myself. The module system -does- remove the need for a -l option. I'm fine with just quickly turning around the module system.

nicowilliams commented 10 years ago

@wtlangford Yeah, i agree. I can push a half-baked start at modules to a branch.

On Sunday, July 13, 2014, William Langford notifications@github.com wrote:

I was actually considering that, myself. The module system -does- remove the need for a -l option. I'm fine with just quickly turning around the module system.

— Reply to this email directly or view it on GitHub https://github.com/stedolan/jq/issues/472#issuecomment-48858014.

wtlangford commented 10 years ago

@nicowilliams Up to you. If you just want to push some changes to the parser to allow the import statements, I should be able to handle the rest.

nicowilliams commented 10 years ago

@wtlangford See the libsys branch of https://github.com/nicowilliams/jq i.e., https://github.com/nicowilliams/jq/tree/libsys .

It's missing: gen_import() and the code to make it happen in jq_parse().

nicowilliams commented 10 years ago

@wtlangford Did a bit more work on it; see my libsys branch. Nothing yet running, just code. But I think it's getting closer. parser.y in particular is in shape.

Note: we need to pass argv[0]'s origin to the jq compiler, so we can resolve $ORIGIN references in the main program. I'm not sure what's the best way. The jq_compile() functions are getting to have a lot of arguments. I think I'd rather have a setter function to set the origin (and JQ_LIBRARY_PATH, and ...) on a `jq_state `. Extensibility this way is nicer IMO.

wtlangford commented 10 years ago

@nicowilliams I agree about adding that information to the jq_state. More than 6 arguments is too many, and 6 is already a lot. Interestingly, I've been working on this without really looking at what you have and our branches are developing similarly. I've written a gen_import and started on processing the dependencies. I'll get something together and push a branch for you to see as well.

wtlangford commented 10 years ago

On that note, do we want a finite list of valid attributes to set for the jq_state or do we want to make it generic and back it with a jv_object or something?

nicowilliams commented 10 years ago

@wtlangford I'll back-off then, let you make progress. I got started on fixing bison warnings and that led me into this.

As to attributes on jq_state, yeah, a jv_object for all attributes whose values can be represented as jvs sounds like an excellent idea.

Note BTW that at the lower level we need to pass explicit origins around: the origin might be a library's rather than the main program's.

wtlangford commented 10 years ago

I've got some ideas bouncing around for how to handle origins. On the topic of search paths, I'm thinking we leave -L in place just to avoid something like

import lib1 search "/path/to/jq/libs";
import lib2 search "/path/to/jq/libs";
import lib3 search "/path/to/jq/libs";
...

All that repetition hurts to look at. Also if things get moved around, that's a lot of editing. Search order'd be something like

  1. specified path in import statement
  2. -L paths
  3. LD_LIBRARY_PATH Thoughts?
nicowilliams commented 10 years ago

I agree, most import ... search ... statements should only ever specify $ORIGIN/-relative paths; all other libraries should be found relative to the jq executable's builtin path or $JQ_LIBRARY_PATH or -L. We need at least one of -L or $JQ_LIBRARY_PATH (did you really mean LD, or was that a typo?); both might be OK, but since the env var can always be overridden with each invocation, I think then that -L should add to $JQ_LIBRARY_PATH.

Finally, the builtin path in the jq executable should be $ORIGIN/-relative -- with $ORIGIN being the jq executable's dirname(argv[0]).

EDIT: s/type/typo/

nicowilliams commented 10 years ago

Hopefully by having $ORIGIN/ support from the get-go and a built-in path with $ORIGIN/ (not just, or even instead of --libdir) we can avoid the need for anyone to use $JQ_LIBRARY_PATH. That's the idea anyways. That which ELF folks arrived at after many years, we can avoid learning the hard way.

nicowilliams commented 10 years ago

One more possibility: remove both, -L and $JQ_LIBRARY_PATH and add them only once we know that $ORIGIN/-relativity is not enough :)

wtlangford commented 10 years ago

I did mean JQ, LD was from muscle-memory (which, I guess, is kind of the point). I think I'd like to have one of -L and JQ_LIBRARY_PATH there, just for simplicity's sake. Especially if we're going to start bundling any libs with jq to keep them out of the builtins, we'll need a system path and a way to override that for non-standard installs. (./configure --prefix= I'm looking at you.)