Roots shouldn't be absolute identifiers

dsheets commented 9 years ago

Currently, DocOckResolve.build_resolver assumes that the polymorphic root type 'a is an absolute identifier.

dsheets/codoc@31cc502e6d6f678705b5f949ed102b9432e19c38 worked around this assumption which manifested as (transitive) roots from dependencies being re-used in the dependent without a second lookup issued. If the value of a root is dependent on the module doing resolution (e.g. when roots contain relative locations), the assumption breaks down and the incorrect root is resolved.

Either indexing resolved roots by both resolving module and module being resolved or changing the signature of build_resolver to

val build_resolver: ('a Unit.t -> string -> ('a * 'a Unit.t) option) -> 'a resolver

should work. I favor changing the signature as I don't see any benefit to having the doc-ock-lib resolver maintain root cache state on my behalf. The root value in the returned 'a Unit.t should not be used as the resolved root because this root value will be relative to the resolved unit and not the resolving unit.

What do you think? Do roots have to be absolute?

dsheets commented 9 years ago

I should note that this design has a risk of transitivity violation but that can be overcome by comparison of 'a Unit.t values (which, in many applications, will be memory equivalent). resolve could then return resolver discrepancies as well as a fully resolved unit.

dsheets commented 9 years ago

I think this is the most critical issue currently (superseding #35) because it blocks a phase-separated command-line interface (dsheets/codoc#29) (requires lots of re-reading and indexing to generate links for HTML from a resolved XML file).

I am willing to contribute a patch to fix this, if you'd like (and then you can keep working on -pack support).

lpw25 commented 9 years ago

I think that roots need to be in some way absolute because they are absolute in OCaml. Resolving is equivalent to linking and when linking modules you need to have a unique identity for each module that is agreed upon by all the modules.

In particular, I don't want to force people to use a separate resolver for each module being resolved. The resolvers are designed to be used across a collection of modules, and this requires the results of fetch to be consistent. Resolving is an expensive process, and without memoization we are resolving at least N^2 times.

I originally allowed the equality on roots to be given by the user, but it seemed to add complexity for no real gain. If we put back support for that, then you should be able to use roots with additional relative information correctly.

(As a side-note, in case this was the source of the problem you were having, the correct value of 'a Unit.t to the resolving function is the unit in whose source file the string is found, so it is not always the source file we happen to currently be resolving. It is also not always 100% correct at the moment, but I will fix that at some point.)

lpw25 commented 9 years ago

val build_resolver: ('a Unit.t -> string -> ('a * 'a Unit.t) option) -> 'a resolver

Note that lookup and fetch are split because lookup is cheap but fetch probably requires reading an XML file. There are times when we only need to know what the unit is called without needing to actually fetch its contents -- so this can improve performance.

lpw25 commented 9 years ago

I've pushed support for custom equality and hash on roots during resolving. It is hard to test it properly until you start using it, so there may be some bugs.

Does that resolve this issue to your satisfaction? It should allow you to keep your relative information, like providence, in the root whilst still supporting the memoization used by the resolver.

dsheets commented 9 years ago

I don't think custom equality and hash solves the issue because roots from transitive dependents will still be reused for their dependencies which have a different location.

Regarding your reiteration of the issue, I have a number of clarifications:

I don't understand "Resolving is an expensive process, and without memoization we are resolving at least N^2 times." Do you mean fetching? Lookup? Which is expensive and where does the N^2 come from? Every user * every dependent?
How can I use differing roots using equality control? As far as I can tell, equality control lets me describe when two roots point to the same unit but not when two roots differ by their identity but not their referent.
I don't understand why the argument to the resolving function is the location of the identifier rather than the location of the originally requesting module. When would I use this information? If the root being resolved will be used in the originally requesting module, I'd like to know this information to produce the correct root for that context.
Both lookup and fetch are cheap for me because I cache everything anyway. See https://github.com/dsheets/codoc/blob/master/cli/codocCliDoc.ml#L46. I do this because the root itself does not contain enough information to load the unit. The root does not contain enough information to load the unit because the root cannot be relative and I don't want to produce location-dependent (absolutely referenced) XML files.
The link phase is the only phase when I actually need to have global knowledge. After the link phase, all of the information discovered during linking should be extractable from the XML files. Unfortunately, the roots I'm currently using are useless for loading referenced files. If roots must be everywhere identical (why? I can supply a comparator), the only solution for storing linking information in them is to use some sort of absolute identifier. If I use a key, I must then later rebuild the index. If I use a path, it must be absolute. If the path is absolute, these files are not portable or useful on the web.

Keeping the fetch/lookup separation seems OK to me. I don't understand why doc-ock-lib currently maintains a cache. I'd like to receive a lookup with the module-having-its-references-resolved as an argument for every different module. If I get a lookup for every reference, that is OK. The root I return will be unique for that module and name-being-resolved. Then, I'd like to receive a fetch for a particular module-having-its-references-resolved as an argument and the root being resolved. If I get a fetch for every reference, that is OK. I will then return a unit that corresponds to the requested module.

Does the issue make sense? Have I missed something? I'd like the values stored in roots to be useful for later readers who don't have global knowledge.

lpw25 commented 9 years ago

I don't understand "Resolving is an expensive process, and without memoization we are resolving at least N^2 times." Do you mean fetching? Lookup? Which is expensive and where does the N^2 come from? Every user * every dependent?

Sorry, I was very imprecise there. The expensive process is turning an 'a Unit.t into something on which we can perform lookup. For this reason it is done lazily and cached. If it is not cached between resolving different modules, then rather than convert each module at most once, we convert it for every module that refers to it.

How can I use differing roots using equality control? As far as I can tell, equality control lets me describe when two roots point to the same unit but not when two roots differ by their identity but not their referent.

As I said, we need to have an absolute name for every module when doing resolution. By using a custom equality operation you can have some non-absolute information (e.g. providence) also stored in the root -- it would just be ignored when checking for equality.

Essentially, each module needs a unique name, without custom equality the root must actually be the unique name, with custom equality the root only needs to contain the unique name.

I don't understand why the argument to the resolving function is the location of the identifier rather than the location of the originally requesting module. When would I use this information? If the root being resolved will be used in the originally requesting module, I'd like to know this information to produce the correct root for that context.

Because that is the context in which the name should be looked up. If I write Foo in "bar.ml" it is in the context of "bar.ml" that the name Foo should be considered -- not in the context of some other module that happens to refer to Bar. When a call to lookup is made, the resolver is asking "what module did X refer to in the source file of Y?". This allows you to use digests and heuristics to correctly work out which module was being referred to.

I think the key point (which I should have mentioned earlier, but I hadn't really appreciated its importance) is that the 'a Unit.t returned by resolution will only include roots which came from its source code. Whilst resolution often needs to resolve names from the source of other files, this is only for "typechecking" things, it is not for inclusion in the output. So lookup can safely always return the root which would be correct within the module that the string comes from, because it will only be included in the output if that is the same as the module that is being resolved.

There may be some cases around substitution where this breaks down (because of the parts of that which are broken and need fixing), but anywhere where it does break down should be considered a bug.

The roots from different modules do need to be mutually comparable for equality and hashing, which until now forced them to be structurally equivalent, but now they just need to be comparable.

lpw25 commented 9 years ago

I don't understand why doc-ock-lib currently maintains a cache.

In case this wasn't clear from the above answer, it is because only doc-ock-lib has access to the data that should be cached. We do not cache 'a Unit.ts but the results of converting an 'a Unit.t into something suitable for performing lookup.

dsheets commented 9 years ago

Ok. That clears up some confusion. I understand now the state maintained is more like an indexed symbol table.

It is very good to know that roots won't get included in non-contextual modules. This wasn't clear at all.

I understand now that relaxed equivalence is really sufficient to have relative roots. Thanks for your help!

ocaml-doc / doc-ock

Roots shouldn't be absolute identifiers #44