haskell / haddock

Haskell Documentation Tool
www.haskell.org/haddock/
BSD 2-Clause "Simplified" License
361 stars 241 forks source link

Complexity annotations #91

Open ghc-mirror opened 10 years ago

ghc-mirror commented 10 years ago

Original reporter: pumpkingod@

It would be nice to be able to annotate certain functions with their time (maybe even space, if appropriate) complexity, and to have it appear in a distinctive place in the generated HTML. Currently some libraries write it at the beginning of the textual documentation of each function, which works, but it would be nice to have a more structured approach to this.

theobat commented 5 years ago

How hard would it be to do this ?

Lysxia commented 5 years ago

The hard part is fleshing out the design and explaining what it would be useful for. It seems documenting a convention to be adopted by the community about formatting items as common as computational complexity might achieve a satisfactory result with little to no change to haddock the tool itself. But for whom is this really worth standardizing?

harpocrates commented 5 years ago

Actually documenting a convention sounds pretty good. Then we could make sure that boot libs at least are all consistent (I think that containers uses latex while bytestring uses italics).

But I agree: additional markup is probably not worth it.

Sent from my iPhone

On Jan 31, 2019, at 8:33 AM, Xia Li-yao notifications@github.com wrote:

The hard part is fleshing out the design and explaining what it would be useful for. It seems documenting a convention to be adopted by the community about formatting items as common as computational complexity might achieve a satisfactory result with little to no change to haddock the tool itself. But for whom is this really worth standardizing?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

theobat commented 5 years ago

I suppose we could make interesting things with haskell ide engine if time/space complexity annotations were standardized (such as parsing them and combining them to generate some of them automatically for instance, but this is probably harder than I think it is). I guess you're right that if people don't write time/space complexity annotations in their comments now, they won't magically do it with a standard structure... Still, having a way to "use" it in haskell tools should be interesting.

Lysxia commented 5 years ago

Actually I think if it's in the documentation, more people will start doing it. Often if something potentially good is not done, it's just that people don't know they can do it in the first place.

theobat commented 5 years ago

Here's my proposal then, at this point it's just a very sketchy draft:

General idea

The proposal is to standardize a way to document asmptotic time/space complexity and "exact" time/space complexity. It is targeted at both people and machines, in order to get nice haskell-id-engine support, and nice and readable haddock documentation. I don't know of any existent work in trying to formalize this kind of syntax, thus any pointer to prior attempts would be appreciated.

The existent

Some haskell packages have informal asymptotic time complexity in their haddock comments, which is nice, for instance: http://hackage.haskell.org/package/containers-0.6.0.1/docs/Data-Set.html http://hackage.haskell.org/package/containers-0.6.0.1/docs/Data-Map-Strict.html

We can notice that almost all of the complexities are implicitely referring to the length of the Set or Map argument by the letter n. Sometimes, when two structures are used, the letter m is also involved.

What's missing then ?

The complexity annotation's parameters are implicit. What's this n there ? why it's not something else ? What's the relation between n and the function arguments. The asymptotic complexity is important, but constant factors are sometimes even more important, what about the possibility to document tilde notation (~) or even the entire function ? The other thing is, in case we have differenty possibilities for the complexity (different computation path) then we have to reason about min/max and average complexity, which is very confusing with an informal notation.

What do you propose ?

Syntax:

Every Time Complexity annotation begins with TC (respectively MC for memory complexity). The line identifier belongs to the following list: (TC|TCavg|TCmin|TCmax) the TC expression means TC = TCavg = TCmin = TCmax.

The letter arg1 in the expression n=size arg1 is a reference to the first argument of the function anyNlogNFunction of type Set a, and thus we can reference any argument of the function in this list (arg2 is the second argument etc). The size function is the actual haskell size function returning the size of a.

Essentially, the TCmin line is a comma separated value line beginning with the actual complexity expression and any number of variable definition. The actual complexity expression may use any mathematical expression (a calculator syntax, but it should be specified too) with any of the variables declared in the following "columns". The variable declaration part consists of a single letter name (anything in the latin alphabet, lower or upper case) on the left side of the equal and any haskell expression returning an Int on the right part of the equal sign. The right part may also refer to the implicitely defined argN, which are the arguments of the underlying function.

Examples:

-- | Old syntax:
-- | /O(n)/. Convert the set to an ascending list of elements. Subject to list fusion.
-- | New syntax:
-- | Convert the set to an ascending list of elements. Subject to list fusion.
-- | TC O(n), n=size arg1
toAscList  ::  Set  a  -> [a]
toAscList = foldr (:)  []

Why bother ?

Because having a sense of what you pay for when you chose to go this route in your algorithm instead of this other should be accessible, we'd make better decisions, and the reasoning would be automatically documenteed to some extent. Moreover, having a common tightly cohesive language to achieve this would be higly beneficial to anyone who wants to do it properly. Base libraries are determinant here, and widely used libraries too. Finally, thinking about the complexity of your algorithm is fun if it's accessible (read: if you have access to the complexity of all the functions you use), it'd be like doing mathematics knowing all the definitions and theorems thus removing all the indirection.

Why it's better ?

Because it's much more precise and you can now parse the complexity expressions and combine them, for instance:

-- | Convert the set to an ascending list of elements. Subject to list fusion.
-- | TC ~(n), n=size arg1
toAscList  ::  Set  a  -> [a]

-- | Convert the set to an ascending list of elements. Subject to list fusion.
-- | TC ~(n * m), n=size arg2, m=TC arg1
map  ::  (a -> b) -> [a] -> [a]

-- Then at some point we could infer that:
test :: Set  a  -> [a]
test = (map identity) . toAscList
-- | has a complexity of 
-- | TC ~(n + (n * 1)), n=size arg1
-- | TC ~(2n), n=size arg1
-- | The problem here is that we use the fact that size of the set is equal to the size of the list... And it's not logically usable at this point. 

Of course this is way harder than this example looks like, especially if we try on more complicated examples. I'm not sure this part is realistic at this point, it's a few ideas just thrown at the problem, it seems like refinement types could be very useful to get a better perspective at the automatic complexity inference problem.

hvr commented 5 years ago

@theobat what I'm missing from your suggestion is to take into account mathtex; see e.g. https://hackage.haskell.org/package/text-containers-0.1.0.0/docs/Data-TextSet-Unboxed.html#v:lookupGE which makes use of this to typeset the complexity annotations with the proper TeX fonts for big-O-notations as well as the formulas, and I'm not willing to give up on this ;-)

PS: or see also the space-complexity described at https://hackage.haskell.org/package/text-containers-0.1.0.0/docs/Data-TextSet-Unboxed.html#t:TextSet for a less trivial formula markup

PS2: you should also follow the convention started by @since to require the annotations to be a separate paragraph; in your examples the TC annotations where all in the same paragraphs as the prose text

theobat commented 5 years ago

Right, it makes perfect sense, I just overlooked it, I'm going to take it into account and modify the proposal.

PS: this seems harder than I thought though, the latex syntax is great for humans but way harder to "evaluate" automatically. I'll look into what can be done in this regard.

ag-eitilt commented 5 years ago

This is dipping into the tooling side, but the "any haskell expression returning an Int" needs a way of referring to other complexity annotations, for versioning: if one of your dependencies changes its backend so that some complexity goes from O(n) to O(log n), say, it's a lot better to be able to simply say "my complexity is O(m^2) no matter what version of m is being used". That's going to be especially important with class parameters -- or Backpack -- meaning that your complexity might differ within the same program just because of the type of an argument. How that's actually going to be displayed is another question, but especially for Haddock I'd say just go with printing the straight list of definition equations, linked; other consumers can deal with version ranges and cross-references if they want to.

The other thing I'd recommend, when you're writing the @-markup version of the proposal, would be separating what is being analyzed from which case is indicated. Namely, rather than four different keywords, I'd only reserve @time and @space (or even @complexity time or similar to allow non-standard measurements) and have min/max (average being default) written separately just before the O(...): @time O(n), max O(n^2). It might be nice to incorporate that into the annotation itself (e.g. max O(n^2) -> o(n^2)), but that runs into issues of writing theta and omega in an ASCII-compatible manner.

Finally, with this being Haskell, it might be better to go with a different separator between the complexity annotation(s) and the variable definition(s) rather than reusing the same list separator for different types. @time (n = size arg0, m = O(lookup arg1)) => O(nm), min O(n) maybe?

theobat commented 5 years ago

hey @ag-eitilt thanks for the feedback ! You have very good points. This proposal is really wishful thinking at this point though, using the mathtex language as a mean to convey information between functions is way harder than I thought (especially the "simplification process", it boils down to somehting like Mapple/Matlab), besides there is some philosophical level blur in this proposal, as you said, "This is dipping into the tooling side" and I'm not sure this is haddock's reponsibility anymore... I mean there could be/should be a cross language syntax for this, maybe in the Language server protocol ?