numbers / numbers.js

Advanced Mathematics Library for Node.js and JavaScript
Apache License 2.0

API #80

Closed ethanresnick closed 11 years ago

ethanresnick commented 11 years ago

I've been thinking a lot lately about what the API should look like, which seems important to nail down before we add too much more code (things'll just get harder to change and the API's key for marketing/usability). Let's use this issue to try to figure out the overarching principles.

Right now we have two conflicting modes of operation going on:

  1. Helper functions organized by mathematical domain that operate on and return native data types
  2. Custom objects that usually return a modified instance of themselves for chaining

If we're moving in the object direction, I think we need to figure out how we're going to keep that feeling light and usable—cause it can get clunky really fast.

A couple principles came up from the last discussion:

  1. Having factory methods on numbers that make it easy to get from built-ins to our custom objects at the start of the chain. I.e.: numbers.createMatrix(arr).matrixMethod().

    Instead of the createX naming, we could also do newX or makeX, but, please, let's not end up with numbers.matrix(arr); there should be something in the method name that implies that an object is being created.

  2. All object methods that take another custom object as an argument (e.g. Vector getting the distance between another Vector) should also take a built-in representation of that other object wherever possible. @milroc suggested this and I totally agree.

    (One thing to consider here is whether passing in a native structure rather than one of our objects should ever cause the return type to also be a built-in rather than one of our custom objects. Though this may be a moot point if we have our objects extend from the built-ins...see below.)

  3. We should also shorten some of our method names in general (along the lines of what @revivek did in https://github.com/sjkaliski/numbers.js/pull/57)

Combining the first two, then, calculating the distance between two vectors would look like: numbers.createVector([0,1,3,4]).distanceFrom([3,5,6,7]),

as opposed to the current:

(new numbers.linear.Vector([0,1,3,4])).distanceFrom(new numbers.linear.Vector([3,5,6,7]));
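A minimal sketch of how principles 1 and 2 might combine. The Vector constructor and its data property here are hypothetical stand-ins, not the library's actual implementation, and createVector is the factory name proposed above. distanceFrom accepts either a Vector or a plain array:

```javascript
// Hypothetical stand-in for the library's Vector type (illustration only).
function Vector(data) {
  this.data = data.slice();
}

// Accepts another Vector or a plain array of numbers.
Vector.prototype.distanceFrom = function (other) {
  var otherData = other instanceof Vector ? other.data : other;
  if (otherData.length !== this.data.length) {
    throw new Error("Vectors must have the same length");
  }
  var sumOfSquares = 0;
  for (var i = 0; i < this.data.length; i++) {
    var diff = this.data[i] - otherData[i];
    sumOfSquares += diff * diff;
  }
  return Math.sqrt(sumOfSquares);
};

// Factory that hides the new operator at the start of a chain.
var numbers = {
  createVector: function (arr) {
    return new Vector(arr);
  }
};

var viaArray = numbers.createVector([0, 1, 3, 4]).distanceFrom([3, 5, 6, 7]);
var viaVector = numbers.createVector([0, 1, 3, 4]).distanceFrom(new Vector([3, 5, 6, 7]));
```

Both calls return the same number, so callers who already hold a plain array never need to wrap it first.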

Then I've also been thinking about a couple other principles.

  1. Maybe our data types could extend the built-in objects. So Matrix would extend Array, for instance, and you could do things like numbers.createMatrix(arr).transpose()[0][0], because the Matrix returned by .transpose() would also be an Array.* This could be super convenient. Most significantly, it would let our data structures interact with other libraries that expect native arrays.

    The downside is that it requires giving up some encapsulation. For instance, the idea of caching the length property in rowCount goes away because the data can now be updated without the object knowing about it. Similarly, if we had a Set object you could imagine caching the mean/average in an instance property to speed up a lot of calculations, but allowing direct access to this data would make it impossible to automatically know when to invalidate this cache. (Note that in both of these examples the data could really have been updated directly anyway, but at least in the non-array approach you'd have had to go through .data, which could easily be documented as internal or even renamed .__data. )

    One option would be to say in the docs that the underlying Array methods should only be used if they don't transform the underlying data. That would still leave some utility in the native Array interface (e.g. direct access to an element like in the transpose example, but also access to a row or set of rows with .slice, the ability to loop over rows in implementations that support .forEach, etc).

    Another option would be to create a naming convention for a method that recalculates any internal properties, i.e. we could say "go crazy with the native array interface, setting things, deleting them, adding them, whatever, and then just call .update() or whatever to reset the key stuff in the internal state". Having to call an update() seems a little much though.

    Btw, another really cool application of extending the built-ins: applying it to functions too. So we could have something like:

    function linearPointDiff(a, b) {
     return this.m;
    }
    
    numbers.createLinearFn = function(m, b) {
     //if this mode of extension doesn't make sense,
     //see the footnote about extending arrays at the bottom
     var func = function(x)  { return m*x + b; }
     func.isContinuous = true;
     func.m = m;
     func.b  = b;
     func.pointDiff = linearPointDiff;
     return func;
    }
    
    var line = numbers.createLinearFn(3, 0);
    line(4); //fuck, we can evaluate the function directly!
    line.pointDiff(5); //and it has methods

  2. If we decide not to have our objects extend the built-in data types, then we should create a consistently-named method on every object for getting from that object to a representation of it using a built-in structure. Maybe something like toBuiltIn. That way the user can call this method at the end of the chain before handing their data off to the next part of their application.
  3. For convenience, I think we should allow subclass methods to be called on super-class objects where applicable. For instance, if someone has created a Matrix object and tries to call determinant, that should transparently forward the call to the determinant on SquareMatrix (if the Matrix is square).

    I don't like the idea of the superclass definition knowing about its subclasses—the coupling seems way too tight—but we should be able to avoid that if we just keep all the code for adding the forwarding (which would modify Superclass.prototype directly) in a separate part of the codebase. That way, there'd still be one location with the primary superclass definition and that chunk of code could easily be transplanted into another project and operate without any dependencies on the subclass.
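One way the forwarding idea could be sketched, assuming stand-in Matrix and SquareMatrix types (not the library's actual classes) and a hypothetical addForwarding helper that lives apart from both definitions:

```javascript
// Stand-in types for illustration only; not the library's actual classes.
function Matrix(rows) {
  this.rows = rows;
}

Matrix.prototype.isSquare = function () {
  var n = this.rows.length;
  return this.rows.every(function (row) { return row.length === n; });
};

function SquareMatrix(rows) {
  Matrix.call(this, rows);
}
SquareMatrix.prototype = Object.create(Matrix.prototype);
SquareMatrix.prototype.constructor = SquareMatrix;

// Only the 2x2 case is handled, to keep the sketch short.
SquareMatrix.prototype.determinant = function () {
  if (this.rows.length !== 2) {
    throw new Error("This sketch only handles 2x2 matrices");
  }
  return this.rows[0][0] * this.rows[1][1] - this.rows[0][1] * this.rows[1][0];
};

// Forwarding code, kept separate from both definitions above.
// addForwarding and canForward are hypothetical names.
function addForwarding(Superclass, Subclass, methodName, canForward) {
  var implementation = Subclass.prototype[methodName];
  Superclass.prototype[methodName] = function () {
    if (!canForward(this)) {
      throw new Error(methodName + " is not defined for this object");
    }
    return implementation.apply(this, arguments);
  };
}

addForwarding(Matrix, SquareMatrix, "determinant", function (m) {
  return m.isSquare();
});

var det = new Matrix([[1, 2], [3, 4]]).determinant();
```

The Matrix definition never mentions SquareMatrix; removing the addForwarding call leaves Matrix fully usable on its own.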

Finally, there were two other things that were bothering me:

  1. If there are things that we don't want to put in objects, what are they and where should those go? As I mentioned in the other issue, one example of this might be the methods that operate on single numbers, because having to create a Number object to house those methods seems like overkill. In the other issue, I proposed putting these in a numbers.util "static class" that would function basically how the library does now. But is that the best option? What are the alternatives?
  2. As we think about restructuring in terms of objects, the objects that make the most sense to me seem to be those around mathematical constructs (Set, Sequence, Distribution, Function, etc), but this is very different from our current taxonomy which is based around mathematical fields (calculus, stats, linear algebra, primality, etc).

    Now, on one hand, I can see this switch actually resolving some ambiguity and confusion. For instance, why are min and max on basic whereas median and mode are on stats? In a restructuring, they'd all be united under Set.

    But I'm worried that having these mathematical constructs as the top-level organizational structures might make the library seem less accessible from the outside. Maybe that's just a documentation issue, though? I.e. we could tag each method with the mathematical fields it's relevant to, and then the docs would still be able to show all the stats methods or all the calc methods. The other option would be to use these structures internally but somehow expose an API structured like the current one...but I can't imagine how that would work. Does this seem like a big problem?

Overall thoughts? Additions?

Sorry for the length, but the API is arguably the most important design decision for the library's success, so it seemed worth a full discussion.

CCs: @sjkaliski, @davidbyrd11. Also @KartikTalwar, who's been contributing a lot and so might want to follow these developments.

* Extending Array in javascript is a mess, but it can be done workably by having an object constructor that just returns a native array with methods tacked onto it directly ("parasitic inheritance" in Crockford-ese). And this can even be performant if the constructor tacks on functions which are only created once in an outside scope and then simply referenced by the returned array's properties...somehow, this even seems to end up faster than standard prototypal inheritance (I guess because Chrome really optimizes Array construction). I made a test for this here.
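A minimal sketch of the pattern the footnote describes, assuming a createMatrix factory as proposed earlier in this comment (the implementation is illustrative, not the library's). The shared transpose function is created once in the outer scope and only referenced by each returned array:

```javascript
// Created once; every matrix just holds a reference to it.
function transpose() {
  var rowCount = this.length;
  var colCount = rowCount ? this[0].length : 0;
  var result = [];
  for (var c = 0; c < colCount; c++) {
    var row = [];
    for (var r = 0; r < rowCount; r++) {
      row.push(this[r][c]);
    }
    result.push(row);
  }
  return createMatrix(result);
}

// Returns a genuine native array with methods tacked on directly.
function createMatrix(rows) {
  var matrix = rows.map(function (row) { return row.slice(); });
  matrix.transpose = transpose;
  return matrix;
}

var transposed = createMatrix([[1, 2, 3], [4, 5, 6]]).transpose();
var element = transposed[0][1]; // direct native indexing on the result
```

One caveat of this approach: the tacked-on method is an enumerable own property, so a for...in loop over the array would visit "transpose" as well as the indices.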

zikes commented 11 years ago

:+1:

If this results in a significant code reorganization it might be prudent to also tackle the cross-platform/building issue at the same time. I've written an example barebones template on #49 that I think will work, but if not just let me know and I'll work on some alternatives.

sjkaliski commented 11 years ago

@ethanresnick thanks for getting this started. And I agree with the majority of your points. Here are my opinions.

The library was initially broken down by field because, from a usability standpoint, it seemed more straightforward. You would never say "oh, my app needs matrices and sets." You'd say "I need statistics and certain calculus methods."

With that, let's consider this restructure from two standpoints

  1. Internal API
  2. Developer API

Internal API

I agree with building around mathematical structures. The ones that appear to be most prevalent

These are some of the core building blocks of mathematics, and it makes sense for us to be doing so as well.

Internal methods (e.g. any and all methods that exist on or between a single structure) can remain in that file as well. For example Matrix.add(otherMatrix) is a method that stays within matrix. However, Function will operate across files. In that case, these need to be stored elsewhere...

Developer API

We're building for the non Math PhD developer. The concepts should be easily accessible and easy to grasp, and most important, super easy to use. @davidbyrd11 captured this well: "everyone should feel like they need numbers." In regards to marketing this, it's important this remains at the center of our development, especially if we want a large majority in use.

A developer should be able to do a few things

  1. Use these structures independently of any other functions.
  2. Easy access and interaction with common functions across a variety of mathematical fields.
  3. Use these with other libraries.

Additional Notes

KartikTalwar commented 11 years ago

Definitely agree with Ethan and Steve!

Same as above, organizing functions by their mathematical origins is definitely a better idea. This ought to reduce confusion when categorizing functions.

W.R.T non object functions, I think it would be best to call it numbers.core as a static class. I can see that most functions in this file will be very useful small helper functions and would most likely be javascript Math heavy. (not a great assumption but I think it still makes sense to call it core rather than util).

"But I'm worried that having these mathematical constructs as the top-level organizational structures might make the library seem less accessible from the outside"

I definitely agree with Ethan's comment. For this library's developers, it makes a lot of sense to organize it by mathematical origins, but to a user this might get confusing, assuming they don't really care about how it's done and just want to make stuff out of it. I mean, you don't wanna get to a stage where we have algebraic solvers but no one can find them because they are in rings.js. Looking at the /examples folder in the root, I propose that we keep how things are being done there and make example files according to mathematical fields. So, even if you categorize max() and min() as sets, they appear in /examples/basic.js and examples/statistics.js. So no matter how you organize primes in the backend, we should keep a /examples/primes.js so a user can just jump in and use the library.

On that same note, I think you guys should definitely enable the repo wiki so that even if we don't have 'proper' documentation, we can at least make a giant list of all the functions someone can call. We should get to having a documentation part soon too. Maybe find one of those automatic doc generators? We should also use that wiki to do a roadmap or at least standardize those object classes related design decisions.

And finally, as Steve pointed out, the purpose of this library is to be usable by an average person, not a Ph.D. This person may or may not have enough mathematical knowledge to interpret, appreciate or utilize everything this library offers. From their end, it should be like: "So I just copy paste this line from the examples and mash these together to get this done and in only one line, cool!" (I guess less exaggerated). If we stick to organizing functions by their origins, we should definitely consider aliasing a few things, either to make an average user's life easier or just to show off what this library offers. So as an example, having distribution.Gaussian() is nice, but to a user it might mean something more if it were called distribution.normal() or even distribution.bellCurve(). We should internally use the 'proper' name but offer common names for end-users to call. You don't want to scare someone off by referencing BoxMullerTransform() if you can write randomSample(). (sorry Miles).
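A small sketch of what such aliasing could look like. The names distribution.Gaussian, normal and bellCurve come from the paragraph above; the (x, mean, stdDev) signature and the density formula are assumptions made for illustration:

```javascript
var distribution = {};

// "Proper" name: the Gaussian probability density at x.
// The (x, mean, stdDev) signature is an assumption for this sketch.
distribution.Gaussian = function (x, mean, stdDev) {
  var z = (x - mean) / stdDev;
  return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2 * Math.PI));
};

// Friendlier aliases that reference the same function object.
distribution.normal = distribution.Gaussian;
distribution.bellCurve = distribution.Gaussian;

var peak = distribution.bellCurve(0, 0, 1); // density at the mean of a standard normal
```

Because the aliases are plain references, there is only one implementation to maintain and test.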

Our goal from a user's perspective should definitely be the 3 things Steve pointed out!

Those are just my thoughts and opinions. I did exaggerate most cases here but only to get across what I wanted to say. Feel free to correct me where I went wrong.

PS. Ethan, thanks for including me! :)

Edit: Just looked at JSDocs for this repo. I think that takes care of everything :)

milroc commented 11 years ago

Hey all, this issue is great, we should keep working on fine tuning this and set it for the version 1.0.0 milestone

Ethan

  1. I agree about having a method to showcase createX without having numbers.matrix(). I think this is where namespaces should come into play: numbers.create.Matrix(arr). I think this is much cleaner. Also note that the Matrix object should take data or r, c numbers as valid input. We could also take this a step forward and add a namespace in create for certain types of matrices: numbers.create.matrix.identity(3), numbers.create.matrix.square(data), and numbers.create.matrix.square(2). While this is written opposite of what we would say in English, I think it is cleaner.
  2. Something like the helper below might make it easier to create these functions and reduce code redundancy (pseudocode):
function _getValid(data, type) {
    if (data is valid && type is "matrix") return numbers.create.Matrix(data);
    //etc
}
  3. I like this way of representing objects; however, I think we should separate working with function objects into another issue, because realistically we might want to build a parsing solution or something else. Why?
    • So we don't fill up the library with classes
    • The library will be limited to the representations of functions (e.g: line.pow(n) n will be limited to the possible objects we have created)

Steve

I agree that Square Matrix shouldn't be a class, but we should add an _isValid(data, "square") function for developing functions that apply to square objects only. The _isValid function should live in a central (and also developer-only) location.
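A rough sketch of what a centralized check might look like. Only the _isValid(data, type) call shape comes from the discussion; the validators table and the specific rules for "matrix" and "square" are assumptions:

```javascript
// Hypothetical rule table; the thread does not define what "valid" means.
var validators = {
  // A non-empty array of equal-length row arrays.
  matrix: function (data) {
    return Array.isArray(data) && data.length > 0 &&
      data.every(function (row) {
        return Array.isArray(row) && row.length === data[0].length;
      });
  },
  // A matrix with as many rows as columns.
  square: function (data) {
    return validators.matrix(data) && data.length === data[0].length;
  }
};

// Single, developer-only entry point for all type checks.
function _isValid(data, type) {
  if (!Object.prototype.hasOwnProperty.call(validators, type)) {
    throw new Error("Unknown type: " + type);
  }
  return validators[type](data);
}

var squareOk = _isValid([[1, 0], [0, 1]], "square");
var raggedOk = _isValid([[1, 2], [3]], "matrix");
```

New structures would then only need to register one more entry in the table instead of repeating checks in every method.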

Kartik

numbers.core YES :D, I think I'm biased but I like core a lot better for this library.

I don't feel like the where of functions matters as much. If it does, we can create wrappers in locations that are more obvious to the developer, but we should leave the library's structure to be primarily based on the internal aspects of the library.

Also note that we currently have these comments above each function which help auto generate jsdocs, Steve is working out a solution of where this really should move to.

My Additions

I think we should try to figure out where our priorities are in the library. Namely how we want the developers to work with their data. Should we be flexible and allow for a lot of different programming paradigms? Should we be biased to performance? Should we be biased to usability?

Some prime examples of this are:

Kartik mentioned randomSample() as a function for a random normal variable, but realistically this would need to be split up into several functions: randomNormalSample(); randomLogNormalSample(); etc. I think it might be better for the API if this was reduced to: random.sample("normal"). This would improve the readability of the code and allow for a less bloated API. However doing this results in less readable and potentially less maintainable code for the internal aspects of the API.
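To make the trade-off concrete, here is a sketch of the single-entry-point version. random.sample("normal") is the call proposed above; the samplers table, the "logNormal" key, and the choice of a Box-Muller transform for the standard normal draw are assumptions for illustration:

```javascript
var random = {};

// One private sampler per distribution name (hypothetical table).
var samplers = {
  normal: function () {
    // Box-Muller transform; 1 - Math.random() lies in (0, 1], so log() stays finite.
    var u1 = 1 - Math.random();
    var u2 = Math.random();
    return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  },
  logNormal: function () {
    return Math.exp(samplers.normal());
  }
};

// Public API: a single function that dispatches on the distribution name.
random.sample = function (distributionName) {
  if (!Object.prototype.hasOwnProperty.call(samplers, distributionName)) {
    throw new Error("Unknown distribution: " + distributionName);
  }
  return samplers[distributionName]();
};

var draw = random.sample("normal");
```

The public surface stays at one method, while the string dispatch and the lookup table are the extra internal indirection mentioned above.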

Another prime example, I think, is the way we currently work with arrays, vs other alternatives.

For example, we could write this data manipulation with a map reduce methodology.

core.sum = function(a, b, i, arr) {
  if (typeof a != 'number' || typeof b != 'number') {
    throw new Error("Array must only contain numbers"); 
    // could also omit those values instead
  }
  return a + b;
};

core.flatten = function(a, b, i, arr) {
  return a.concat(b);
};

[0, 1, 2, 3, 4].reduce(core.sum);

numbers.create.Vector([0, 1, 2, 3, 4]).reduce(numbers.core.sum);
//since this will be a sub-class of arrays.

numbers.create.Matrix([[1, 0], [0, 1]]).reduce(numbers.core.flatten).reduce(numbers.core.sum);

This is not maximizing for performance, as in several browsers a for loop will be more performant, but it does help define the way in which we work with the data in statistics. I think that working in this way will help a lot of non-javascript developers get their feet wet with the library. It is also a way of writing javascript that a lot of developers have already grokked, so it wouldn't be that big of a jump from jQuery to this library.

It also makes for easier code maintainability (now we only need to check the individual values rather than if the object is an array).

If we write for performance, though, the difference might be negligible, and deciding which runtime we benchmark and optimize against (likely Node and V8) might also be important.

ethanresnick commented 11 years ago

Jumping back into this. Hope you all had nice holidays!

I agree with most of the above and will create an API Principles page on the wiki summarizing these issues.

Specific responses:

Math Fields vs. Data Structures Documentation

I think @KartikTalwar's idea of keeping the examples by mathematical field, in combination with offering a view into the docs by mathematical field, hits the right balance between usability and the internal structure.

Core/Util functions

I think there are really two types of code going on here.

The first is the collection of functions that operate on integers. These seem like they should be in a static class because having to instantiate an Integer object (and having our code return and check for it everywhere) around every number primitive is too heavy. I think we should put them in numbers.number (e.g. numbers.number.isPrime(3)).

The second collection of functions, which could be called core, are methods that are used by more than one structure (e.g. things like sum, square, etc.). I'd have numbers.core hold all these, so that every type has core as a dependency, but none of them will have to be dependent on one another. (Some of these methods should also exist on the objects, e.g. Set should have a sum too, but those methods can call the core methods internally.)

@milroc's reduce examples are very compelling. My thought as of now is that map/reduce should definitely be supported on each type for users who like that paradigm, but that the core methods shouldn't be created to be used this way. If they were, both users and contributors would have to understand how reduce works, and I don't see a sufficient reason to require that extra piece of knowledge.
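A sketch of that split, under the assumption that core.sum takes a whole array rather than reducer arguments, and that a hypothetical createSet factory returns an array-backed object whose own sum delegates to it:

```javascript
var core = {
  // Takes the whole array, so callers need not know how reduce works.
  sum: function (arr) {
    var total = 0;
    for (var i = 0; i < arr.length; i++) {
      if (typeof arr[i] !== "number") {
        throw new Error("Array must only contain numbers");
      }
      total += arr[i];
    }
    return total;
  }
};

// Shared method that simply calls the core function internally.
function setSum() {
  return core.sum(this);
}

// Hypothetical factory for an array-backed set of numbers.
function createSet(values) {
  var set = values.slice();
  set.sum = setSum;
  return set;
}

var data = createSet([0, 1, 2, 3, 4]);
var viaMethod = data.sum();
// Users who prefer map/reduce can still use it, since data is a native array.
var viaReduce = data.reduce(function (a, b) { return a + b; });
```

Both routes give the same total; the difference is only in which one the core API asks contributors to understand.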

Factory functions

I'm not sold on the numbers.create namespace. Just stylistically, having the extra dot in the middle bothers me; createMatrix is a command and the method name reflects that, whereas create.Matrix seems weird. (Maybe that's arbitrary, and sorry if so, but at least I'm not trying to rationalize it.) For the more complex cases, like numbers.create.matrix.square(2), I think that's just asking for people to confuse it with numbers.create.Matrix.square(2) and, more generally, I think it's ok if the create methods are flat: any structure that needs to go around the types (e.g. parent–child relationships) can be created in the namespaces where the type definitions themselves are stored. The create methods are just meant to hide the new operator, which looks really clunky or unintuitive with method chaining.

Types

It seems like we're going with extending the built-in types, which is a big win for interoperability. Thinking more about the performance costs of lost encapsulation though, I wonder if the answer long-term will be to have two modes: a default one in which the user can transform the data however they want but that's a little slower (because there can't be any internal caching) and a mode the user can opt in to in which they only transform the data through our API (or, if they must modify it directly, call update when they're done) but therefore get faster performance. The faster mode could be opted into at construction and changed at any point in the object's lifetime. I put together the very start of something like that here so you can see what I'm talking about, but the truth is that this is probably not a 1.0 feature. Just a good thing to keep in mind for later.
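For readers without the linked prototype, a rough and independent sketch of the two modes. Everything here is an assumption made for illustration: the createSet factory, the options.cache flag, and the _cachedMean field are hypothetical names, and only update() is taken from the discussion:

```javascript
// Shared methods, created once and referenced by each array.
function mean() {
  if (this._cacheEnabled && this._cachedMean !== undefined) {
    return this._cachedMean;
  }
  var total = 0;
  for (var i = 0; i < this.length; i++) {
    total += this[i];
  }
  var result = total / this.length;
  if (this._cacheEnabled) {
    this._cachedMean = result;
  }
  return result;
}

// Call after modifying the data directly in cached mode.
function update() {
  this._cachedMean = undefined;
  return this;
}

// options.cache opts in to the faster, cached mode (hypothetical flag).
function createSet(values, options) {
  var set = values.slice();
  set._cacheEnabled = Boolean(options && options.cache);
  set._cachedMean = undefined;
  set.mean = mean;
  set.update = update;
  return set;
}

var cached = createSet([1, 2, 3], { cache: true });
var before = cached.mean();
cached.push(6);                      // direct native mutation
var stale = cached.mean();           // still the old cached value
var fresh = cached.update().mean();  // recomputed after update()

var uncached = createSet([1, 2, 3]);
uncached.push(6);
var alwaysFresh = uncached.mean();   // default mode recomputes every time
```

The stale read in cached mode is exactly the hazard described above: the object cannot see native mutations, so the caller has to signal them.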

I've also been thinking more about type flexibility and the best way to achieve it. For example: a Matrix that happens to be one-dimensional should be able to be used as a Vector (i.e. be accepted by Vector methods as another Vector but also have the Vector methods on it). That's a longer conversation, though, so I'll put together a sample implementation later to show what I have in mind.

Overall Structure

Given the above, here's the overall structure I see for the codebase:

  1. numbers.types.[namespace, which is the lowercased name for the top-level type].ClassName (e.g. numbers.types.matrix.Matrix, numbers.types.matrix.Vector)
  2. numbers.number.[static method name]. (e.g. numbers.number.primeFactorization(30))
  3. numbers.core.[static method name]. (e.g. numbers.core.sum)
  4. numbers.create[Class Name] All these methods would be added automatically though, based on the types. Ex code for that:

    for (var ns in numbers.types) {
      for (var type in numbers.types[ns]) {
        numbers["create" + type] = (function(constructor) {
          return function() {
            //new operator not needed because the objects
            //extend built-ins rather than modify "this"
            return constructor.apply(undefined, arguments);
          };
        }(numbers.types[ns][type]));
      }
    }

Misc.

I also like the idea of aliases, though I'd add a couple qualifications. First, to the extent we have a "canonical name" separate from the aliases, that name should be the one used most often by our users, even if it's not the "proper" name, because none of this code is truly internal (others must contribute to it and may just take bits of it). Second, I'd cap the number of names for something in the codebase at 3, while trying to keep it at no more than 2; otherwise, things'll get too messy.

As for validation, I agree with @milroc that it needs to be centralized (at least within each "class"), but beyond that I'm not sold on a specific solution. We can discuss this in another issue though, as it's not related to the API. Ditto handling function objects.

Other final thoughts?

milroc commented 11 years ago

Something I thought recently, that might be a little more important, is analyzing what it would cost to use node.js only systems (node-fibers or something else). I don't know where we stack up performance wise to other numerical/stat libraries. It might be worth adding an issue for someone to find out later in development.

Another thought is to consider porting C++ libraries into numbers.js for node.js (not with llvm.js or emscripten, thus also meaning that we can't support browsers). This would need to be limited to operations that aren't called frequently, since the cost of crossing from Node into native code is rather high.

Both of these mean that we wouldn't have browser support from my understanding. I also am not sure about the performance benefits.

I want to read what you wrote more in depth before I comment but initially it looks good.

milroc commented 11 years ago

Ethan,

I read a little deeper and only two things really seemed worth mentioning to me:

  1. numbers.number just sounds odd and bloated. The shorter the API as a whole the better. I would recommend collapsing this (and the things in core). Distinguishing what needs to be put into the function should be part of the documentation or disguised in the API itself (put in a number but wanted an array? just throw the number into its own array).
numbers.isPrime(n); 
numbers.sum([0,1,2,3]);

numbers.number.isPrime(n);
numbers.core.sum([0,1,2,3]); 
//if we're working with these distinctions, core should be array.

Also note that:
    • A good chunk of the things currently in core will need to be refactored to work in a much more general number case (the addition of complex numbers, for example).
    • We may refactor at some point to allow fewer biases in the library (for example, working with NaN rather than throwing errors, or representing infinity in our own way).
    • A lot of these can be run on a lot of things (I could see myself doing [0,1,2,3].map(numbers.isPrime);). Other approaches are possible too (e.g. isPrime may be extended to work with arrays, and we may create a way to support that, so saying it's meant only for numbers may not hold in the future).

  2. As for centralizing the location of the object checking/creation (see the _getValid for an 'idea' on what I mean by creation); I think it's very important. If we want these objects to work well with each other (not necessarily from an efficiency standpoint, a set of matrices would be slow), centralization will become a necessity.

ethanresnick commented 11 years ago

Miles,

On reflection, I totally agree with your point 1. Let's put the single integer functions and the core stuff directly on numbers.
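A minimal sketch of the flattened layout, assuming simple trial-division and loop-based implementations (illustrative only, not the library's code). The names numbers.isPrime and numbers.sum follow Miles's example:

```javascript
var numbers = {};

// Single-integer helper placed directly on numbers.
numbers.isPrime = function (n) {
  if (typeof n !== "number" || n % 1 !== 0 || n < 2) {
    return false;
  }
  for (var i = 2; i * i <= n; i++) {
    if (n % i === 0) {
      return false;
    }
  }
  return true;
};

// Former core helper, also placed directly on numbers.
numbers.sum = function (arr) {
  var total = 0;
  for (var i = 0; i < arr.length; i++) {
    total += arr[i];
  }
  return total;
};

// map passes (value, index, array); isPrime only reads the first argument.
var primeFlags = [0, 1, 2, 3].map(numbers.isPrime);
var total = numbers.sum([0, 1, 2, 3]);
```

Keeping isPrime indifferent to extra arguments is what makes it safe to hand straight to map.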

Also, as I've started to take a stab at implementing Set (I'll push this soon and we can talk about it in the other issue), I've realized that we do need a lot of methods that are best called "utility methods". Things like copying an array by value, patching support for indexOf cross-browser, patching typeof's failure to identify arrays, etc. So I say we should add back a numbers.utility where all these can go.

Re NaN, I can definitely see a place for it, but I also think it's too silent/slippery to replace throwing errors in most cases.

milroc commented 11 years ago

So if we could collapse this to one definitive "API v0.2.0" I'd be happy to try to convert everything to that by the end of this week or next. Or someone else could if they'd like to.

ethanresnick commented 11 years ago

I'll work on the definitive doc. Your help with the code conversion is much appreciated!

ethanresnick commented 11 years ago

This doc is now in progress on the wiki. Also, I'm going to start putting some of these things we've discussed into action on the abstract-structures branch.