scijs / contributing

Contribution guidelines

Consistent error handling #17

Open rreusser opened 9 years ago

rreusser commented 9 years ago

A few inconsistencies are creeping in. For one, what's the preferred mode of failure? Two general cases are:

  1. The algorithm fails due to invalid input (e.g. an incorrect type)
  2. The algorithm itself fails (e.g. it terminates early due to a zero pivot)

I'd tend to throw a TypeError for 1 and return false for 2, but mainly I just want to settle a couple of consistency questions now so that making things consistent doesn't require a bunch of major version bumps at some point in the future.
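As a concrete sketch of that split (the function name and the toy diagonal solve are hypothetical stand-ins for a real algorithm):

```javascript
// Sketch of the proposed convention: throw on invalid input,
// return false on algorithmic failure.
function solveInPlace (A, b) {
  if (!Array.isArray(A) || !Array.isArray(b)) {
    // Case 1: invalid input. Fail loudly with a TypeError.
    throw new TypeError('solveInPlace: expected array arguments')
  }
  for (var i = 0; i < b.length; i++) {
    if (A[i][i] === 0) {
      // Case 2: the algorithm itself fails (zero pivot). Return false.
      return false
    }
    // Toy "solve": divide through by the diagonal.
    b[i] /= A[i][i]
  }
  return true
}

var b = [2, 8]
solveInPlace([[2, 0], [0, 4]], b) // returns true; b is now [1, 2]
solveInPlace([[0]], [1])          // returns false (zero pivot)
```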

Compute.io has an opinionated contributing guide to solve this. I'm not advocating getting too deep into style-convention minutiae; I would just like to have to think about this less by defining a few conventions up front.

As usual, willing to draft something minimal and do organizational work just to get something out there to iterate on. Open to any thoughts here.

kgryte commented 9 years ago

+1. Consistency should be considered paramount. Speaking from experience, in a federated system, going back and making things consistent ends up being a massive undertaking.

Re: 2. You could return NaN instead. I do know that Python chooses to return false for some things, but I am not sure this is a good practice. Our approach has been the following:

Basically, to reiterate the point you are making, users should be able to follow a consistent set of conventions to understand function output:

Do you have any further thoughts, @Planeshifter?

rreusser commented 9 years ago

I think those conventions are reasonable, assuming the default output is true when the function just mutates its arguments.

kgryte commented 9 years ago

My thinking is that, even if a function mutates its arguments, it should still return a reference to the results; e.g., the original array.

Of course, for a function which mutates several arguments, this does not work. Then again, if a function mutates more than one input argument, I might question the API/implementation. Mutation is already a source of bugs and requires a greater understanding of data flow. If a function mutates different arguments in different ways, I might view this as a code smell, and the problem may be a candidate for further decomposition.

In short, I would argue that a function should return what a user might expect it to return. For example, if a function downsamples an array, mutating the original array input, I would still expect that, in general, a function which downsamples returns an array. I would not expect it to return a boolean.
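A minimal sketch of that convention, with a hypothetical scaleInPlace standing in for any routine that mutates a single argument:

```javascript
// Mutate in place, but still return a reference to the result.
function scaleInPlace (arr, alpha) {
  for (var i = 0; i < arr.length; i++) {
    arr[i] *= alpha
  }
  // Return the mutated input rather than a boolean, so the call
  // site gets what it expects and calls can be chained.
  return arr
}

var x = [1, 2, 3]
var y = scaleInPlace(x, 2) // y === x; both now hold [2, 4, 6]
```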

rreusser commented 9 years ago

Agreed for single-output functions. What would you recommend for, e.g., LU factorization? It necessarily mutates both L and U. I would tend toward:

```javascript
function LU( A, L, U ) {
  …
  return [L, U];
}
```

Default usage would be LU( A, L, U ), but in some cases the factorization may be in-place. Docs would spell out that the in-place usage is LU( A, A, U ).

Not sure how the JIT compiler handles the latter. Does the aliasing prevent optimization? (I recall that Fortran's explicitness about aliasing was one of its claimed advantages over C/C++ when it came down to raw speed, but I'm not sure I recall that correctly, and not sure it's a meaningful level of concern here. [Edit: yes, I am sure; it's not. :smile:]) I've been trying to avoid allocating/duplicating/assigning huge chunks of memory so that these functions aren't needlessly incompatible with inner loops, but it's challenging to find the right balance.

Fun reading: Destructuring Assignment
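For what it's worth, once destructuring is available, the [L, U] return style reads naturally at the call site. A toy 2x2 Doolittle factorization, just to illustrate the shape (lu2 and its API are hypothetical, and it does no pivoting):

```javascript
// Toy 2x2 Doolittle LU, only to show the [L, U] return shape.
function lu2 (A, L, U) {
  if (A[0][0] === 0) return false // zero pivot: algorithmic failure
  L[0] = [1, 0]
  L[1] = [A[1][0] / A[0][0], 1]
  U[0] = [A[0][0], A[0][1]]
  U[1] = [0, A[1][1] - L[1][0] * A[0][1]]
  return [L, U]
}

// ES2015 destructuring at the call site:
var [L, U] = lu2([[4, 3], [6, 3]], [], [])
// L is [[1, 0], [1.5, 1]]; U is [[4, 3], [0, -1.5]]
```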

kgryte commented 9 years ago

Agreed. Matlab goes with outputting an array [L,U]. Of course, Matlab also passes by value.

Re: internal memory allocation. Totally. Constantly allocating new chunks of memory is one reason why R can be slow. Our approach, where applicable, has been to provide an options object, having a copy option. If true, the function returns, e.g., a new array. Otherwise, all operations work directly on the input argument(s). Whether the default is copy=true or false depends on the context and expectation.
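A sketch of that copy-option pattern, with a hypothetical negate function; which default is right depends on context, as noted:

```javascript
// Hypothetical example of the options-object copy flag.
function negate (arr, opts) {
  opts = opts || {}
  // copy defaults to true here; the right default depends on context.
  var out = opts.copy === false ? arr : arr.slice()
  for (var i = 0; i < out.length; i++) {
    out[i] = -out[i]
  }
  return out
}

var a = [1, 2]
var b = negate(a)                  // new array [-1, -2]; a untouched
var c = negate(a, { copy: false }) // mutates a in place; c === a
```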

Re: optimization. Depends. If, say, an input array is mutated, such that all numeric array elements are replaced with objects, then the array is de-optimized (boxing, unboxing). If you go the other way (from objects --> numbers), you can make the array a candidate for future optimization if used extensively later. Basically, optimization can be affected if you are changing the data types of array elements.

Otherwise, there is nothing inherent in the LU( A, A, U ) signature that would indicate the function cannot be optimized.

Re: destructuring. Yes, this is a nice possibility, but we will probably be waiting awhile until this is adopted and optimized. :)

mikolalysenko commented 9 years ago

I don't like returning compound objects for multiple values, or aggressively allocating returned memory for users. One thing I would like to do with the ndarray-* libs is never have them return newly allocated memory; that is, the user should be responsible for all inputs and outputs. The reason is that memory management is crucial in JavaScript, because:

  1. There isn't much of it
  2. Allocating it is slow
  3. Garbage collection is expensive and unacceptable for realtime applications (like vision or games)

I think the way to do this for now is to have the user of each module be responsible for this.

Long term, I would like to build some libraries on top of ndarray which simplify memory management and make things easier for users, but for the core functionality (like BLAS, matrix factorizations, etc.) it is better to not allocate stuff and return it.

For temporary memory, I use this module:

https://github.com/mikolalysenko/typedarray-pool

This avoids memory allocation and initialization for typed arrays.
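The pooling idea can be sketched in a few lines. This is a toy, not typedarray-pool's actual implementation: the real module keys its pools by data type and size class, and (like this sketch) does not zero arrays on reuse:

```javascript
// Toy typed-array pool: recycle buffers instead of allocating fresh
// ones, so hot loops generate no garbage.
var freeLists = {}

function malloc (n) {
  var list = freeLists[n]
  // Reuse a freed array of the same length if one is available.
  return (list && list.length) ? list.pop() : new Float64Array(n)
}

function free (arr) {
  var list = freeLists[arr.length] || (freeLists[arr.length] = [])
  list.push(arr)
}

var scratch = malloc(1024)
// ... use scratch as temporary workspace ...
free(scratch) // back to the pool; the next malloc(1024) reuses it
```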

Regarding LU, my preference would be to follow API conventions like this library:

https://github.com/substack/ndarray-crout-decomposition

rreusser commented 9 years ago

That makes sense. I'm trying to write a bunch of stuff, but I'm totally open to critique, or to blowing it all away and trying again with a better approach, so thanks for the input! Don't hesitate to let me know if you see things (LUP is a good example) that just need to be changed.

I'm a little worried about performance in general, since it seems like a stretch to throw some stuff together and not end up with something kinda worthless next to transpiled LAPACK. It's still a little unclear to me whether we're aiming for a good scripting environment like numpy or more of a low-level workhorse like BLAS/LAPACK. I've tried hard to have everything fall back to cwise for performance, but gemm is an obvious example of where that falls apart, and there's no simple answer.

Re: allocating memory. I think I've succeeded (let me know if you see otherwise) in not allocating scratch space, or, in the case of Householder QR, at least in encapsulating it by providing a method that allocates just the right amount and then packing it to avoid filling a square matrix half full. I don't love it, but the goal was at least to be explicit and friendly about it. Thinking about it, though, I should really just return a size and let the user allocate/free a typed array from the pool.
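The "return a size, let the caller allocate" idea might look like the following sketch; the names and the packed-workspace formula are hypothetical:

```javascript
// Hypothetical workspace-query pattern: the routine reports its
// scratch requirement; the caller allocates (e.g. from a pool).
function qrWorkSize (m, n) {
  // Illustrative formula for packed Householder vectors:
  // column j holds m - j entries, summed over j = 0..n-1.
  return m * n - (n * (n - 1)) / 2
}

function qr (A, m, n, work) {
  if (work.length < qrWorkSize(m, n)) {
    throw new RangeError('qr: workspace too small')
  }
  // ... factorization would use `work` as scratch here ...
}

var work = new Float64Array(qrWorkSize(3, 2)) // 5 entries
qr([], 3, 2, work) // caller owns the buffer and can free/reuse it
```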

I guess returning in-place-modified arrays is redundant there. Maybe true is the right return value for such things.

Also regarding LU: yes. ndarray-crout-decomposition handles memory and input/output somewhat better than mine. I'm working now on minimizing data movement. The pivoting is tricky if you want it in-place with L and U in the same matrix.

Thanks for the feedback!