Open rreusser opened 9 years ago
+1. Consistency should be considered paramount. Speaking from experience, in a federated system, going back and making things consistent ends up being a massive undertaking.
Re: 2. You could return NaN
instead. I do know that Python chooses to return false
for some things, but I am not sure this is a good practice. Our approach has been the following:
null
when inputs are valid but no computation can be run; e.g., supplying an empty array
to a function which calculates the mean. NaN
when either the function is not defined for some value, or when a failure occurs.Error
if the user provides invalid inputs.Basically, to reiterate the point you are making, users should be able to follow a consistent set of conventions to understand function output:
Error
means I did something wrongNaN
means internally something went awry, or the input is valid but undefinednull
means nothing could be computedArray|Object|Number|String
means that everything is okayYou have any further thoughts @Planeshifter ?
I think those conventions are reasonable. Assuming the default output is true
if the function just mutates its arguments.
My thinking is that, even if a function
mutates its arguments, it should still return a reference to the results; e.g., the original array
.
Of course, for a function
which mutates several arguments, this does not work. Although, if a function does mutate more than one input argument, I might question the API/implementation. Mutation is already a source of bugs and requires a greater understanding about data flows. If a function mutates different arguments in different ways, I might view this as code smell and the problem may be a candidate for further decomposition.
In short, I would argue that a function should return what a user might expect it to return. For example, if a function downsamples an array
, mutating the original array
input, I would still expect that, in general, a function which downsamples returns an array
. I would not expect it to return a boolean
.
Agreed for single-output functions. What would you recommend for, e.g., LU factorization. It necessarily mutates L and U. I would tend toward:
function LU( A, L, U ) {
…
return [L, U];
}
Default usage would be LU( A, L, U)
, but in some cases the factorization may be in-place. Docs would spell out that in-place factorization usage is LU( A, A, U)
.
Not sure how the JIT compiler handles the latter. Does that prevent optimization? (I recall that Fortran being explicit about aliasing was one of its claimed advantages over C/C++ when it came down to raw speed, but not sure I recall that correctly and not sure that's a meaningful level of concern here. [Edit: yes, I am sure; it's not. :smile:]) I've been trying to avoid doing things like allocating/duplicating/assigning huge chunks of memory so that these functions aren't needlessly incompatible with inner loops, but it's challenging to find a perfect balance.
Fun reading: Destructuring Assignment
Agreed. Matlab goes with outputting an array [L,U]
. Of course, Matlab also passes by value.
Re: internal memory allocation. Totally. Constantly allocating new chunks of memory is one reason why R can be slow. Our approach, where applicable, has been to provide an options
object, having a copy
option. If true
, the function returns, e.g., a new array
. Otherwise, all operations work directly on the input argument(s). Whether the default is copy=true
or false
depends on the context and expectation.
Re: optimization. Depends. If, say, an input array
is mutated, such that all numeric array
elements are replaced with objects
, then the array
is de-optimized (boxing, unboxing). If you go the other way (from objects
--> numbers
), you can make the array
a candidate for future optimization if used extensively later. Basically, optimization can be affected if you are changing the data types of array
elements.
Otherwise, nothing inherent in the LU( A, A, U )
signature which would indicate that the function cannot be optimized.
Re: destructuring. Yes, this is a nice possibility, but we will probably be waiting awhile until this is adopted and optimized. :)
I don't like returning a compound objects for multiple values or aggressively allocating returned memory for users. One thing I would like to do with ndarray-* libs is never have them return memory by allocation, that is the user should be responsible for all inputs and outputs. The reason for this is that memory management is very crucial in JavaScript, because:
I think the way to do this for now is to have the user of each module be responsible for this.
Long term, I would like to build some libraries on top of ndarray which simplify memory management and make things easier for users, but for the core functionality (like BLAS, matrix factorizations, etc.) it is better to not allocate stuff and return it.
For temporary memory, I use this module:
https://github.com/mikolalysenko/typedarray-pool
This avoids memory allocation and initialization for typed arrays.
Regarding LU, my preference would be to follow API conventions like this library:
That makes sense. I'm trying to write a bunch of stuff, but I'm totally open to critique or blowing it all away and trying again with a better approach, so thanks for the input! Don't hesitate to let me know if you see things (LUP = good example) that just need to be changed.
I'm a little worried about performance in general since it seems like a stretch to throw some stuff together and not end up with something kinda worthless next to transpiled LAPACK. It's still a little unclear to me whether we're aiming for a good scripting language like numpy or more of a low-level workhorse like BLAS/LAPACK. I've tried hard to have everything fall back up to cwise for performance, but gemm is an obvious example of where that falls apart and there's not a simple answer.
Re: allocating memory. I think I've succeeded (let me know if you see otherwise) in not allocating scratch space or, in the case of householder QR, at least trying to encapsulate it by providing a method that allocates just the right amount and then packing it to avoid filling a square matrix half full. I don't love it, but the goal was to be at least explicit and friendly about it. Thinking about it though, I should really just return a size and let the user allocate/free a typed array from the pool.
I guess returning in-place-modified arrays is redundant. Maybe true
is the right return code for such things.
Also regarding LU: yes. ndarray-crout-decomposition handles memory and input/output somewhat better than mine. I'm working now on minimizing data movement. The pivoting is tricky if you want it in-place with L and U in the same matrix.
Thanks for the feedback!
A few inconsistencies are creeping in. For one, what's the preferred mode of failure? Two general cases are:
I'd tend to throw a TypeError for 1 and return false for 2 but mainly just want to resolve a couple consistency questions so it doesn't require a bunch of major version changes to make things consistent at some point in the future.
Compute.io has an opinionated contributing guide to solve this. I'm not advocating getting too deep into style convention minutiae, but would just like to think about this less by defining a few conventions.
As usual, willing to draft something minimal and do organizational work just to get something out there to iterate on. Open to any thoughts here.