stdlib-js / google-summer-of-code

Google Summer of Code resources.
https://github.com/stdlib-js/stdlib
26 stars 7 forks source link

[RFC]: achieve ndarray API parity with built-in JavaScript arrays #61

Closed rxbryan closed 5 months ago

rxbryan commented 6 months ago

Full name

Bryan Elee Atonye

University status

Yes

University name

University of Port Harcourt

University program

Mathematics and Computer Science

Expected graduation

July

Short biography

My name is Bryan Elee. I am in my final year pursuing a degree in Mathematics and Computer Science. I recently completed my final exam and project defense, hence I'm awaiting graduation in the next couple months. I possess a strong foundation in various programming languages, with over 5 years of programming experience with C/C++, Python, and JavaScript, honed through academic studies, practical projects and internships. I have previously participated in the Google Summer of Code 2022 under the Metacall Organization and in the summer of Bitcoin program last year working under the Ledger Organization. This experiences solidified my ability to work effectively within open-source communities and collaborate with experienced developers. I'm interested in machine learning, especially the field of reinforcement learning. I did some work on reinforcement learning last year and I am very excited about the possibilities offered by this technology.

Specific Achievements:

Timezone

UTC +1

Contact details

rxbryn@gmail.com

Platform

Linux

Editor

Sublime Text

Programming experience

I began programming before university in 2018. I started out writing shell scripts, moved on to C/C++, then the Python programming language, Javascript and NodeJS. I am self taught in the above languages, I was usually motivated by some project I was developing. I have worked on a couple projects but I am most proud of Resigner.

Resigner is an easy to program hot signing service for miniscript policies. The Resigner countersigns transactions (according to some rules (spending conditions), set in advance in the configuration file, for example “no more than 1 million satoshis per day” before the transaction is broadcast to the bitcoin network. It provides the following features:

It acts as a trusted third party in multiparty transactions enforcing previously agreed conditions

JavaScript experience

I have about 3 years of experience writing javascript programs. I have two published npm packages http-date and http-preconditions. I also have some experience doing backend web development using NodeJS, Express. I have contributed Javascript to a few open source projects such as

My favourite feature in javascript would be function prototypes. While this pattern has fallen out of favour being replaced by the class syntax, the prototype pattern provides an interesting approach for dynamic inheritance of object properties and behaviour.

My least favourite feature in Javascript is the event loop. While the event loop is responsible for the asynchronous behaviour in javascript, it is also makes writing true multithreaded javascript applications very difficult. Any attempt at optimising javascript code requires deep understanding of the nature of the event loop and how it affects the specific code being optimised. This experience is not readily available.

Node.js experience

My experience with NodeJS is quite extensive. I have some experience modifying NodeJS source code and compiling the Library for embedding purposes. Some of my experience developing node native addons and embedding NodeJS comes from contributing to the development of the node loader in metacall core. This draft PR contains a lot of my work in embedding nodejs. It was used as the base for implementing the feature for exporting classes and objects form nodejs to metacall. I also have some experience developing web applications using nodejs, express.js. I have also published some npm packages as I have elaborated on in the javascript section

C/Fortran experience

The C programming language is the first language I learnt, the second being C++. It is the language that I have clocked the most years of experience. I used C extensively while paticipating in the summer of code 2022 under metacall and I also worked on some personal projects using C. Some of my contributions to open source projects using C include https://github.com/metacall/core/pull/289 https://github.com/metacall/core/pull/270 https://github.com/metacall/core/pull/287 https://github.com/metacall/core/pull/298 Some of these merged PRs include C++ code. But still demonstrates my the requisite skill

Interest in stdlib

My interest in Stdlib is twofold.

Version control

Yes

Contributions to stdlib

Merged contributions

refactor: update blas/ext/base/sapxsumpw to follow current project conventions refactor: update blas/ext/base/scusumors to follow current project conventions refactor: update blas/ext/base/scusumpw to follow current project conventions refactor: update blas/ext/base/sapx to follow current project conventions

Goals

The goal of this project is to achieve API parity for Stdlib native ndarray with built-in JavaScript Array. Of all the existing JavaScript array method only the at and slice methods exist in ndarray.

Each of the APIs is a standalone package in either the @stdlib/ndarray/base or @stdlib/ndarray directory

Each package would have this file structure

Package_name
             |benchmark
             |docs
                       |types
                        repl.txt 
             |examples
             |lib
             |test
             Readme
             package.json

The following APIs will be implemented during the course of this project:

ndarray slice semantics for representing indices

APIs taking an Index or multiple indices will utilise the slice semantics. We shall use the slice API as it is, hence APIs such as fill, copywithin, splice etc shall take a slice object, array of slice objects or a multislice object.

Dimensionality Reduction

In APIs which it would be suitable to support operating over specific axes, we will be utilising approach used by numpy. A null axis, (the default) is would perform the operation over all the dimensions of the input ndarray. If this is an array of ints, a reduction is performed on multiple axes, instead of a single axis or all the axes as before. For example, given a three dimentional ndarray, axis = 0 represent reducing along the depth. 1 represents represent reducing along the row and 2 represents represent reducing along the column

Accessors

ndarray APIs taking a callback such as unary implement optimised accessors for dimensions upto the 10d. We shall use this approach while implementing the APIs requiring callbacks

APIs

/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {ndarrayLike}y - input ndarray
* @throws {TypeError} first argument must be an ndarray
* @throws {TypeError} second argument must be an ndarray
* @returns {ndarray} ndarray view
*/
concat(x, y)
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray 
* @param {...integer} target - Zero-based index at which to copy the sequence to.
* @param  {...*} s - slice arguments: a MultiSlice instance, an array of slice arguments, or slice arguments as separate arguments.
* @throws {TypeError} first argument must be an ndarray
* @throws {TypeError} index arguments must be integers
* @throws {RangeError} number of index arguments must equal the number of dimensions
* @returns {ndarray} target - ndarray view
*/
copywithin(x, target, s)
/**
* Returns an ndarray.
*
* @param {ndarrayLike} target - input ndarray
* @param {Number} value
* @param  {...*} s - slice arguments: a MultiSlice instance, an array of slice arguments, or slice arguments as separate arguments.
* @throws {TypeError} first argument must be an ndarray
* @throws {TypeError} value and index arguments must be integers
* @throws {RangeError} number of index arguments must equal the number of dimensions
* @returns {ndarray} target - ndarray view
*/
fill( target, value[, s] )

APIs that take a callback

/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - A function to execute for each element in the array.
* @param {...*} axis - null or int or array of ints, optional
* @throws {TypeError} first argument must be an ndarray
* @returns {ndarray} target - ndarray view
*/
filter(x, fcn[, axis])
/**
* Returns an ndarray element.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - unary callback
*  @param {...*} axis - null or int or array of ints, optional
* @throws {TypeError} first argument must be an ndarray
* @returns {*} ndarray element
*/
find(x, fcn[, axis])
/**
* Returns an ndarray element.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - A function to execute for each element in the array.
* @param {...*} axis - null or int or array of ints, optional
* @throws {TypeError} first argument must be an ndarray
* @returns {*} ndarray element
*/
findlast(x, fcn[, axis])
/**
* Executes a provided function once for each array element.
* 
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - A function to execute for each element in the array.
* @throws {TypeError} first argument must be an ndarray
* @returns {void} 
*/
foreach(x, fcn)
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - A function to execute for each element in the array.
* @returns {ndarray} ndarray 
*/
from(x [, fcn])
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - A function to execute for each element in the array.
*  @param {...*} axis - null or int or array of ints, optional
* @throws {TypeError} first argument must be an ndarray
* @returns {ndarray} output ndarray 
*/
map(x, fcn[, axis])
/**
* Returns the value that results from running the "reducer" callback function to completion over the entire array..
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - A function to execute for each element in the array.
* @param {...*} axis - null or int or array of ints, optional
* @returns {Number}  
*/
reduce(x, fcn [, initialvalue, axis])
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - A function to execute for each element in the array.
* @param {...*} axis - null or int or array of ints, optional
* @returns {ndarray} output ndarray 
*/
reduceright(x, fcn [, initialvalue, axis])
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} [fcn] - optional callback
* @param {...*} axis - null or int or array of ints, optional
* @returns {ndarray} reference to input array
*/
sort(x, [, fcn, axis])
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} [fcn] - optional callback
* @param {...*} axis - null or int or array of ints, optional
* @returns {ndarray} A new array with the elements sorted in ascending order.
*/
tosorted(x [, fcn, axis])
/**
* Returns a boolean .
*
* we don't have support for `bool` in ndarray, hence we flatten the input ndarray and operate over all the single dimension
* 
* @param {ndarrayLike} target - input ndarray
* @param {Number} searchElement - The value to search for. 
 * @param  {...*} s - slice arguments: a MultiSlice instance, an array of slice arguments, or slice arguments as separate arguments.
* @throws {TypeError} first argument must be an ndarray
* @throws {TypeError} index arguments must be integers
* @throws {RangeError} number of index arguments must equal the number of dimensions
* @returns {boolean} 
*/
includes( x, searchElement [, s])
/**
* Returns a string.
*
* @param {ndarrayLike} x - input ndarray
* @param {String} seperator
* @throws {TypeError} first argument must be an ndarray
* @returns {String} 
*/
join(x, seperator)
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @returns {ndarray} reference to input array
*/
reverse(x)
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {...integer} [start] - Zero-based index at which to start changing the array, converted to an integer.
* @param {integer} [deletecount] - An integer indicating the number of elements in the array to remove from start.
* @param {*} [ item1, item2, /* …, */ itemN] - The elements to add to the array, beginning from start.
* @returns {ndarray} reference to input array
*/
splice(x, start [, deleteCount, item1, item2, /* …, */ itemN])
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @returns {ndarray} A new array containing the elements in reversed order.
*/
toreversed(x)
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {...integer} [start] - Zero-based index at which to start changing the array, converted to an integer.
* @param {integer} [deletecount] - An integer indicating the number of elements in the array to remove from start.
* @param {*} [ item1, item2, /* …, */ itemN] - The elements to add to the array, beginning from start.
* @throws {TypeError} first argument must be an ndarray
* @returns {ndarray} A new array that consists of all elements before start, item1, item2, …, itemN, and all elements after start + deleteCount.
*/
tospliced(x, start [, deleteCount, item1, item2, /* …, */ itemN])
/**
* Returns a string.
*
* @param {ndarrayLike} x - input ndarray
* @throws {TypeError} first argument must be an ndarray
* @returns {String} A string representing the elements of the array.
*/
tostring(x)
/**
* Returns A new iterable iterator object.
*
* @param {ndarray} x - input array
* @returns {Iterator} iterator
*/
values(x)
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {Callback} fcn - A function to execute for each element in the array.
* @returns {ndarray} ndarray 
*/
flatmap(x , fcn)
/**
* Returns an ndarray.
*
* @param {ndarrayLike} x - input ndarray
* @param {...*} axis - null or int or array of ints, optional
* @param {integer} [depth]  
* @returns {*} ndarray element
*/
flat(x [,depth, axis])

Why this project?

Ndarrays are foundational to working with the stdlib library. They provide an efficient way to work with multi-dimensional numerical data. This project is a high priority for Stdlib for the fore-mentioned reason. It adds APIs that would be utilised in every package in the library. The Knowledge of working with multi-dimensional numerical data is a highly valuable skill for data science and machine learning, career paths I intend on pursuing. A significant portion of data science and machine learning involves working with numerical data, often organized in multi-dimensional structures like matrices and tensors. These structures represent complex relationships between features and observations. Understanding how to manipulate, analyze, and interpret this data is very important, this project hence affords me first hand experience with the ndarray object. I also stand to gain knowledge optimal techniques and patterns for iterating multidimensional arrays, possibly other optimisation techniques that might be used during the course the project.

Qualifications

I have completed the course work for a degree in Mathematics and Computer science. The relevant courses to this project would be Linear algebra, Numerical analysis, Data structures and algorithms.

I am also acquainted with the book Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne. It helped develop my understanding of both data structures and algorithms.

I am also quite familiar with the emcascript specification. The definitions and implementations of the APIs will be informed by it

Prior art

The at and slice methods exist in ndarray. Various ndarray APIs have also being implemented. They will inform and guide our implementation of the project

Commitment

As stated in background section, I recently completed my final exam and project defense. Hence I'm free from any major commitments and will be able to give a ~40hr/week to this project

Schedule

Assuming a 12 week schedule,

Each of the APIs to be implemented is standalone, and will not be considered implemented without its benchmarks, tests, documentations and examples. So rather than having a week for documentation, tests and so on...I intended to submit PRs to atleast 3 APIs per week.

Related issues

No response

Checklist

Pranavchiku commented 6 months ago

@rxbryan this looks good, thanks for the proposal! I am not so familiar with ndarray APIs so no detailed suggestions, generic ones include explaining implementation of one of the APIs, the necessary files you may need to add, changes, etc. Link the [Idea] issue in Related issues section. Rest all is good.

rxbryan commented 6 months ago

Thanks for the review @Pranavchiku.

Link the [Idea] issue in Related issues section. Rest all is good.

I don't think there's an issue related to this project. Should I create one?

steff456 commented 6 months ago

@rxbryan thanks for opening this draft proposal!

I think overall the projects looks very good, I will ask you to add all the APIs in a list such that we can create in a future a tracking issue like this one. What I am seeing from this proposal is that you are not leaving room for review cycles, specially at the end of the project. It would be nice to leave room for these interactions and maybe if you are interested in writing a blog post for documenting your journey we can add it towards the final month of this proposal (please note that it is completely optional and it is up to you :) ).

kgryte commented 6 months ago

Building on the previous comments, I have a few comments of my own:

  1. Any mutation APIs (e.g., push, shift, unshift, pop) are non-starters. ndarrays have fixed memory. So I am a bit curious why you included these in your proposal. Furthermore, how are these APIs suppose to work in a multi-dimensional context?
  2. Various APIs (e.g., every, some, etc) return a boolean. But this is a bit limiting, as we'd likely want to support operating over specific axes, resulting in an array of lower rank. Your proposal seems to propose flattening the input array. While useful in some contexts, this is limited. Furthermore, ndarrays APIs should not return scalars, but zero-dimensional ndarrays. Additionally, if you are planning to support dimensionality reduction, do note that we have yet to implement a bool dtype. So how will you support?
  3. Various APIs (e.g., indexOf, findIndex, etc) return indices. We only support up to int32/uint32 dtypes, but ndarrays are allowed to have more than 2**32-1 elements. How are you planning to support?
  4. In general, you need to rethink your API proposals to account for dimensionality reduction semantics.
  5. How will the reduction APIs work?
  6. For things like fill, you've specified integer indices. This likely does not make sense, as we're more likely to support slicing semantics. In fact, it is not clear why we'd want to support index arguments at all, given that views can be defined in userland.
  7. In general, based on your API documentation, I am not sure you've actually studied our existing ndarray packages. Many of the API proposals seem LLM generated and not grounded in how we'd design APIs.
rxbryan commented 6 months ago
  1. Any mutation APIs (e.g., push, shift, unshift, pop) are non-starters. ndarrays have fixed memory. So I am a bit curious why you included these in your proposal. Furthermore, how are these APIs suppose to work in a multi-dimensional context?

I assumed here that the user of this APIs (splice, push, shift, unshift, pop) would be more interested in manipulating the dimensions of the ndarray (ie the shape of the ndarray). We could use slice to extract a specific "sub-ndarray" from the n-dimensional array, while using assign, slice-assign to copy to a new ndarray.

we could then assign a placeholder to the original elements we want to effectively shift, or pop. We could do the reverse for shift and push, adding a sub-array

  1. Various APIs (e.g., every, some, etc) return a boolean. But this is a bit limiting, as we'd likely want to support operating over specific axes, resulting in an array of lower rank. Your proposal seems to propose flattening the input array. While useful in some contexts, this is limited.

Yes the proposal assumes that the input array will be viewed as a flat single dimensional array for the includes, every, some.

We only support up to int32/uint32 dtypes

my intention here was to implement the APIs within existing state of ndarray API which now seems a bit short sighted.

For things like fill, you've specified integer indices. This likely does not make sense, as we're more likely to support slicing semantics. In fact, it is not clear why we'd want to support index arguments at all, given that views can be defined in userland.

I would have to agree that the slicing semantics would be more appropriate. I think the proposal suffers alot from conforming to the JavaScript SPEC of the APIs

  1. In general, based on your API documentation, I am not sure you've actually studied our existing ndarray packages. Many of the API proposals seem LLM generated and not grounded in how we'd design APIs.

The API documentation was derived from https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array and the JSdoc from existing APIs in ndarray/base not any LLM. It wasn't the final implementation of the APIs nor was any attempt made to follow the project design principles. It was just meant to visualize my idea of the APIs. I quite sorry if it looks plagarised!

I have to admit that is proposal was over enthusiastic. The attempt to cover the entire JavaScript built-in prototype methods have broadened the scope of this proposal than it should have been. Implementing this proposal as it is would require implementing this project idea https://github.com/stdlib-js/google-summer-of-code/issues/43 and Uint64 dtype support. With just one day to end of project submission I don't think this project can be updated reasonably. I'm willing to close the proposal if it doesn't quite suite the community's standards.

kgryte commented 6 months ago

The project idea is not intended to match exact ECMAScript defined semantics for Array and TypedArray objects. Instead, the goal is provide "analogous" APIs which are conceptually similar to the methods on those objects. The current example is Array.prototype.slice and @stdlib/ndarray/slice. Those ideas are conceptually similar, but the latter is modified to match the semantics and constraints of multi-dimensional arrays. That should be similarly done for other proposed APIs. As a start, you could examine the APIs afforded by NumPy to better get a sense of multi-dimensional array design principles.