smartcorelib / smartcore

A comprehensive library for machine learning and numerical computing. The library provides a set of tools for linear algebra, numerical computing, optimization, and enables a generic, powerful yet still efficient approach to machine learning.
https://smartcorelib.org/
Apache License 2.0

Patch to version 0.4.0 #257

Closed Mec-iS closed 4 months ago

Mec-iS commented 1 year ago

Collecting changes for version 0.4.0

New expected behaviour

see #256

Change logs

codecov-commenter commented 1 year ago

Codecov Report

Attention: Patch coverage is 54.16667%, with 33 lines in your changes missing coverage. Please review.

Project coverage is 45.49%. Comparing base (80a93c1) to head (3fcd722).

| Files | Patch % | Lines |
| --- | --- | --- |
| src/linalg/basic/matrix.rs | 57.35% | 29 Missing :warning: |
| src/algorithm/neighbour/bbd_tree.rs | 0.00% | 2 Missing :warning: |
| src/error/mod.rs | 0.00% | 2 Missing :warning: |

Additional details and impacted files

```diff
@@              Coverage Diff              @@
##           development     #257      +/-   ##
===============================================
- Coverage        46.44%   45.49%    -0.96%
===============================================
  Files               85       85
  Lines             7247     7254        +7
===============================================
- Hits              3366     3300       -66
- Misses            3881     3954       +73
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Mec-iS commented 1 year ago

@morenol we need to fix Wasm tests

Mec-iS commented 1 year ago

Releasing a new minor version for these changes alone is not completely justified. We need to add some more implementations to justify a minor release.

Mec-iS commented 1 year ago

Also: in accordance with the library's principles, it is probably better to panic early at initialization than to return a Result. smartcore is meant to be as stable as possible after compilation; using a Result may just move problems into runtime. So there is the option of simply allowing a controlled panic!(..) instead of returning a failure.

EdmundsEcho commented 1 year ago

To which function are you referring?

I get that if a config or something early on is flawed, the app should exit with useful feedback. Elsewhere, the user can "opt in" to a panic by using unwrap.

Here is a link that might be useful: https://doc.rust-lang.org/book/ch09-03-to-panic-or-not-to-panic.html. I see the line as: if "dangerous/harmful", then panic; otherwise Err. Err is a valid answer given the input, whereas a panic is not generally expected behavior. Furthermore, "fail early" is not really what we are talking about; the question is what the lib returns at the time a mistake is discovered.

As a user of the library, I would expect to be able to send commands over time. A user (or service), might "get it wrong". Does that mean the user cannot "try again"?

In the event I run this in Tokio, I would not want the thread to panic (very bad and tough to navigate), but rather to be given the answer, e.g. an Err saying the data could not be processed because it was wrong, in a match branch I can recover from.

Runtime errors in Python are the reason I moved to Rust. Returning Err is not a runtime error like we experience in Python; it's only a decision about what to return, app crash/panic or Err.

More generally about runtime errors: thanks to Rust's compile-time checks, they don't equate to Python's (the two are not comparable).

Right now the lib is fragile because it doesn't have strong typing through and through (likely mostly on the result side of the process). It will get better over time as the app stops instantiating bad states (like what we have with DenseMatrix in 0.4). Until then, let the user know with an Err. For instance, getting results is a fragile process.
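The two styles under discussion can be sketched side by side (a minimal illustration with hypothetical names, not smartcore's actual API):

```rust
// Hypothetical error type for illustration only.
#[derive(Debug)]
enum FitError {
    // Input matrix row count doesn't match the target vector length.
    ShapeMismatch { rows: usize, targets: usize },
}

// Panicking style: a bad input aborts the calling thread.
fn fit_or_panic(rows: usize, targets: usize) {
    assert_eq!(rows, targets, "shape mismatch");
}

// Result style: the caller decides how to recover.
fn fit(rows: usize, targets: usize) -> Result<(), FitError> {
    if rows != targets {
        return Err(FitError::ShapeMismatch { rows, targets });
    }
    Ok(())
}

fn main() {
    // The caller can retry with corrected input instead of crashing,
    // which matters inside async runtimes like Tokio.
    match fit(10, 9) {
        Ok(()) => println!("fitted"),
        Err(FitError::ShapeMismatch { rows, targets }) => {
            println!("recoverable: {rows} rows vs {targets} targets");
        }
    }
}
```

The Result style keeps recovery in the caller's hands (the Tokio scenario above), while a user who prefers crashing can still opt in with `.unwrap()`.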

What do you think?

morenol commented 1 year ago

I also think that a Result is better than panic for this case. Also agree that we could delay the crates.io release until we have more things that we could add to a breaking change release

EdmundsEcho commented 1 year ago

I agree there is more that can be done for the next major release. I have some ideas for how to add to the changes in 0.4. Many of the ideas come from what Axum has done for http.

  1. Reporting

I tried to use the coefficients and intercept to output probabilities (not binary classes) and was surprised by how fragile the process was. e.g., get_row(0) and get((0,0)) require a lot of "under the hood" knowledge, not to mention the gymnastics required to extract values from &X.

One way to go from a -> b -> c without the user being aware is to create what I call "transient types". For these transient types you build implementations of std::convert::From. This way, the end user can do something like:

```rust
let coefficients: Vec<_> = lr.coefficients().into();
let intercept: f64 = lr.intercept().into();

// or
let coefficients: Vec<_> = lr.coefficients().iter().collect();
let intercept: f64 = lr.intercept(); // casts to f64
```

There are other "best practices" for increasing the usability of each method, such as the AsRef and Borrow traits, that might be worth looking through.
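As a minimal sketch of the transient-type idea (all names here are hypothetical, not smartcore's current API): the library returns a small wrapper type, and `From` impls let the caller convert it with a plain `.into()`:

```rust
// Hypothetical transient wrapper around fitted coefficients.
// The end user never has to name this type; `From` impls do the work.
struct Coefficients(Vec<f64>);

impl From<Coefficients> for Vec<f64> {
    fn from(c: Coefficients) -> Self {
        c.0
    }
}

impl From<Coefficients> for f64 {
    // Collapse a single-coefficient view to a scalar (first entry).
    fn from(c: Coefficients) -> Self {
        c.0[0]
    }
}

// Stand-in for something like `lr.coefficients()`.
fn coefficients() -> Coefficients {
    Coefficients(vec![0.5, -1.25])
}

fn main() {
    // The target type annotation drives which `From` impl is used.
    let as_vec: Vec<f64> = coefficients().into();
    let first: f64 = coefficients().into();
    println!("{as_vec:?} {first}");
}
```

The design choice here is that conversion logic lives next to the wrapper type rather than in user code, so the "gymnastics" of extracting values happen once, inside the library.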

  2. Augment what is included in a report, e.g., the F statistic. I'm not all that familiar with the subtle differences between ML and traditional statistical modeling; I come from the latter. The incremental effort to report on "common sense" checks that we should be interested in is likely good fodder for discussion. It might be a feature, or the user choosing to "see" more of what has already been computed with the default features.

  3. Move away from building your own underlying data structures for hosting and executing common math operations; e.g., use ndarray. We are already using argmin; we should do the same for the underlying data-hosting structs.

  4. Implement the builder pattern, like what we experience in many Rust crates, e.g., building an http request. I see what we are doing here in a similar vein. This will augment the capacity to build a "default" for whatever analysis is "more often" what you want. It would also require augmenting the documentation to fully explain the additional options (accelerated by pointing users to generically available descriptions of the concepts on established stats/ML sites?).

  5. All in all, it might be useful to "draw a line" between modules that you want to be able to rely on to emit Result and other, internal private methods that only assert to convey a critical dependence, but that can be made "sound" through internal, actively monitored assertions. In general: outside "monad", inside "pure".
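The builder-pattern item above could look roughly like this (a sketch with invented names, not an existing smartcore API):

```rust
// Hypothetical settings struct, in the style of http-request builders.
#[derive(Debug, Clone)]
struct RegressionSettings {
    fit_intercept: bool,
    max_iter: usize,
    tol: f64,
}

impl Default for RegressionSettings {
    // The "more often what you want" configuration.
    fn default() -> Self {
        Self { fit_intercept: true, max_iter: 100, tol: 1e-6 }
    }
}

struct RegressionBuilder {
    settings: RegressionSettings,
}

impl RegressionBuilder {
    fn new() -> Self {
        Self { settings: RegressionSettings::default() }
    }

    // Each setter consumes and returns `self`, enabling chaining.
    fn max_iter(mut self, n: usize) -> Self {
        self.settings.max_iter = n;
        self
    }

    fn tol(mut self, t: f64) -> Self {
        self.settings.tol = t;
        self
    }

    fn build(self) -> RegressionSettings {
        self.settings
    }
}

fn main() {
    // Override only what differs from the default.
    let settings = RegressionBuilder::new().max_iter(500).build();
    println!("{settings:?}");
}
```

A builder like this also gives the documentation a natural home: each chained method documents one option, instead of one constructor documenting all of them at once.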

For the next major after this, or the next after that :)), I would start thinking more about performance. I hit performance issues at a 50k x 100 sample size. Part of this would include a review of when we borrow vs. take ownership. Ironically, I wonder if there is too much effort spent avoiding copies. My main point is this: it's still too early to worry about performance, in my view. Just note that we have room to improve when the time comes. Unlike in Python, it will actually make the code both better and more readable, not more convoluted :))

Finally, I've surveyed several packages to accomplish what I need for building stat models. It's a fragmented space with many "half-way" efforts. I like what you all have accomplished. I believe the core strength is assembly: data -> design -> results. I see Axum in smartcore. They assemble the best while focusing on a productive, type-safe, Rust user experience.

I hope this input is helpful. I spent many years working in Rust. I'm happy to help however you all like.

Mec-iS commented 1 year ago

Thanks for your ideas. I am glad you found smartcore interesting and you plan to contribute more!

I believe the library can be very useful in different domains, and I am now focusing on using the library in applications to see what it can do. My idea is to keep the library different from the others along multiple dimensions. There is already a lot of effort going into making things work for "larger" high-level use cases, and side-stepping that approach can be very interesting in the long term for domains that are not in the "big data" scenario; polars and linfa are already doing great there, and we don't want to overlap too much with their use cases (please see the CONTRIBUTING guide).

I am currently taking feedback mostly from embedded developers and audio technologists, focusing on real-time processing as an objective, so the use cases I have in mind are more in the context of low-level, small-device, highly portable, high-frequency computations. This field of applications probably needs something like the latest research on dimensionality-reduction techniques more than support for billions of features, or else ML algorithms made more useful on tabular datasets.

In this context user experience is a bit of a second-order problem, but as long as resource usage stays at these levels or improves (in terms of time, but, maybe more importantly for small devices, also space complexity), I am in general OK with better APIs and more features.

It would be great to have more idiomatic Rust. Let's start with the transient types and the builder pattern; after that we probably want to set up good profiling (we have a bench repo, but it is very limited and difficult to run; it would be nice to have something like tarpaulin but for profiling, running in CI/CD). Vlad, who created the library, is also of the opinion that we should drop our custom trait system on arrays, but this is something we can look at later, as long as we provide the bindings to ndarray. It would also be nice to finally fix predict_proba #211, but I don't know the functionality well enough to tell whether it is a problem with the numerical stability of the library or with the implementation itself.

Welcome! @EdmundsEcho

EdmundsEcho commented 1 year ago

To better understand how the design intent was implemented, may I ask:

  1. what are the benefits of using a trait Array2d instead of a concrete type that works optimally with the solver? e.g., why not use DenseMatrix and convert whatever users want to this standard?
  2. are all of the methods in the Array-related traits used internally? Or are they there to support pre-processing?
  3. how do you see the line between pre-processing / post-processing and smartcore?
  4. why not use argmin?
Mec-iS commented 1 year ago

> To better understand how the design intent was implemented, may I ask:
>
> 1. what are the benefits of using a trait `Array2d` instead of a concrete type that works optimally with the solver? e.g., why not use `DenseMatrix` and convert whatever users want to this standard?
>
> 2. are all of the methods in the `Array`-related traits used internally? Or are they there to support pre-processing?
>
> 3. how do you see the line between pre-processing / post-processing and smartcore?
>
> 4. why not use argmin?
  1. I think it is to take advantage of Rust's zero-cost abstractions and provide zero-copy slicing and iterators; this way it is also possible to implement other custom concrete types (like a SparseMatrix, for example)
  2. They are used internally, but they can also be used to define other types. I would keep the custom trait system, as we may want to have a no-dependency (or even a no_std) feature/fork at some point in the future
  3. the library has always focused on linear algebra and matrix manipulation; the rest is left to the user. We can probably add anything that is useful behind feature configurations, but the core functionality has to stay stand-alone for a minimal installation.
  4. I guess it didn't exist at the time (2018); check the discussions for other hypotheses about possible integrations. Again, using any library that is based on ndarray in the internals will make everything dependent on it, and I don't know if it's worth it, as we would just overlap with other libraries' use cases and working patterns.
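The zero-cost-abstraction point in answer 1 can be illustrated generically (a simplified sketch, not smartcore's actual `Array2` trait): an algorithm written once against a trait is monomorphized for each concrete storage type, dense or sparse, with no dynamic dispatch.

```rust
// Simplified stand-in for an Array2-style trait (illustrative only).
trait Matrix {
    fn get(&self, row: usize, col: usize) -> f64;
    fn shape(&self) -> (usize, usize);
}

// Dense row-major storage.
struct Dense {
    data: Vec<f64>,
    cols: usize,
}

impl Matrix for Dense {
    fn get(&self, row: usize, col: usize) -> f64 {
        self.data[row * self.cols + col]
    }
    fn shape(&self) -> (usize, usize) {
        (self.data.len() / self.cols, self.cols)
    }
}

// Toy sparse storage: only non-zero entries are kept.
struct Sparse {
    entries: Vec<(usize, usize, f64)>,
    dims: (usize, usize),
}

impl Matrix for Sparse {
    fn get(&self, row: usize, col: usize) -> f64 {
        self.entries
            .iter()
            .find(|&&(r, c, _)| r == row && c == col)
            .map_or(0.0, |&(_, _, v)| v)
    }
    fn shape(&self) -> (usize, usize) {
        self.dims
    }
}

// One algorithm, specialized at compile time per concrete type.
fn trace<M: Matrix>(m: &M) -> f64 {
    let (rows, cols) = m.shape();
    (0..rows.min(cols)).map(|i| m.get(i, i)).sum()
}

fn main() {
    let d = Dense { data: vec![1.0, 2.0, 3.0, 4.0], cols: 2 };
    let s = Sparse { entries: vec![(0, 0, 1.0), (1, 1, 4.0)], dims: (2, 2) };
    println!("{} {}", trace(&d), trace(&s));
}
```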
Mec-iS commented 4 months ago

@morenol please take a look

morenol commented 4 months ago

I think that we need to run cargo fmt