LukeMathWalker opened 5 years ago
What do we have right now from that list?
The answer depends somewhat on what we choose as our reference data structure. For now I'll refer to ndarray, given that it's still early days when it comes to dataframes.
Dimensionality reduction: SVD is available in ndarray-linalg. I am not aware of any ready-to-use implementation of PCA, ICA or NMF in the Rust ecosystem. Off the top of my head though (mind, it's been a while since I last touched a linear algebra book), all the required primitive routines should be available in ndarray-linalg - can you help me here @termoshtt?
Scaling: they should all be implementable in terms of map and the statistics already implemented in ndarray-stats and ndarray (see summary statistics, quantiles).
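As a rough illustration of what that could look like, here is a z-score scaling sketch; `standard_scale` is an invented helper and the statistics are computed by hand, though in practice they could come from ndarray-stats:

```rust
use ndarray::Array1;

/// Z-score scaling of a single feature column.
fn standard_scale(column: &Array1<f64>) -> Array1<f64> {
    let n = column.len() as f64;
    let mean = column.sum() / n;
    let variance = column.mapv(|x| (x - mean).powi(2)).sum() / n;
    let std = variance.sqrt().max(f64::EPSILON); // guard against constant columns
    column.mapv(|x| (x - mean) / std)
}

fn main() {
    let col = Array1::from(vec![1.0, 2.0, 3.0, 4.0]);
    println!("{}", standard_scale(&col));
}
```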
Encoding: I have found this routine in rustlearn for dictionary vectorization. There does not seem to be much either there or in rusty-machine. I foresee we will have to build these categorical encoders. Discretization could be implemented quite easily on top of the histogram functionality in ndarray-stats.
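For the categorical side, a minimal sketch of the kind of encoder we would have to build (the `LabelEncoder` type and its `fit`/`transform` methods are invented here, loosely mirroring scikit-learn's vocabulary):

```rust
use std::collections::HashMap;

/// Maps each distinct category seen during `fit` to a dense integer code.
struct LabelEncoder {
    mapping: HashMap<String, usize>,
}

impl LabelEncoder {
    fn fit(values: &[&str]) -> Self {
        let mut mapping = HashMap::new();
        for &v in values {
            let next = mapping.len();
            mapping.entry(v.to_string()).or_insert(next);
        }
        LabelEncoder { mapping }
    }

    /// `None` marks categories that were never seen during `fit`.
    fn transform(&self, values: &[&str]) -> Vec<Option<usize>> {
        values.iter().map(|v| self.mapping.get(*v).copied()).collect()
    }
}

fn main() {
    let encoder = LabelEncoder::fit(&["red", "green", "red", "blue"]);
    println!("{:?}", encoder.transform(&["green", "blue", "purple"]));
    // [Some(1), Some(2), None]
}
```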
Missing values: the main issue I foresee here is not really the implementation of the imputation mechanism - for those easy methods it's going to be trivial. The thorny issue is representing missing data (which Arrow does quite well!). In ndarray you could now use arrays of Options, but performance would not be optimal: we would have to roll out masked arrays - they are somewhat under the radar but not there yet.
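To make the trade-off concrete, a toy mean-imputation pass over an `Array1<Option<f64>>` (purely illustrative; `impute_mean` is not an existing API):

```rust
use ndarray::Array1;

/// Replace missing entries with the mean of the observed ones.
fn impute_mean(data: &Array1<Option<f64>>) -> Array1<f64> {
    let (sum, count) = data
        .iter()
        .flatten()
        .fold((0.0, 0usize), |(s, c), &x| (s + x, c + 1));
    let mean = if count > 0 { sum / count as f64 } else { 0.0 };
    data.mapv(|v| v.unwrap_or(mean))
}

fn main() {
    let col = Array1::from(vec![Some(1.0), None, Some(3.0)]);
    println!("{}", impute_mean(&col)); // [1, 2, 3]
}
```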
Missing values: the main issue I foresee here is not really the implementation of the imputation mechanism - for those easy methods it's going to be trivial. The thorny issue is representing missing data (which Arrow does quite well!). In ndarray you could now use arrays of Options, but performance would not be optimal: we would have to roll out masked arrays - they are somewhat under the radar but not there yet.
I wonder how open the Rust language team(s) would be to a Missing type... or if any discussions around this have happened? It would obviously be a long road to get a new native type into the standard library, especially something as controversial (and confusing) as this. But obviously the more closely aligned any kind of a Missing type is with Rust language design, the easier it will be to integrate between packages that need it.
That's a good list!
I'm not sure if it necessarily fits in 'preprocessing', but I would add tools for model selection:
If all estimator models implemented the same traits, we could use the same cross-validation framework over arbitrary learners. Included in this could be the idea of a common set of classification/regression metrics for model evaluation -- again, I'm not sure if this is exactly 'preprocessing', but it definitely cuts across multiple areas of concern.
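Something along these lines, purely as a sketch (the `Estimator`/`Predictor` traits and the `score` function are invented for illustration):

```rust
use ndarray::{Array1, Array2};

trait Estimator {
    type Model;
    fn fit(&self, x: &Array2<f64>, y: &Array1<f64>) -> Self::Model;
}

trait Predictor {
    fn predict(&self, x: &Array2<f64>) -> Array1<f64>;
}

/// Any learner implementing the traits above can be scored (and, by extension,
/// cross-validated) by one generic routine. MSE stands in for a pluggable metric.
fn score<E>(estimator: &E, x: &Array2<f64>, y: &Array1<f64>) -> f64
where
    E: Estimator,
    E::Model: Predictor,
{
    let model = estimator.fit(x, y);
    let predictions = model.predict(x);
    (&predictions - y).mapv(|e| e * e).mean().unwrap_or(0.0)
}
```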
Another possible thing to add: pipeline management. I haven't used the sklearn pipeline tools personally, but some mechanism to let users easily pipe a set of transformation and an estimator together might be useful.
For missing data representation, I feel like that should be handled at the DataFrame level (especially since the dataframe will likely be at least partially backed by Arrow, which already handles this via a null bitmask), and imputation handled in the preprocessing library.
The representation does get a bit tricky. I implemented a simple masked array in agnes, but there are definite usability and performance issues: it uses an Option-like enum called Value with variants Na and Exists(&T). Unfortunately, this means iteration is heavier (wrapping each data reference in an extra object).
Related: this NumPy missing-data proposal from 2011 is an interesting read reviewing a lot of the issues that come up when implementing missing data.
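Roughly the shape of the issue (an illustrative reconstruction, not the actual agnes code):

```rust
/// An `Option`-like wrapper over references, as described above.
enum Value<'a, T> {
    Na,
    Exists(&'a T),
}

/// Every access now yields a wrapper object rather than a bare `&T`,
/// which is what makes iteration heavier.
fn sum_existing(values: &[Value<'_, f64>]) -> f64 {
    values
        .iter()
        .map(|v| match v {
            Value::Exists(x) => **x,
            Value::Na => 0.0,
        })
        .sum()
}
```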
Good list for a starting point.
all required primitive routines should be available in ndarray-linalg
Current ndarray-linalg lacks components needed to implement PCA. As the scikit-learn documentation says, we need both full SVD and truncated SVD to implement the various types of PCA, but ndarray-linalg does not have truncated SVD. I am not familiar with ICA and NMF, but I guess they are in a similar state to PCA.
I think these are linalg routines and not limited to ML, so ndarray-linalg can accept them.
but ndarray-linalg does not have truncated SVD.
Implementing the randomized truncated SVD solver would be quite useful. In scikit-learn it's the default solver for PCA and TruncatedSVD and is based on the paper by Halko et al., 2009. I think in practice it will often be faster for ML applications than a full SVD solver.
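For reference, a very rough sketch of the randomized range-finder step from that paper, assuming ndarray, ndarray-rand and ndarray-linalg's QR/SVD traits; no oversampling or power iterations, and the exact trait signatures should be double-checked against the current crates:

```rust
use ndarray::{s, Array1, Array2};
use ndarray_linalg::{error::LinalgError, QR, SVD};
use ndarray_rand::{rand_distr::StandardNormal, RandomExt};

/// Approximate the top-`k` singular triplets of `a` (m x n, with k << min(m, n)).
fn randomized_svd(
    a: &Array2<f64>,
    k: usize,
) -> Result<(Array2<f64>, Array1<f64>, Array2<f64>), LinalgError> {
    // 1. Sample a Gaussian test matrix and capture the range of `a`.
    let omega = Array2::<f64>::random((a.ncols(), k), StandardNormal);
    let y = a.dot(&omega);
    // 2. Orthonormalise with a thin QR factorisation.
    let (q, _r) = y.qr()?;
    // 3. SVD of the small projected matrix B = Qᵀ A, then lift U back up.
    let b = q.t().dot(a);
    let (u_small, s, vt) = b.svd(true, true)?;
    let u = q.dot(&u_small.expect("U was requested"));
    let vt = vt.expect("Vᵀ was requested");
    // Keep only the first k right singular vectors.
    Ok((u, s, vt.slice(s![..k, ..]).to_owned()))
}
```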
Another topic is support for sparse data. TruncatedSVD in scikit-learn is often used on sparse data. In Rust such a solver could be implemented e.g. on top of the sprs crate.
I wonder how open the Rust language team(s) would be to a Missing type... or if any discussions around this have happened? It would obviously be a long road to get a new native type into the standard library, especially something as controversial (and confusing) as this. But obviously the more closely aligned any kind of a Missing type is with Rust language design, the easier it will be to integrate between packages that need it.
I think that type is already in the Rust language - it's Option, or any equivalent enum, but as @jblondin says it does imply a compromise when it comes to performance in iteration.
The way Arrow handles it, using a separate bit mask, should allow it to be as close as possible to the speed of a normal array while giving the user the impression that everything is actually wrapped into an Option, in the Rust world.
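A toy version of that idea (invented names; Arrow's actual layout packs the mask into a bitmap and is considerably more refined):

```rust
/// Values stored densely, validity tracked separately; `Option` only appears
/// at the access boundary.
struct MaskedColumn {
    values: Vec<f64>,
    valid: Vec<bool>, // Arrow packs this into a bitmap, one bit per value
}

impl MaskedColumn {
    fn get(&self, i: usize) -> Option<f64> {
        if self.valid[i] {
            Some(self.values[i])
        } else {
            None
        }
    }

    /// Bulk operations can skip the `Option` wrapping entirely and stay close
    /// to plain-array speed.
    fn sum_ignoring_na(&self) -> f64 {
        self.values
            .iter()
            .zip(&self.valid)
            .filter(|(_, v)| **v)
            .map(|(x, _)| *x)
            .sum()
    }
}
```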
I agree it should be handled at the DataFrame level, but it would be nice to have somewhat similar support in the ndarray library as well. I am going to read through the link you posted @jblondin - that NumPy discussion looks pretty interesting!
Another possible thing to add: pipeline management. I haven't used the sklearn pipeline tools personally, but some mechanism to let users easily pipe a set of transformation and an estimator together might be useful.
Definitely - defining a Pipeline trait of some sort should be one of our focuses. It then makes it easier to work on hyperparameter optimization, because you can do it directly at the pipeline level, allowing the user to tune hyperparameters living outside the model itself.
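Something as simple as this could be a starting point (a sketch only; the `Transformer` trait and `Pipeline` struct are invented for illustration):

```rust
use ndarray::Array2;

trait Transformer {
    fn transform(&self, x: Array2<f64>) -> Array2<f64>;
}

struct Pipeline {
    steps: Vec<Box<dyn Transformer>>,
}

impl Pipeline {
    /// Apply each step in order. A real design would also thread `fit` state
    /// through the steps and expose their hyperparameters for pipeline-level tuning.
    fn transform(&self, mut x: Array2<f64>) -> Array2<f64> {
        for step in &self.steps {
            x = step.transform(x);
        }
        x
    }
}
```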
Current ndarray-linalg lacks components needed to implement PCA. As the scikit-learn documentation says, we need both full SVD and truncated SVD to implement the various types of PCA, but ndarray-linalg does not have truncated SVD. I am not familiar with ICA and NMF, but I guess they are in a similar state to PCA. I think these are linalg routines and not limited to ML, so ndarray-linalg can accept them.
Gotcha - I think it makes sense to spec out exactly what we need for each of those algorithms and then we can start working on implementing the required primitives.
Same goes with sparse matrices - I haven't checked the current status of sprs, but it should definitely be something we need to get covered. Would you like to join the conversation @vbarrielle?
I managed to have a look at the NumPy document - if my understanding is correct, the bitpattern methodology that they describe is almost equivalent to having an ArrayBase using a data type that implements something similar to the MaybeNaN trait that we defined in ndarray-stats.
The issue I see with our MaybeNaN trait is that we are trying to handle Option<T> and floats at the same time (to remove the possibility that any of the values in the array was a legitimate NaN value instead of an NA). If we just want to deal with NA values, then a bare Option should do it, with the proper support of companion traits to decide how to handle those missing values.
The mask methodology, on the other hand (as described in the design document, rather than how it is implemented in numpy.ma), is very close to what Arrow is doing.
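Picking up the companion-trait idea from above, a minimal sketch of what such a policy trait could look like (all names invented):

```rust
use ndarray::Array1;

/// The policy for resolving missing entries lives in a trait,
/// not in the array type itself.
trait MissingPolicy<T> {
    fn resolve(&self, data: &Array1<Option<T>>) -> Array1<T>;
}

/// Replace every `None` with a fixed value.
struct FillWith<T>(T);

impl<T: Clone> MissingPolicy<T> for FillWith<T> {
    fn resolve(&self, data: &Array1<Option<T>>) -> Array1<T> {
        data.mapv(|v| v.unwrap_or_else(|| self.0.clone()))
    }
}
```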
Hey, I just wanted to add that MFCC/MFSC are common preprocessing steps for machine learning in the context of audio processing. If you want to build an ASR system, they decorrelate your pitch and formant functions and reduce the data complexity. They are also used in room classification, instrument detection, and really anything that has to do with natural sound sources. A crate which does the windowing, transformation, etc. would be great!
I am not very familiar with the problem space @bytesnake - could you provide some references and resources we can have a look at?
Here are some introductions to MFCCs
I wrote an MFCC library for a class recently; you can find it here: https://github.com/bytesnake/mfcc
Hey there, super interested in talking about ML in Rust! I'm working on ASR :) Standard ASR (Automatic Speech Recognition) systems rely on language models that are stored as wFSTs (Weighted Finite-State Transducers). The same goes for what is called the decoding graph. For the past months, I have been re-implementing an FST library in Rust --> https://github.com/Garvys/rustfst It is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Its main application is ASR, but it is also used a lot in NLP. Hope this is the right place to talk about it :)
Context: see https://github.com/rust-ml/discussion/issues/1.
This is meant to be a list of functionality we want to implement (a roadmap?) - I have refrained from including more sophisticated methods, limiting it to what I believe to be a set of "core" routines we should absolutely offer. For each piece of functionality I'd like to document what is already available in the Rust ecosystem.
This is meant to be a WIP list, so feel free to chip in @jblondin and edit/add things I might have missed.