Preprocessing - What do we need? What do we have?

LukeMathWalker commented 5 years ago

Context: see https://github.com/rust-ml/discussion/issues/1.

This is meant to be a list of functionality we want to implement (a roadmap?) - I have refrained from including more sophisticated methods, limiting to what I believe to be a set of "core" routines we should absolutely offer. For each piece of functionality I'd like to document what is already available in the Rust ecosystem.

This is meant to be a WIP list, so feel free to chip in @jblondin and edit/add things I might have missed.

Dimensionality reduction:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Independent component analysis (ICA)
- Non-negative matrix factorization (NMF)
Scaling:
- Standard scaling (zero mean and unit variance)
- Range scaling (specify min and max)
Encoding:
- One hot encoding
- Ordinal encoding
- Discretization of continuous variables
Missing values:
- Naive imputation (constant value or using common statistics)

LukeMathWalker commented 5 years ago

What do we have right now from that list? The answer is somehow dependent on what we choose as our reference data structure. For now I'll refer to ndarray given that we are still early days when it comes to dataframes.

Dimensionality reduction: SVD is available in ndarray-linalg. I am not aware of any ready-to-use implementation of PCA, ICA and NMF in the rust ecosystem. Off the top of my head though (mind it's been a while since I have touched a linear algebra book), all required primitive routines should be available in ndarray-linalg - can you help me here @termoshtt?
Scaling: they should all be implementable in terms of map and statistics already implemented in ndarray-stats and ndarray (see summary statistics, quantiles)
Encoding: I have found this routine in rustlearn for dictionary vectorization. There does not seem to be much either there or in rusty-machine. I foresee we will have to build these categorical encoders. Discretization could be implemented quite easily on top of the histogram functionality in ndarray-stats.
Missing values: the main issue I foresee here is not really the implementation of the imputation mechanism - for those easy methods it's going to be trivial. The thorny issue is representing missing data (which Arrow does quite well!). In ndarray you could now use arrays of Options, but performance would not be optimal: we would have to roll out masked arrays - they are somewhat under the radar but not there yet.

jbowles commented 5 years ago

Missing values: the main issue I foresee here is not really the implementation of the imputation mechanism - for those easy methods it's going to be trivial. The thorny issue is representing missing data (which Arrow does quite well!). In ndarray you could now use arrays of Options, but performance would not be optimal: we would have to roll out masked arrays - they are somewhat under the radar but not there yet.

I wonder how open Rust language team(s) would be to a Missing type... or if any discussions around this have happened? It would obviously be a long run to get a new native type into standard, especially something as controversial (and confusing) as this. But obviously the more closely aligned any kind of a Missing type is with Rust language design the easier it will be to integrate between packages that need it.

jblondin commented 5 years ago

That's a good list!

I'm not sure if it necessarily fits in 'preprocessing', but I would add tools for model selection:

Cross-validation (and general data-splitting functionality)
Grid search hyperparameter selection

If all estimator models implemented the same traits, we could use the same cross-validation framework over arbitrary learners. Included in this could be the idea of a common set of classification / regression metrics for model evaluation -- again, I'm not sure if this is exactly 'preprocessing' but does definitely cut across multiple areas of concern.

Another possible thing to add: pipeline management. I haven't used the sklearn pipeline tools personally, but some mechanism to let users easily pipe a set of transformation and an estimator together might be useful.

jblondin commented 5 years ago

For missing data representation, I feel like that should be handled at the DataFrame level (especially since dataframe will likely be at least partially backed by Arrow, which already does this via null bitmask), and imputation handled in the preprocessing library.

The representation does get a bit tricky. I implemented a simple masked array in agnes, but there are definite usability and performance issues:

How should iteration work? I used a Option-like enum called Value with variants Na and Exists(&T). Unfortunately, this means iteration is heavier (wrapping each data reference with an extra object).
Can we implement Index / IndexMut? These traits return references to an underlying data type, so we can't just wrap the value when called
More issues that I can't recall at the moment -- I definitely had a long list of them when working on it! :smile:

Related, this NumPy missing-data proposal from 2011 is an interesting read reviewing a lot of the issues when implementing missing data.

termoshtt commented 5 years ago

Good list for starting point.

all required primitive routines should be available in ndarray-linalg

Current ndarray-linalg lacks components to implement PCA. As scikit-learn document says, we need both full-SVD and truncated SVD for implementing various type of PCA, but ndarray-linalg does not have truncated SVD. I am not familiar with ICA and NMF, but similar status with PCA I guess.

I think these are linalg and not limited to ML. ndarray-linalg can accept them.

rth commented 5 years ago

but ndarray-linalg does not have truncated SVD.

Implementing the randomized truncated SVD solver would be quite useful. In scikit-learn it's the default solver for PCA and TruncatedSVD and is based on the paper by Halko, et al., 2009. I think in practice that will often be faster for ML applications than a full SVD solver.

Another topic is the support of sparse data. TruncatedSVD in scikit-learn is often used on sparse data. In rust such solver could be implemented e.g. on top of the sprs crate.

LukeMathWalker commented 5 years ago

Missing values: the main issue I foresee here is not really the implementation of the imputation mechanism - for those easy methods it's going to be trivial. The thorny issue is representing missing data (which Arrow does quite well!). In ndarray you could now use arrays of Options, but performance would not be optimal: we would have to roll out masked arrays - they are somewhat under the radar but not there yet.

I wonder how open Rust language team(s) would be to a Missing type... or if any discussions around this have happened? It would obviously be a long run to get a new native type into standard, especially something as controversial (and confusing) as this. But obviously the more closely aligned any kind of a Missing type is with Rust language design the easier it will be to integrate between packages that need it.

I think that type is already in the Rust language - it's Option, or any equivalent enum, but as @jblondin says it does imply a compromise when it comes to performance in iteration. The way Arrow handles it, using a separate bit mask, should allow it to be as close as possible to the speed of a normal array while giving the user the impression that everything is actually wrapped into an Option, in the Rust world. I agree it should be handled at the DataFrame level, but it would be nice to have somehow similar support also in the ndarray library. I am going to read through the link you posted @jblondin - that NumPy's discussion looks pretty interesting!

Another possible thing to add: pipeline management. I haven't used the sklearn pipeline tools personally, but some mechanism to let users easily pipe a set of transformation and an estimator together might be useful.

Definitely - defining a Pipeline trait of some sort should be one of our focuses. It makes it easier then to work on hyperparameter optimization, because you can do it directly at the pipeline level, allowing the user to tune hyperparameters living outside the model itself.

Current ndarray-linalg lacks components to implement PCA. As scikit-learn document says, we need both full-SVD and truncated SVD for implementing various type of PCA, but ndarray-linalg does not have truncated SVD. I am not familiar with ICA and NMF, but similar status with PCA I guess. I think these are linalg and not limited to ML. ndarray-linalg can accept them.

Gotcha - I think it makes sense to spec out exactly what we need for each of those algorithms and then we can start working on implementing the required primitives. Same goes with sparse matrices - I haven't checked the current status of sprs, but it should definitely be something we need to get covered. Would you like to join the conversation @vbarrielle?

LukeMathWalker commented 5 years ago

I managed to have a look at the NumPy document - if my understanding is correct, the bitpattern methodology that they describe is almost equivalent to having an ArrayBase using a data type that implements something similar to the MaybeNaN trait that we defined in ndarray-stats. The issue I see with our MaybeNaN trait is that we are trying to handle at the same time Option<T> and floats (to remove the possibility that any of the values in the array was a legitimate NaN value, instead of NA). If we just want to deal with NA values, then a bare Option should do it, with the proper support of companion traits to decide how to handle those missing values. While the mask methodology, as described in the design document instead of who it is implemented in numpy.ma, is very close to what Arrow is doing.

bytesnake commented 5 years ago

Hey, I just wanted to add that MFCC/MFSC are common preprocessing steps for machine learning in the context of audio processing. If you want to build an ASR system then this decorrelates your pitch and formant functions and reduces the data complexity. They are also used in room classification, instrument detection, actually anything that has to do with natural sound sources. A crate which does the windowing, transformation etc. would be great!

LukeMathWalker commented 5 years ago

I am not very familiar with the problem space @bytesnake - could you provide some references and resources we can have a look at?

bytesnake commented 5 years ago

Here are some introductions to MFCCs

https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd
http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
https://musicinformationretrieval.com/mfcc.html
L. Rabiner, B. H. Juang: Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.
Mel Frequency Cepstral Coefficients for Music Modeling.

I wrote a MFCC library for a class recently, you can find it here https://github.com/bytesnake/mfcc

Garvys commented 5 years ago

Hey there, Super interested in talking about ML in Rust ! I'm working on ASR :) Regarding standard ASR systems (Automatic Speech Recognition), they rely on language models that are stored as wFST (Weighted Finite States Transducer). The same goes for what is called the decoding graph. For the past months, I have been re-implementing in rust an FST library --> https://github.com/Garvys/rustfst It is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Main application is ASR but also used a lot in NLP. Hope it is the right place to talk about it :)

rust-ml / classical-ml-discussion

Preprocessing - What do we need? What do we have? #1