paucablop / chemotools

Integrate your chemometric tools with the scikit-learn API 🧪 🤖
https://paucablop.github.io/chemotools/
MIT License

BaselineShift fails on 2d array with `dtype=int` #87

Open acmoudleysa opened 6 months ago

acmoudleysa commented 6 months ago

The BaselineShift class fails on a 2d array with int dtype. The issue lies in the implementation here, and boils down to this:

import numpy as np

x = np.array([[1, 2, 3]])  # dtype int
new_x = np.add(x, 0.5)     # dtype float
x[0] = new_x               # the assignment silently casts back to int
x  # >>> array([[1, 2, 3]]) -- the added shift is truncated away because the dtypes differ

There are several fixes for this. An easy one is simply to cast the array being transformed to float before applying the shift. Alternatively, we could use numpy.apply_along_axis, which avoids the problem because it first creates an output array of zeros (float dtype by default) and then fills it with the transformed rows.
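
A minimal sketch of the first fix (the variable names are illustrative, not the actual chemotools implementation): cast to float before shifting, so the in-place assignment no longer truncates.

import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])  # int input
X_ = X.astype(np.float64)             # work on a float copy
for i, x in enumerate(X_):
    X_[i] = np.add(x, 0.5)            # stays float, the shift is preserved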

I don't know if this is a major issue, but fixing it could be useful regardless of its importance.

Update: I see that the other augmentation classes have a similar implementation and therefore the same issue. Should I make the changes and open a pull request? Let me know (together with the solution of your choice).

acmoudleysa commented 6 months ago

I have to mention, though, that spectroscopic data are usually not of dtype int.

Other classes (like SNV) face a similar problem with the dtype. If you agree, we can just edit check_input.py and convert the input to float there. That way we won't have to make changes everywhere.
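
A minimal sketch of that idea, assuming check_input wraps scikit-learn's check_array (the actual chemotools helper may differ):

from sklearn.utils.validation import check_array

def check_input(X):
    # casting to float64 up front means every transformer downstream works on
    # floats, so in-place assignments no longer truncate the values
    return check_array(X, ensure_2d=True, dtype="float64")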

paucablop commented 6 months ago

Hey @acmoudleysa! 🌟 Thanks a lot for bringing this up, and my sincere apologies for the delay in getting back to you! I'm thrilled to dive into this and explore some solutions together. 😊 Let me do a bit of digging, and I'll come back to you with some ideas. Once we've got a plan in place, we can jump right into implementation! 🚀

Out of curiosity, what led you to stumble upon spectra with int? 💪

Thanks for flagging this and looking forward to our next steps together! 🎉

acmoudleysa commented 5 months ago

Hi @paucablop. Well, I was working with NMR data and forgot to change the dtype to float (the intensities are integers).

paucablop commented 5 months ago

Thank you! After some investigation, I've come to the conclusion that the most effective approach would be to validate the data during the fitting and transform process for each method. I'm in the process of transitioning away from using the check_input.py functions and instead incorporating X = self._validate_data(X) into our workflow. My suggestion is to handle this validation similarly to how it's done in other scikit-learn preprocessing methods:

For the fit() method:

X = self._validate_data(X, dtype=FLOAT_DTYPES)

For the transform() method we could substitute the current:

X = check_input(X)
X_ = X.copy()

by something like this:

X_ = self._validate_data(X, dtype=FLOAT_DTYPES, copy=True)

This adjustment should be applied to both the fit and transform methods across all preprocessing functions in chemotools. Let me know if you're interested in contributing to this task and I'll happily assign it to you and we can start discussing the implementation and testing!
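
A minimal sketch of what a transformer could look like after that substitution (the class name and the body of fit are illustrative, not an actual chemotools estimator; the reset handling follows the usual scikit-learn convention):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import FLOAT_DTYPES, check_is_fitted

class SomePreprocessor(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # casts int input to float and records n_features_in_
        X = self._validate_data(X, dtype=FLOAT_DTYPES, reset=True)
        return self

    def transform(self, X):
        check_is_fitted(self)
        # float copy of the input; reset=False re-checks the feature count
        X_ = self._validate_data(X, dtype=FLOAT_DTYPES, copy=True, reset=False)
        # ... apply the preprocessing to X_ ...
        return X_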

Again, many thanks for your help with this :happy:

acmoudleysa commented 5 months ago

Hi @paucablop. Yes you can assign the task to me. I am happy to contribute.

paucablop commented 5 months ago

Hi @acmoudleysa awesome, thanks a lot for the help :v:

I have assigned you to the task, and opened a branch associated with this issue for the development (87-baselineshift-fails-on-2d-array-with-dtype=int). :star_struck:

Scope

Following our previous discussion we should update both the .fit() and the .transform() methods for all the functions in the project.

Testing

After you have implemented the changes, we should also make sure to add the corresponding tests in the test folder. Since most of the testing is handled by the scikit-learn estimator checks for each function, I think it is enough to write a single test for one function (of your choice) proving that when an int array is given as input, we obtain a float array after applying the .transform() method.
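
A minimal sketch of such a test, assuming the validation change is in place (the estimator and its import path are just an example):

import numpy as np
from chemotools.scatter import StandardNormalVariate

def test_transform_int_input_returns_float():
    X = np.array([[1, 2, 3], [4, 5, 6]])           # int input
    X_t = StandardNormalVariate().fit_transform(X)
    assert X_t.dtype == np.float64                 # output is float after transform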

If you have any questions or doubts, please do not hesitate to reach out! :smile:

Once again, thank you very much for your help! :muscle:

acmoudleysa commented 5 months ago

Hey @paucablop! I was going through the source code of the _validate_data method of the BaseEstimator class, and I realized that on this line, _check_n_features(X, reset=reset) already checks the number of features during transform, which means we might not need these lines in our transform methods (I will check whether all of the preprocessing classes have them):

        # Check that the number of features is the same as the fitted data
        if X_.shape[1] != self.n_features_in_:
            raise ValueError(f"Expected {self.n_features_in_} features but got {X_.shape[1]}")

The trick is to call self._validate_data(..., reset=True) inside the fit method, which sets n_features_in_, and self._validate_data(..., reset=False) inside transform, which makes sure the number of features matches (Here)
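
For illustration (SomePreprocessor is the hypothetical transformer sketched further up in this thread, not an actual chemotools class), the mismatch is then caught without any manual check:

import numpy as np

est = SomePreprocessor().fit(np.array([[1, 2, 3]]))  # reset=True sets n_features_in_ = 3
est.transform(np.array([[4, 5, 6]]))                 # OK: same number of features, returns a float copy
est.transform(np.array([[4, 5]]))                    # reset=False makes _check_n_features raise a ValueError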

MothNik commented 4 months ago

Hi,

@acmoudleysa

The trick is to call self._validate_data(..., reset=True) inside the fit method, which sets n_features_in_, and self._validate_data(..., reset=False) inside transform, which makes sure the number of features matches (Here)

I'm currently working on #44 and there I already replaced it. I used BaseEstimator._validate_data(..., reset=True) to be more explicit about which of the parent classes is used, because in the updated implementations the Whittaker smoother and the two PLS-baseline algorithms inherit from 4 classes (see this example).

@paucablop @acmoudleysa Is there anything speaking against a specific type conversion that relies on a very private class variable? For example, for the Whittaker smoother and the two PLS-baseline algorithms, it is numerically unstable to use, e.g., np.float32 for large signals, so the conversion has to go to float64. For some other estimators, it could make sense to use np.float32 for performance, because spectra usually don't have that many significant digits (if at all). A class variable could then be used to solve this, like:

from typing import Type

import numpy as np
from sklearn.base import BaseEstimator

class Estimator(...):
    """
    Estimates something
    """

    __dtype_work: Type = np.float64  # this is a class variable

    def __init__(self, ...):
        ...

    def fit(self, X, ...):
        ...
        if X.dtype == self.__dtype_work:
            X = BaseEstimator._validate_data(self, X, reset=True)
        else:
            # X = X.astype(self.__dtype_work) might be required for some estimators if X is used further in fit
            X = BaseEstimator._validate_data(self, X.astype(self.__dtype_work), reset=True)

    ...
@paucablop @acmoudleysa Regarding the tests, do you think it would make sense to just test all estimators with pytest.mark.parametrize, specifying the estimator involved and the arrays of integers to fit and transform? If there is only a test for a single estimator, this feels a bit like going against the idea behind automated tests. A test could be as simple as:

import pytest

# the following spectra could also be fixtures
SPECTRUM_1D_00 = ...  # some 1D spectrum of integers
SPECTRUM_1D_01 = ...  # another 1D spectrum of integers
SPECTRUM_2D_00 = ...  # some 2D spectrum of integers
SPECTRUM_2D_01 = ...  # another 2D spectrum of integers

@pytest.mark.parametrize(
    "combination", [
        (WhittakerSmooth, SPECTRUM_1D_00),  # could also hold an expected output, but the test's purpose is not correctness
        (WhittakerSmooth, SPECTRUM_2D_00),
        ...
        (NorrisWilliams, SPECTRUM_1D_01),
        (NorrisWilliams, SPECTRUM_2D_01),
        ...
    ],
)
def test_fit_transform_with_int(combination):
    estimator, X = combination
    estimator().fit_transform(X)  # if this runs without raising an error, the test passes
Just some ideas. Maybe you have some feedback or further suggestions 🙃

MothNik commented 3 months ago

Hi, I tested the new version of ArPLS refactored for #44 and it works like a charm for dtype=int without any prior conversion outside the class:

[Figure: ArPLS_Dtype_Integer]

One can see the discretization effect due to the integer representation, but the baseline correction still worked because it internally used a float64 representation to avoid numerical difficulties.

It uses the class variable approach to do the conversion to the type required for the transformation.

Best regards