Open acmoudleysa opened 6 months ago
Although I have to mention that spectroscopic data are not of dtype `int`. The other classes (like SNV) face a similar problem with dtype. If you agree, we can just edit `check_input.py` and convert them to `float`. This way we won't have to make changes everywhere.
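For illustration, a minimal sketch of what such a cast inside a `check_input`-style helper could look like. The function body here is assumed for the example, not chemotools' actual implementation:

```python
import numpy as np

def check_input(X):
    """Hypothetical helper: promote non-float input (e.g. int intensities) to float."""
    X = np.asarray(X)
    if not np.issubdtype(X.dtype, np.floating):
        X = X.astype(np.float64)  # int or bool input becomes float64
    return X

print(check_input(np.array([[1, 2], [3, 4]])).dtype)  # float64
```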
Hey @acmoudleysa! Thanks a lot for bringing this up, and my sincere apologies for the delay in getting back to you! I'm thrilled to dive into this and explore some solutions together. Let me do a bit of digging, and I'll come back to you with some ideas. Once we've got a plan in place, we can jump right into implementation!
Out of curiosity, what led you to stumble upon spectra with `int`?
Thanks for flagging this, and looking forward to our next steps together!
Hi @paucablop. Well, I was working with NMR data and forgot to change the dtype to `float` (the intensities are integers).
Thank you! After some investigation, I've come to the conclusion that the most effective approach would be to validate the data during the fitting and transform process for each method. I'm in the process of transitioning away from using the `check_input.py` functions and instead incorporating `X = self._validate_data(X)` into our workflow. My suggestion is to handle this validation similarly to how it's done in other `scikit-learn` preprocessing methods.

For the `fit()` method:

```python
X = self._validate_data(X, dtype=FLOAT_DTYPES)
```

For the `transform()` method we could substitute the current:

```python
X = check_input(X)
X_ = X.copy()
```

with something like this:

```python
X_ = self._validate_data(X, dtype=FLOAT_DTYPES, copy=True)
```

This adjustment should be applied to both the `fit` and `transform` methods across all preprocessing functions in `chemotools`. Let me know if you're interested in contributing to this task, and I'll happily assign it to you so we can start discussing the implementation and testing!

Again, many thanks for your help with this :happy:
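For reference, in the scikit-learn versions current at the time, `self._validate_data(X, dtype=FLOAT_DTYPES)` delegates to the public `check_array` helper, so the dtype-casting behaviour can be sketched on its own (the array values here are made up):

```python
import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES  # (np.float64, np.float32, np.float16)

X_int = np.array([[1, 2, 3], [4, 5, 6]])        # int input, e.g. raw NMR intensities
X_val = check_array(X_int, dtype=FLOAT_DTYPES)  # not a float dtype -> cast to float64
print(X_val.dtype)  # float64
```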
Hi @paucablop. Yes, you can assign the task to me. I am happy to contribute.
Hi @acmoudleysa, awesome, thanks a lot for the help :v:
I have assigned you to the task and opened a branch associated with this issue for development (87-baselineshift-fails-on-2d-array-with-dtype=int). :star_struck:
Following our previous discussion, we should update both the `.fit()` and the `.transform()` methods for all the functions in the project.
After you have implemented the changes, we should also make sure to implement the corresponding tests in the test folder. Since most of the testing is handled by the `scikit-learn` testing checks for each function, I think it is enough if we only write a single test for one function (of your choice) proving that when an `int` array is given as input, we obtain an `np.float` array after applying the `.transform()` method.
If you have any questions or doubts, please do not hesitate to reach out! :smile:
Once again, thank you very much for your help! :muscle:
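The single-function test described above could look roughly like this. `StandardScaler` is used here purely as a stand-in (an assumption on my part): any chemotools preprocessing class with the same scikit-learn API would slot in identically.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler  # stand-in for the chosen chemotools class

def test_transform_returns_float_for_int_input():
    X_int = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)
    X_out = StandardScaler().fit(X_int).transform(X_int)
    # the transformed output must be a floating-point array
    assert np.issubdtype(X_out.dtype, np.floating)

test_transform_with_int = test_transform_returns_float_for_int_input  # alias for pytest collection styles
test_transform_returns_float_for_int_input()
```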
Hey @paucablop! I was going through the source code of the `_validate_data` method of the `BaseEstimator` class, and I realized that on this line, `_check_n_features(X, reset=reset)` does the check for the number of features during `transform`, which means we might not need these lines in our `transform` methods (I will check whether all of the preprocessing classes have this line):

```python
# Check that the number of features is the same as in the fitted data
if X_.shape[1] != self.n_features_in_:
    raise ValueError(f"Expected {self.n_features_in_} features but got {X_.shape[1]}")
```
The trick is to call `self._validate_data(..., reset=True)` inside the `fit` method, which sets `n_features_in_`, and `self._validate_data(..., reset=False)` inside `transform`, which makes sure the feature length matches (here).
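A minimal runnable sketch of this reset pattern. The compatibility shim is mine: newer scikit-learn releases expose this functionality as the public `sklearn.utils.validation.validate_data` helper instead of the `_validate_data` method.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

try:  # newer scikit-learn: public helper function
    from sklearn.utils.validation import validate_data
except ImportError:  # older scikit-learn: fall back to the method form
    def validate_data(estimator, X, **kwargs):
        return estimator._validate_data(X, **kwargs)

class IdentityTransformer(TransformerMixin, BaseEstimator):
    """Toy transformer illustrating reset=True in fit vs reset=False in transform."""

    def fit(self, X, y=None):
        validate_data(self, X, reset=True)  # stores X.shape[1] in self.n_features_in_
        return self

    def transform(self, X):
        # reset=False re-checks that X.shape[1] == self.n_features_in_
        return validate_data(self, X, reset=False)

t = IdentityTransformer().fit(np.zeros((4, 3)))
print(t.n_features_in_)  # 3
try:
    t.transform(np.zeros((4, 5)))  # wrong number of features
except ValueError:
    print("feature-count mismatch rejected")
```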
Hi @acmoudleysa,

> The trick is to call `self._validate_data(..., reset=True)` inside the `fit` method, which sets `n_features_in_`, and `self._validate_data(..., reset=False)` inside `transform`, which makes sure the feature length matches (here)

I'm currently working on #44, and there I already replaced it. I used `BaseEstimator._validate_data(..., reset=True)` to be explicit about which of the parent classes is used, because in the updated implementations, the Whittaker smoother and the two PLS-baseline algorithms inherit from 4 classes (see this example).
@paucablop @acmoudleysa Is there anything standing against a specific type conversion that relies on a very private class variable? For example, for the Whittaker smoother and the two PLS-baseline algorithms, it is numerically unstable to go for, e.g., `np.float32` for large signals, so the conversion has to go for `float64`. For some other estimators, it could make sense to go for `np.float32` for performance, because spectra usually don't have more significant digits (if at all). A class variable could then be used to solve this, like:
```python
class Estimator(...):
    """
    Estimates something
    """

    __dtype_work: Type = np.float64  # this is a (name-mangled) class variable

    def __init__(self, ...):
        ...

    def fit(self, X, ...):
        ...
        if X.dtype == self.__dtype_work:
            BaseEstimator._validate_data(self, X, reset=True)
        else:
            # X = X.astype(self.__dtype_work) might be required for some
            # estimators if X is used further in fit
            BaseEstimator._validate_data(self, X.astype(self.__dtype_work), reset=True)
        ...
```
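A runnable toy version of the same idea, with hypothetical class names and a single-underscore variable in place of the name-mangled `__dtype_work`:

```python
import numpy as np

class WorkingDtypeMixin:
    """Casts input to a per-class working dtype before processing."""

    _dtype_work = np.float64  # default working dtype, overridable per subclass

    def _as_work_dtype(self, X):
        X = np.asarray(X)
        return X if X.dtype == self._dtype_work else X.astype(self._dtype_work)

class Float32Estimator(WorkingDtypeMixin):
    _dtype_work = np.float32  # cheaper dtype where precision suffices

    def fit_transform(self, X):
        Xw = self._as_work_dtype(X)  # actual estimator logic would operate on Xw
        return Xw

print(Float32Estimator().fit_transform(np.array([[1, 2, 3]])).dtype)  # float32
```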
@paucablop @acmoudleysa Regarding the tests, do you think it would make sense to just test all estimators with `pytest.mark.parametrize` and specify the estimator involved and the arrays of integers to fit and transform? If there is only a test for a single estimator, this feels a bit like going against the idea behind automated tests. A test could be as simple as:
```python
# the following spectra could also be fixtures
SPECTRUM_1D_00 = ...  # some 1D spectrum of integers
SPECTRUM_1D_01 = ...  # another 1D spectrum of integers
SPECTRUM_2D_00 = ...  # some 2D spectrum of integers
SPECTRUM_2D_01 = ...  # another 2D spectrum of integers


@pytest.mark.parametrize(
    "combination",
    [
        (WhittakerSmooth, SPECTRUM_1D_00),  # could also hold an expected result, but the test's purpose is not correctness
        (WhittakerSmooth, SPECTRUM_2D_00),
        ...
        (NorrisWilliams, SPECTRUM_1D_01),
        (NorrisWilliams, SPECTRUM_2D_01),
        ...
    ],
)
def test_fit_transform_with_int(combination):
    estimator, X = combination
    # if this call runs without raising, the test passes
    estimator.fit_transform(X)
```
Just some ideas. Maybe you have some feedback or further suggestions.
Hi,
I tested the new version of `ArPLS` refactored for #44, and it works like a charm for `dtype=int` without any prior conversion outside the class.
One can see the discretization effect due to the integer representation, but the baseline correction still worked because it internally used a `float64` representation to avoid numerical difficulties.
It uses the class variable approach to do the conversion to the type required for the transformation.
Best regards
The `BaselineShift` class fails on an `int` dtype 2D array. The issue lies in the implementation here.

There are several fixes for this, including an easy one, which is to just cast the array being transformed to `float` dtype. Alternatively, we can use `numpy.apply_along_axis`, which fixes the issue by first creating an array of zeros (which is of `float` dtype by default) and then doing the replacement. I don't know if this is a major issue, but a fix could be useful regardless of its importance.

Update: I see that other augmentation classes have a similar implementation and therefore the same issue. Should I make the changes and open a pull request? Let me know (with the solution of your choice).
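The exact failure mode depends on the implementation, but the hazard of writing float results into an `int` output buffer can be sketched in plain numpy (values made up for illustration):

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])  # int dtype, as in the reported case

# Writing float results into the int buffer silently truncates them
X_bad = X.copy()
X_bad[0] = X[0] * 1.5          # [1.5, 3.0, 4.5] stored as [1, 3, 4]
print(X_bad[0])                # [1 3 4]

# Casting the output buffer to float first keeps the fractional part
X_good = X.astype(float)
X_good[0] = X[0] * 1.5
print(X_good[0])               # [1.5 3.  4.5]
```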