neuromodulation / py_neuromodulation

Real-time analysis of intracranial neurophysiology recordings.
https://neuromodulation.github.io/py_neuromodulation/
MIT License
47 stars 11 forks

Pydantic validation settings #329

Closed timonmerk closed 5 months ago

timonmerk commented 5 months ago

@toni-neurosc Sorry for opening this PR so soon, but I had a look at this branch and for the most part I can understand it. Would it be possible to adapt one example from the examples folder s.t. I could follow along and also help adapt the other examples / tests?

toni-neurosc commented 5 months ago

No problem at all, I'll get to it after lunch, the tests/examples were basically the next thing in line, I think I should be able to finish the bulk of it today

timonmerk commented 5 months ago

Hey @toni-neurosc, thanks so much for those changes. I saw you now already added some burst addition in this PR. Sorry for messing with that.

I now did a merge that resulted in conflicts, but I wanted to push my additions in nm_bursts to this branch as well.

There are some points I want to add. I looked at all the points that you marked with "TONI". There is one test currently failing: test_osc_features:test_fft_wrong_logtransform_param_init. Does Pydantic have a way to also validate fields after initialization? I thought that model_validator(mode="after") would cover that, and you actually set it as a decorator on NMSettings.validate_settings, so I am not sure why this one doesn't catch it...

Regarding your comment about zero divisions, there is actually a test that checks for all values being zero or NaN in test_nan_values.py. So in this case there might be warnings, especially when zero values are log-transformed (e.g. in nm_oscillatory) which results in -np.inf. But this should be expected behavior.

There was also a bug (I think) in NMSettings.enable_all_features; I changed it so that all features are enabled instead of disabled.

Apart from that, the changes are great. I would also start adapting the docstrings so that they match the parameters of each function.

toni-neurosc commented 5 months ago

Hello Timon, thanks for taking the time to review, it's a huge change and the more eyes we have on this the better. I'll answer all your questions, but I think most issues are already solved.

I now did a merge that resulted in conflicts, but I wanted to push my additions in nm_bursts to this branch as well.

Fixed! It was just a merge that kept both old and new code

There are some points I want to add. I looked at all the points that you marked with "TONI". There is one test currently failing: test_osc_features:test_fft_wrong_logtransform_param_init. Does Pydantic have a way to also validate fields after initialization? I thought that model_validator(mode="after") would cover that, and you actually set it as a decorator on NMSettings.validate_settings, so I am not sure why this one doesn't catch it...

Good catch, I had some trouble with that too, and I was experimenting locally with a bunch of possibilities. I'll explain so you can weigh in:

What do you think? I think we should be safe by adding settings.validate() in all NMFeature constructors and in Stream initialization. In fact, the Stream alone should be enough, but since this is not a performance-sensitive part of the code, better safe than sorry, I guess.
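For reference, a minimal sketch (not the actual NMSettings class; field names here are invented) of the two Pydantic mechanisms under discussion: a model_validator(mode="after") runs once at construction, and it only re-runs on later field assignments when validate_assignment is enabled in the model config.

```python
from pydantic import BaseModel, ConfigDict, model_validator


class FFTSettings(BaseModel):
    # Hypothetical stand-in for the real settings model.
    # validate_assignment=True re-runs validation on every assignment,
    # not just at __init__ time.
    model_config = ConfigDict(validate_assignment=True)

    log_transform: bool = True
    kalman_filter: bool = False

    @model_validator(mode="after")
    def validate_settings(self):
        # Invented cross-field rule, purely for illustration.
        if self.kalman_filter and not self.log_transform:
            raise ValueError("kalman_filter requires log_transform")
        return self


s = FFTSettings()
try:
    s.log_transform = "definitely not a bool"  # rejected at assignment time
except Exception as exc:
    print(type(exc).__name__)
```

Without validate_assignment, the bad assignment above would silently succeed and only blow up at the next explicit re-validation, which matches the failing-test behavior described here.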

Regarding your comment about zero divisions, there is actually a test that checks for all values being zero or NaN in test_nan_values.py. So in this case there might be warnings, especially when zero values are log-transformed (e.g. in nm_oscillatory) which results in -np.inf. But this should be expected behavior.

The warnings I was referring to were not in the tests; they appear during normalization when I run the software on randomly generated data. I think they result from 0/0 divisions. Maybe that's expected behavior, but I'm not sure.
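For illustration (this is generic NumPy behavior, not the project's actual code path), both situations mentioned here produce a RuntimeWarning: log-transforming zero yields -inf, and a 0/0 during normalization yields NaN.

```python
import warnings

import numpy as np

x = np.zeros(3)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    logged = np.log(x)    # divide-by-zero warning, values become -inf
    normed = x / x.sum()  # 0 / 0 -> invalid-value warning, values become nan

print(logged)  # [-inf -inf -inf]
print(normed)  # [nan nan nan]
```

So -inf from a log transform and NaN from normalizing an all-zero window are both "expected" in the sense that NumPy warns rather than raises.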

There was also a bug (I think) in NMSettings.enable_all_features; I changed it so that all features are enabled instead of disabled.

Yes, that was totally a bug on my side, I also fixed it on my local so now it's doubly fixed for good measure :P

Apart from that, the changes are great. I would also start adapting the docstrings so that they match the parameters of each function.

As soon as I push the "custom feature" feature I'm on it.

timonmerk commented 5 months ago

That's great! Thanks a lot for the explanations and for the fix. It makes sense to call the validation at a later point, e.g. when the run method is called to start feature estimation.

On a smaller note, I think the BaseModel validate method is now the model_validate method right? https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel

Also, I think it would be great to use this PR to get rid of the `nm_settings.json` and replace it with an `nm_settings.yaml`. I wanted to do that for quite some time, but now it's finally interchangeable.

toni-neurosc commented 5 months ago

Hi Timon,

That's great! Thanks a lot for the explanations and for the fix. It makes sense to call the validation at a later point, e.g. when the run method is called to start feature estimation.

Agree.

On a smaller note, I think the BaseModel validate method is now the model_validate method right? https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel

Yes, but the validate() that I'm calling is one I implemented myself in the NMBaseModel class, which in turn calls Pydantic's model_validate(). So it's not the deprecated validate method from Pydantic itself.
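A minimal sketch of that wrapper pattern (hypothetical; the real NMBaseModel and its subclasses may differ): an instance method that re-runs full Pydantic validation on the current field values via model_validate, so callers never touch the deprecated BaseModel.validate.

```python
from pydantic import BaseModel


class NMBaseModel(BaseModel):
    def validate(self):  # type: ignore[override]
        """Re-run full Pydantic validation on the current field values."""
        # model_validate re-validates from a plain dict of the fields
        # and returns a freshly validated instance.
        return type(self).model_validate(self.model_dump())


class Settings(NMBaseModel):
    # Invented field for illustration only.
    sampling_rate_features_hz: float = 10.0


s = Settings()
validated = s.validate()  # a freshly validated Settings instance
```

Calling settings.validate() in each NMFeature constructor and at Stream initialization, as proposed above, then amounts to one cheap model_validate round-trip per call site.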

Also, I think it would be great to use this PR to get rid of the `nm_settings.json` and replace it with an `nm_settings.yaml`. I wanted to do that for quite some time, but now it's finally interchangeable.

Done!
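For reference, a sketch of what the JSON-to-YAML swap looks like on the loading side (the file name comes from this thread; the loader shown is plain PyYAML with invented example keys, not necessarily the project's implementation):

```python
import yaml  # PyYAML

# Stand-in for the contents of nm_settings.yaml (keys are illustrative).
yaml_text = """
sampling_rate_features_hz: 10
features:
  fft: true
  bursts: true
"""

# safe_load parses YAML into plain Python dicts/lists/scalars,
# exactly like json.load would for the old nm_settings.json.
settings = yaml.safe_load(yaml_text)
print(settings["features"]["fft"])  # True
```

Since both formats deserialize to the same nested dict, the Pydantic model can be built from either with the same model_validate call, which is what makes the formats interchangeable.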

Additionally, I have a question about the line_noise parameter. I was trying to create a NotchSettings class the same way we have settings classes for everything else, and I'm not sure if line_noise relates to NotchSettings only, because I have seen that the read_BIDS_data function also takes it. So could you briefly explain what line_noise represents? If it's something specific to the experiment, we can keep it in NMSettings with its current default of 50.

timonmerk commented 5 months ago

Hey @toni-neurosc, thanks for the explanations. I think I need to go through the nm_burst function in detail, but I get the overall concept and trust that the result is the same as the computation from before. Ideally there would be a reference test with the simplest possible for-loop implementation: very easy to understand (and maintain), even if very inefficient.
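As a sketch of what such a reference test could look like (the threshold-run definition of a burst here is assumed for illustration; the real nm_bursts criteria may differ), a plain loop that collects the lengths of consecutive supra-threshold runs:

```python
import numpy as np


def naive_burst_durations(signal, threshold):
    """Slow but obvious reference: lengths of consecutive runs above threshold."""
    durations, run = [], 0
    for sample in signal:
        if sample > threshold:
            run += 1
        elif run:
            durations.append(run)
            run = 0
    if run:  # close a burst that runs to the end of the signal
        durations.append(run)
    return durations


sig = np.array([0.1, 2.0, 2.5, 0.2, 3.0, 3.1, 3.2, 0.0])
print(naive_burst_durations(sig, threshold=1.0))  # [2, 3]
```

A test could then compare this reference against the vectorized nm_burst output on random data, with a tolerance for floating-point rounding.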

Regarding line_noise, that is the power grid line noise: 50 Hz in Europe, 60 Hz in the US. I think it also needs to be passed to the run function, same as the sampling rate, since it depends on the specific recording. So I guess the check could similarly be done with an assert, like the sampling rate check here? https://github.com/neuromodulation/py_neuromodulation/blob/6777221a24566078ced3e44923888b1ae2d0caf8/py_neuromodulation/nm_stream_abc.py#L72

toni-neurosc commented 5 months ago

Hey @toni-neurosc, thanks for the explanations. I think I need to go through the nm_burst function in detail, but I get the overall concept and trust that the result is the same as the computation from before. Ideally there would be a reference test with the simplest possible for-loop implementation: very easy to understand (and maintain), even if very inefficient.

I did many checks; I think the only differences are caused by floating-point rounding errors. You can compare the new and old results yourself.

Regarding line_noise, that is the power grid line noise: 50 Hz in Europe, 60 Hz in the US. I think it also needs to be passed to the run function, same as the sampling rate, since it depends on the specific recording. So I guess the check could similarly be done with an assert, like the sampling rate check here?

https://github.com/neuromodulation/py_neuromodulation/blob/6777221a24566078ced3e44923888b1ae2d0caf8/py_neuromodulation/nm_stream_abc.py#L72

I see, so this is the baseline noise caused by the electrical supply... I was thinking about having it in settings, but I guess it makes sense that you would want the settings to be agnostic to the specific machine and limited to data processing options. So it makes sense to have it somewhere else; Stream initialization is fine, I guess. But it's really only ever used by NotchFilter, so I found it annoying to have to toss it around in function parameters all over the project: it's only used in line 143 of nm_filter.py, yet it appears in 38 places across 8 files, just being passed around non-stop... I guess it'll stay like that for now 🤷
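To make the line_noise discussion concrete, here is a generic sketch of notching out the mains frequency (using scipy.signal directly; the function name and defaults are invented and this is not the project's NotchFilter implementation):

```python
import numpy as np
from scipy.signal import filtfilt, iirnotch


def remove_line_noise(data, sfreq, line_noise=50.0, quality=30.0):
    """Zero-phase notch at the power-line frequency (50 Hz EU, 60 Hz US)."""
    b, a = iirnotch(w0=line_noise, Q=quality, fs=sfreq)
    return filtfilt(b, a, data)  # forward-backward pass: no phase distortion


sfreq = 1000
t = np.arange(sfreq) / sfreq
clean = np.sin(2 * np.pi * 10 * t)                 # 10 Hz signal of interest
noisy = clean + 0.5 * np.sin(2 * np.pi * 50 * t)   # plus 50 Hz mains hum

filtered = remove_line_noise(noisy, sfreq, line_noise=50.0)
```

Because the notch is narrow (bandwidth roughly line_noise / quality), nearby physiological frequencies pass through almost untouched while the mains component is strongly attenuated.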

toni-neurosc commented 5 months ago

I added a couple more commits that reduced the computation time for sharpwaves by a further 56% (from 3.52s to 1.55s in a 10k sample x 5 channel artificial dataset).

Sadly, some of the refactoring I did broke Example 3. I could patch it with some dirty fixes, but first I wanted to discuss a bit what that example is about, because it is using internal private functions like _initialize_sw_features, injecting data into sw_analyze, calling signal.find_peaks directly, pulling the filtered data out of the sw_analyzer... It seems to me that in this particular example PyNM is not helping much and everything is being done with other modules. If we want PyNM to be able to easily calculate sharpwave features and plot them, we should provide the appropriate interface.

The example_3 test will remain broken for the time being, everything else is passing.

Also, look at this beauty: Sharpwaves and Bursts at the same level and on par with Hjorth: [benchmark image]

timonmerk commented 5 months ago

Hey @toni-neurosc, the concept of temporal waveform shape is overall rarely used in the invasive electrophysiology community. Therefore, with this example we aimed to present the overall concept behind waveform shape: what characteristics can be computed, how they relate to movement, and also how they can be used for decoding.

I know that it used quite a "hacky" way of presenting the concept, and probably a better way is to do the convolution, peak finding, etc. in the example itself without depending on the nm_sharpwave code.

So it would be great if you could also adapt the example to show similar functionality with the speed improvements. I really appreciate the performance increase, but I definitely want to keep the example in the documentation for explainability.

timonmerk commented 5 months ago

Hey @toni-neurosc, I am getting an error with your optimized sharpwave module import:

```
from numpy.core._methods import _mean as np_mean
ModuleNotFoundError: No module named 'numpy.core._methods'
```

Could that be again an OS-specific thing?

toni-neurosc commented 5 months ago

Hey @toni-neurosc, I am getting an error with your optimized sharpwave module import:

```
from numpy.core._methods import _mean as np_mean
ModuleNotFoundError: No module named 'numpy.core._methods'
```

Could that be again an OS-specific thing?

Well, no. What happened is that NumPy 2.0.0 was released just 4 hours ago, and that's what pip installed on the test server xD Weird coincidence. We probably wouldn't even have noticed if I hadn't decided to use a NumPy internal function to squeeze out some performance lol. I'm going to update NumPy on my local machine and fix it.

EDIT: found the relevant change, this caused the error: https://numpy.org/devdocs/numpy_2_0_migration_guide.html#numpy-core-namespace
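Based on the migration guide linked above, a version-agnostic sketch of handling the rename (with a fallback to the public API, which is the safer choice since the underscored modules are not a stable interface):

```python
import numpy as np

try:
    # NumPy >= 2.0: internal modules moved from numpy.core to numpy._core
    from numpy._core._methods import _mean as np_mean
except ImportError:
    try:
        # NumPy < 2.0: the old internal location
        from numpy.core._methods import _mean as np_mean
    except ImportError:
        # Internals can change without notice; the public API always works.
        np_mean = np.mean

print(np_mean(np.array([1.0, 2.0, 3.0])))  # 2.0
```

The performance win from the internal _mean comes from skipping np.mean's argument dispatch, so it only matters on very hot paths with small arrays.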

toni-neurosc commented 5 months ago

Ok, @timonmerk, I fixed the NumPy issue. It was nothing: they just changed the names of the internal module files with the version change, no big deal. I basically added a couple of underscores and it's working again.

Also a couple more changes:

toni-neurosc commented 5 months ago

Hey @toni-neurosc, the concept of temporal waveform shape is overall rarely used in the invasive electrophysiology community. Therefore, with this example we aimed to present the overall concept behind waveform shape: what characteristics can be computed, how they relate to movement, and also how they can be used for decoding.

I know that it used quite a "hacky" way of presenting the concept, and probably a better way is to do the convolution, peak finding, etc. in the example itself without depending on the nm_sharpwave code.

So it would be great if you could also adapt the example to show similar functionality with the speed improvements. I really appreciate the performance increase, but I definitely want to keep the example in the documentation for explainability.

Ok, I will try to rework the example and maybe implement some functions to expose the results of calc_feature without having to manually re-run the steps of calc_feature in the example itself.

timonmerk commented 5 months ago

@toni-neurosc, I've now fixed the sharpwave example. It was much simpler than expected. I also had to adapt some dependencies so that the numpy.core functions could be imported.

I also fixed the add-feature example so that it passes the automatic test, and added a function in nm_IO to read raw_data. This fixes #304.

So feel free to add a better way of adding a feature after stream initialization. Otherwise, you can merge this PR. Many thanks again for the crazy amount of work adapting basically the whole package!