jessecusack / perturb

A tool for processing microstructure data from Rockland Scientific instruments
GNU General Public License v3.0

Refactor for parallelization #22

Closed · mousebrains closed this 1 year ago

mousebrains commented 1 year ago

A refactoring of the code, which includes support for the Parallel Computing Toolbox.

The single-threaded refactored code is ~20-30% faster.

On my 10-core M1 MacBook, the wall-clock time for a parallelized job is about 15-25% of that of the single-threaded job.

jessecusack commented 1 year ago

@mousebrains, excellent. What issues does the refactor cover? Have you reduced the processed file sizes? Is the parallel aspect optional?

mousebrains commented 1 year ago

There is a CHANGELOG.md that starts to describe some of the changes.

No, the parallelization is not "optional." If you don't have Matlab's Parallel Computing Toolbox, the code will run as a single process. If you do have the toolbox, the code will start a process pool, using the configuration the user has set or, failing that, whatever Matlab determines. It is on my plate to sort out how to have the Parallel Computing Toolbox installed but force parfor to not use it. But that is a low priority for now.
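
A minimal sketch of the fallback behaviour described above (not the perturb code itself; `nProfiles` and `process_profile` are hypothetical placeholders):

```matlab
% Minimal sketch of the fallback behaviour: with the Parallel Computing
% Toolbox, parfor uses the configured process pool; without it, the same
% loop runs as a plain serial loop in a single process.
nProfiles = 8;                 % hypothetical number of profiles
results = cell(nProfiles, 1);

% Start a process pool only if the toolbox is licensed and no pool exists.
if license('test', 'Distrib_Computing_Toolbox') && isempty(gcp('nocreate'))
    parpool('Processes');      % honours the user's parallel preferences
end

parfor i = 1:nProfiles         % runs serially when no pool is available
    results{i} = process_profile(i);  % hypothetical per-profile worker
end
```

On the "force parfor to not use it" point: the `parfor (i = 1:n, M)` form with `M = 0` caps the worker count at zero, so the loop runs serially on the client even when the toolbox is installed.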

The file sizes are smaller, but only by ~20%. Part of the deep dive, after getting the NetCDF file generation restored, is to dig through each of the generated files and see what I want to keep/add/subtract. I'm very hesitant to change the output of odas_p2mat, but to save space I'll have to. The profiles are where I can potentially save the most space, but I want to keep enough to debug the spectra.

jessecusack commented 1 year ago

@mousebrains we should use an arithmetic mean on the non-logarithmic values of epsilon when binning dissipation. Averaging the logarithms and then exponentiating is equivalent to computing the geometric mean. The geometric mean is not physical because it doesn't conserve energy.

Consider a 1 m3 volume of water labelled V1 and an adjacent 1 m3 of water labelled V2. Assume the dissipation rate of TKE in V1 is e1 = 1e-6 W/kg and in V2 is e2 = 1e-7 W/kg. Assume the density of this water is rho = 1000 kg/m3. The amount of energy dissipated in the total volume V1 + V2 in 1 second is e1 rho V1 + e2 rho V2 = 0.0011 J. If we average the dissipation rate using an arithmetic mean and do the same calculation, we get the same answer: 1/2 (e1 + e2) rho (V1 + V2) = 0.0011 J. If we use the geometric mean, we get a different answer: sqrt(e1 e2) rho (V1 + V2) = 6.3246e-04 J. If the volumes or densities of these two water masses were different, we'd need to use a weighted mean. In any case, the geometric mean does not give the right answer for the amount of energy dissipated!
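
A quick numerical check of the example above, using the same values:

```matlab
% Numerical check of the worked example above.
e1 = 1e-6;  e2 = 1e-7;   % dissipation rates [W/kg]
rho = 1000;              % density [kg/m^3]
V1 = 1;  V2 = 1;         % volumes [m^3]

E_true  = (e1*V1 + e2*V2) * rho;            % 1.1000e-03 J over 1 s
E_arith = mean([e1 e2]) * rho * (V1 + V2);  % 1.1000e-03 J, matches
E_geom  = sqrt(e1*e2)   * rho * (V1 + V2);  % 6.3246e-04 J, does not
```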

By a similar argument, we should not be using the median to bin any data.

mousebrains commented 1 year ago

We'll have to have a discussion about this.

I understand your argument, and it makes sense when considering two different volumes.

At the shear probe level, we are measuring the same volume with two different sensors. To estimate the "true" value, it is easiest mathematically to work in the space where the distribution is normal; epsilon is approximately log-normally distributed, so that is log space. We're not trying to conserve energy, rather to estimate the "true" energy.

Now for the caveat to what I say: to get the dissipation, we take a lot of samples and calculate a spectrum over a volume of water, which we then fit to the Nasmyth curves to estimate the dissipation.

For binning, I'm fine with taking the arithmetic mean, but I'm pretty sure we'll want the geometric mean between shear probe dissipation estimates.
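
As a sketch of that distinction (illustrative values only, not the perturb code), combining two simultaneous probe estimates in log space amounts to taking their geometric mean:

```matlab
% Hypothetical illustration: combining simultaneous shear-probe
% estimates of the same volume by averaging in log space.
eps_probes = [2.3e-9, 4.1e-9];          % example values [W/kg]
eps_hat = exp(mean(log(eps_probes)));   % geometric mean
% For two probes this equals sqrt(eps_probes(1) * eps_probes(2)).
```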

mousebrains commented 1 year ago

On median/mean, I give the user the option. Someone who deals in robust statistics wants the median; others want the mean. Having come out of a very noisy, non-normal world, I believe very strongly in robust statistics.

jessecusack commented 1 year ago

Ok great that we're agreed on arithmetic mean for depth binning!

Averaging across simultaneous shear probe measurements is different, like you say, so I am prepared to believe that the geometric mean might be appropriate in that case. Does Rolf address this in one of his papers?

As for median vs. mean: we should make the mean the default setting, since it is physical and we anticipate this package being used by physical oceanographers. I understand the worry about outliers, but we should be dealing with most of that problem using despiking methods plus thermal lag / mass corrections on the raw data (which I know we do already). By the way, do we despike raw conductivity? If not, I think we should...

mousebrains commented 1 year ago

All depth binning is done with the user-specified method, which now defaults to mean. I still need to figure out how to deal with CTD time binning; it has been switched from using median to mean.
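
A minimal sketch of what user-selectable binning can look like (hypothetical variable names and synthetic data, not the perturb implementation):

```matlab
% Hypothetical sketch of user-selectable depth binning.
binFun = @mean;                          % user choice: @mean or @median
depth  = 100 * rand(1000, 1);            % example depths [m]
epsVal = 10.^(-10 + 3*rand(1000, 1));    % example dissipation [W/kg]
edges  = 0:5:100;                        % 5 m bins

bin = discretize(depth, edges);          % bin index for each sample
ok  = ~isnan(bin);                       % drop out-of-range samples
binned = accumarray(bin(ok), epsVal(ok), [numel(edges)-1, 1], binFun, NaN);
```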

I added two papers by Rolf to the Papers directory that address the statistics of epsilon samples. I have implemented a prior version of the mathematics, but I need to update it for Rolf's current thinking.

No, neither JAC_C nor JAC_T is currently despiked. If you want to look at that using VMP data, that would be useful. There are two places this should happen: one in the mat2profile.m code and another in the ctd2binned.m code.
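
For what it's worth, a minimal, hypothetical despiking sketch along these lines (not the perturb code): flag samples that deviate from a running median by more than a few robust standard deviations, then patch them by interpolation.

```matlab
% Hypothetical despiking sketch: flag samples deviating from a running
% median by more than nStd robust standard deviations, then patch them
% by linear interpolation. win is the window length in samples.
function y = despike_simple(x, win, nStd)
    med = movmedian(x, win);
    dev = x - med;
    sigma = 1.4826 * movmedian(abs(dev), win);  % robust scale (MAD)
    bad = abs(dev) > nStd * sigma;
    y = x;
    y(bad) = interp1(find(~bad), x(~bad), find(bad), 'linear', 'extrap');
end
```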

I have checked the updates into this branch.

mousebrains commented 1 year ago

#24 fixed by update to bottom_crash_profiles.m

jessecusack commented 1 year ago

Please 'squash' merge this whenever you are ready @mousebrains. I connected a bunch of issues which should be auto-closed with the merge.

mousebrains commented 1 year ago

Pushed to the main branch.