kujaku11 / mth5

Exchangeable and archivable format for magnetotelluric time series to better serve the community through FAIR principles.
https://mth5.readthedocs.io/en/latest/index.html
MIT License

Compute Sample Rate - Expensive Median Computation #241

Closed dequiroga closed 1 month ago

dequiroga commented 1 month ago

`ChannelTS.compute_sample_rate` is noticeably slow, which becomes significant when it is called repeatedly on large datasets.

Occasionally, when the sample rate is not defined, MTH5 computes it from the time array using the median of the time differences. Calling this functionality repeatedly (for large datasets) results in a significant time inefficiency.

I have been getting the same results (but substantially faster) using the mode, e.g.:

```python
# Take the mode?
best_dt, count = scipy.stats.mode(dt_array)
```

Another idea would be some sort of weighted average, but this does not seem as robust, e.g.:

```python
# Weighted average of the unique dt occurrences (weighted by counts)?
n_samples = len(dt_array)
uniques, counts = np.unique(dt_array, return_counts=True)
weights = counts / n_samples
best_dt = np.sum(uniques * weights)
```
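For reference, here is a self-contained sketch of the two estimates side by side on a synthetic time axis with occasional clock jitter (the variable names and the synthetic data are illustrative only, not the MTH5 API). On clean data with a few outlier timestamps, both the median and the mode recover the nominal sample rate:

```python
import numpy as np
from scipy import stats

# Synthetic 1 Hz time axis (seconds) with a few jittered timestamps.
rng = np.random.default_rng(0)
t = np.arange(0.0, 10_000.0, 1.0)
t[rng.choice(t.size, 50, replace=False)] += 1e-4  # small clock jitter

dt_array = np.diff(t)

# Median-based estimate (the approach described in the issue).
median_dt = np.median(dt_array)

# Mode-based estimate (the proposed alternative): the most common
# time difference.  keepdims=False requires scipy >= 1.9.
mode_dt = stats.mode(dt_array, keepdims=False).mode

print(1.0 / median_dt, 1.0 / mode_dt)  # both ≈ 1.0 Hz
```

The mode is attractive here because, for regularly sampled data, the overwhelming majority of time differences are exactly the nominal dt, so the most frequent value is a robust estimate and avoids the full sort that a median implies.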

kujaku11 commented 1 month ago

@dequiroga This is great, we've been dealing with this problem for a while now, but hadn't worked with much big data. I like your method of using scipy.stats; that is probably much faster than np.median.

Could you draft a pull request?

dequiroga commented 1 month ago

Will do. I have a branch with the fix, but I can't seem to push without permission! I am happy to push and draft a PR if you give me access.

kkappler commented 1 month ago

@kujaku11 can you please add permission for @dequiroga to PR into mth5?

dequiroga commented 1 month ago

Thanks @kujaku11, see the draft PR #246