mdaeron / D47crunch

Python library for processing and standardizing carbonate clumped-isotope analyses, from low-level data out of a dual-inlet mass spectrometer to final, “absolute” Δ47, Δ48, and Δ49 values with fully propagated analytical error estimates.

`standardize` is slow with large datasets #8

Closed by japhir 2 years ago

japhir commented 2 years ago

This seems to work! The standardize step is pretty slow (several minutes for 552 samples and 1959 anchors), but I have no idea how fast it should be, and computing session ETFs with fancy maths is also very slow in my R version. It didn't throw an immediate error about the wrong Sample names this time and seems to have worked out okay!

Originally posted by @japhir in https://github.com/mdaeron/D47crunch/issues/6#issuecomment-916956404

I'm now running it on my full dataset of samples and standards, which consists of about 19000 aliquots in total. It's been running for half an hour or so now, and I have no idea whether I should let it keep chugging along or cancel it and rerun it only on the subsets of the data I actually need this for.

Perhaps we could improve this by implementing parallelization? Or by showing a progress bar, so that users know how long the wait is likely to be?
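
For reference, the workflow I'm timing is the basic one from the D47crunch documentation (the file name below is just a placeholder); wrapping `standardize()` in a timer makes it easy to see how long that step alone takes:

```python
import time

import D47crunch

# Quick-start workflow from the D47crunch docs; 'rawdata.csv' is a placeholder.
mydata = D47crunch.D47data()
mydata.read('rawdata.csv')

mydata.wg()      # working-gas composition for each session
mydata.crunch()  # raw delta values for each analysis

# Time only the standardization step, which performs the pooled model fit.
t0 = time.time()
mydata.standardize()
print(f'standardize() took {time.time() - t0:.1f} s')
```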

mdaeron commented 2 years ago

That is probably an lmfit issue. I don't know for sure whether parallelization is an option for the Trust Region Reflective method used here, but I doubt it (IIRC it's an iterative process). The same goes for a progress bar, because the algorithm doesn't know in advance how long it will keep looking for a local minimum.

Remember that the difficulty of fitting the model increases dramatically with the number of parameters to fit, not so much with the number of analyses. I usually process datasets with 3-6 sessions and 20-30 unknown samples; that is quasi-instantaneous. When processing the more demanding dataset of Anderson et al. (2021), it took something like 10-20 seconds because of the large number of sessions (each batch of a few tens of replicates was treated as a new session). Is this what you're doing? It might also be possible to relax the convergence conditions (the ftol and xtol parameters).
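
For illustration only (this is not the exact call inside D47crunch, just a generic lmfit fit showing where ftol and xtol go when the Trust Region Reflective backend is used; the toy model and data are made up):

```python
import numpy as np
import lmfit

# Toy linear model, purely to show how convergence tolerances are passed.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 * x + 1.5 + rng.normal(0, 0.2, x.size)

params = lmfit.Parameters()
params.add('a', value=1.0)
params.add('b', value=0.0)

def residual(p):
    return p['a'] * x + p['b'] - y

# method='least_squares' wraps scipy.optimize.least_squares (TRF by default);
# extra keywords such as ftol and xtol are forwarded to it. Larger tolerances
# mean looser convergence and usually fewer iterations, at some cost in the
# precision of the fitted parameters (scipy's defaults are 1e-8).
result = lmfit.minimize(residual, params, method='least_squares',
                        ftol=1e-6, xtol=1e-6)
print(lmfit.fit_report(result))
```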

My personal approach is not to process a year's worth of data at once, but rather to process together only the sessions related to a given project.

japhir commented 2 years ago

Thanks! Yep, I didn't realize the slowness came from all the unique Sample names. After redefining a sample to mean something like "a period of time for which we calculate one temperature", it now runs within several seconds, depending on the dataset.
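
Something along these lines (a hypothetical sketch; file, column, and anchor names are placeholders, not my actual data; the idea is just to collapse many unknown Sample names into far fewer groups before handing the file to D47crunch):

```python
import pandas as pd

df = pd.read_csv('rawdata.csv')  # placeholder file name

# Keep anchor names untouched so they are still recognized as anchors.
anchors = {'ETH-1', 'ETH-2', 'ETH-3', 'ETH-4'}
is_unknown = ~df['Sample'].isin(anchors)

# Example grouping rule: one Sample label per measurement month, assuming a
# 'Timestamp' column exists; adapt this to whatever defines your time windows.
months = pd.to_datetime(df['Timestamp']).dt.to_period('M').astype(str)
df.loc[is_unknown, 'Sample'] = 'window_' + months[is_unknown]

df.to_csv('rawdata_grouped.csv', index=False)  # feed this file to D47crunch
```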

FTR:

  • "sample" = an amount of presumably homogeneous carbonate material. Each sample should be uniquely identified by a sample name (field Sample in the csv file).
  • "analysis" or "replicate" = corresponds to a single acid reaction followed by purification of the evolved CO2 and by a series of dual-inlet IRMS measurements. Each analysis is identified by a unique identifier (field UID in the csv file, but if it's missing a default series of UIDs will be generated).

Originally posted by @mdaeron in https://github.com/mdaeron/D47crunch/issues/7#issuecomment-917159084
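
To make the distinction concrete, here is a minimal illustration of how those two definitions map onto the Sample and UID columns (names and values are made up; the other required columns of the raw-data file are omitted, see the D47crunch documentation for the full format):

```python
import pandas as pd

# Each row is one analysis/replicate (its own UID); several analyses share a
# Sample name, i.e. they are replicates of the same carbonate material.
analyses = pd.DataFrame({
    'UID':    ['A01', 'A02', 'A03', 'A04', 'A05'],
    'Sample': ['ETH-1', 'ETH-1', 'MYSAMPLE-1', 'MYSAMPLE-1', 'MYSAMPLE-1'],
})
print(analyses.groupby('Sample')['UID'].count())  # replicates per sample
```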