radical-collaboration / facts

Repository for the Framework for Assessing Changes To Sea-level (FACTS)
MIT License

Update ESL preprocessing to be able to read in GESLA3 data #168

Open Timh37 opened 1 year ago

Timh37 commented 1 year ago

This may be a good starting point.

Do we want separate modules for GESLA2 and GESLA3, or a single script that can handle both?

bobkopp commented 1 year ago

Additional notes from Tim:

One module could compute amplification factors like the current module does, but read in GESLA3 instead of GESLA2 data and work with our automatic threshold selection. That could be extended with a second module to compute timing.

Agree the current summary metrics could be part of a postprocessing stage. However, the computation of the timing metric for the NCC paper purposely takes a different input, namely return heights as a function of input frequency instead of return frequencies as a function of input heights. So, I believe a separate postprocessing module may be needed. I’d suggest, though, that we first focus on the preprocessing and fitting stage, i.e., 1) being able to read in GESLA3 data, and 2) using GPD fits based on automatic threshold selection, as in the NCC paper. The second may be as easy as simply reading in GPD parameter samples from the NCC paper data which are publicly available.
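For orientation, here is a minimal sketch of the fitting stage described above, using `scipy.stats.genpareto`. This is not the NCC paper's method: the fixed-quantile threshold is only a placeholder for the automatic selection procedure, and the daily-maxima series is synthetic.

```python
# Hedged sketch: fit a GPD to threshold exceedances of a daily sea-level
# series. The threshold choice (a fixed quantile) is a placeholder for
# the automatic selection method, and the data are synthetic.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
years = 20
levels = rng.gumbel(loc=1.0, scale=0.15, size=years * 365)  # synthetic daily maxima (m)

threshold = np.quantile(levels, 0.99)         # placeholder threshold choice
excesses = levels[levels > threshold] - threshold

# Fit the GPD to the excesses; location is fixed at 0 by construction.
shape, _, scale = genpareto.fit(excesses, floc=0)

lam = len(excesses) / years                   # mean exceedances per year

def return_level(f):
    """Return height (m) exceeded on average f times per year."""
    return threshold + genpareto.ppf(1 - f / lam, shape, loc=0, scale=scale)
```

The same fitted parameters could then be swapped out for GPD parameter samples read directly from the publicly available NCC paper data.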

bobkopp commented 1 year ago

@Timh37 I would say that if the code is identical except for the input data file, I would make the input data file a parameter that can be specified in pipeline.yml. FACTS now allows you to have multiple pipeline.yml files with a module (with the one labeled "pipeline.yml" serving as the default), so we could provide pipeline.yml files for both versions. (Whether we want to carry around both versions of the data set is a separate issue, I might need convincing.)

I would make the automatic threshold selection a (default?) option also, since it's nice to have the same code with both automatic and manual selections.
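A hypothetical pipeline.yml fragment along those lines. The key names here are illustrative only, not the actual FACTS configuration schema; a second pipeline.yml could ship with the GESLA2 file substituted.

```yaml
# Illustrative fragment -- key names are hypothetical, not the FACTS schema.
preprocess:
  options:
    esl_data_file: "GESLA3_daily_maxima.zip"   # swap in the GESLA2 file in an alternate pipeline.yml
    threshold_method: "automatic"              # default; "manual" also supported
    manual_threshold_pct: 99.7                 # used only when threshold_method is "manual"
```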

bobkopp commented 1 year ago

With respect to the postprocessing, I don't have a good sense of how large the pickle file used to transfer ESL samples between the stages would be. If it's unmanageably large, we might need to do summary metric computations in the project stage. If it's not, I don't see why any metrics calculated across different axes would need to be broken out. Or are you saying we compute the metrics from the quantiles of the distribution rather than from the samples themselves? In that case, I would think the projection step would need to output quantiles along both dimensions.
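A back-of-envelope estimate of that transfer size: a float64 array costs 8 bytes per entry, so full samples versus quantiles differ by orders of magnitude. The dimensions below are illustrative, not the actual FACTS array shapes.

```python
# Illustrative inter-stage transfer sizes (dimensions are made up, not
# the actual FACTS array shapes): 8 bytes per float64 entry.
n_sites, n_samples, n_freqs = 1000, 20000, 50
size_gb = n_sites * n_samples * n_freqs * 8 / 1e9
print(f"full samples: ~{size_gb:.0f} GB")

n_quantiles = 99
size_mb = n_sites * n_quantiles * n_freqs * 8 / 1e6
print(f"quantiles only: ~{size_mb:.1f} MB")
```

If the full-sample file lands in the multi-gigabyte range while the quantile file is tens of megabytes, that would argue for outputting quantiles along both dimensions.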

If the output file from the projection step isn't too large, and the summary metrics are computationally light (as they should be if they are calculated from distributional summary statistics), then calculating summary metrics actually seems like something that belongs in a Jupyter notebook rather than a postprocessing stage. It's always bothered me that specific return levels were coded into the module itself -- you should be able to generate AFs, etc., for any desired return level without rerunning the module.
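The notebook-style computation suggested above could look roughly like this: interpolate a stored return curve to get an amplification factor for any return period, rather than hard-coding return levels in the module. The curve here is synthetic and the function names are hypothetical, purely for illustration.

```python
# Hedged sketch: derive an amplification factor for an arbitrary return
# period from a stored return curve plus a sea-level-rise increment.
# The return curve below is synthetic (log-linear), for illustration only.
import numpy as np

heights = np.linspace(0.5, 3.0, 200)                      # return heights (m)
freqs = 10.0 ** np.interp(heights, [0.5, 3.0], [1, -4])   # exceedance freq (1/yr)

def amplification_factor(return_period, slr):
    """Frequency multiplier for the historical `return_period`-yr event
    after `slr` metres of sea-level rise (hypothetical helper)."""
    # Invert the curve: height of the historical return_period-yr event.
    z = np.interp(1.0 / return_period, freqs[::-1], heights[::-1])
    # Under slr of rise, height z is exceeded as often as z - slr was.
    log10_f_future = np.interp(z - slr, heights, np.log10(freqs))
    return 10.0 ** log10_f_future * return_period
```

Because the curve is only queried by interpolation, any return period can be evaluated after the fact without rerunning the projection module.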

bobkopp commented 1 year ago

Consider addressing https://github.com/radical-collaboration/facts/issues/188 in the set of updates to support GESLA3.

bobkopp commented 1 year ago

To be addressed by @Timh37 in the context of PROTECT.

bobkopp commented 8 months ago

@Timh37 is working on code here: https://github.com/Timh37/projectESL/tree/main -- needs to be turned into a FACTS module.

bobkopp commented 2 months ago

@AlexReedy Is this all working in the development branch now?