Create dedicated subsampling routine to be used for all subsampling where needed

ssadjina / FatTailedTools

Various tools and helper functions for the analysis of fat-tailed data

BSD 3-Clause "New" or "Revised" License


Create dedicated subsampling routine to be used for all subsampling where needed #5

Open ssadjina opened 1 year ago

ssadjina commented 1 year ago

The subsample routine in alpha.fit_alpha_and_scale_linear_subsampling() has been designed to include 2 kinds of general uncertainty in estimating parameters:

Uncertainty and (to some extent) bias in the available data set (using bootstrapping).
Uncertainty with respect to which origin or time shift to use when calculating log returns over periods spanning several time units, such as 7-day returns based on 1-day data (by sampling uniformly at random from all possible time shifts).

Handling these general kinds of uncertainty in a dedicated function allows us to reuse it in all other functions in a consistent way.

ssadjina commented 1 year ago

The function would look something like this:

import numpy as np
# 'returns' refers to the project's own returns module, as used elsewhere in FatTailedTools

def subsample(data, func, n_subsamples, period_days, frac):

    # Set up storing results
    results = []

    for i in range(n_subsamples):

        # Randomly select a time shift/origin (uniform over 0 .. period_days - 1)
        time_shift = np.random.randint(period_days)

        # Calculate the log returns over 'period_days' using the shift 'time_shift'
        series = returns.get_log_returns(data, periods='{}d'.format(period_days), offset=time_shift).dropna()

        # Use bootstrapping to include the uncertainty with respect to the data
        # (named 'sample' to avoid shadowing the function name)
        sample = series.sample(frac=frac, replace=True)

        # Perform desired calculation
        result = func(sample)

        # Store results
        results.append(result)

    return results

In that case, a function 'func' is passed in and called on each subsample to calculate a result. This could, for example, be a linear fit on the log-log survival function to estimate the tail exponent.
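As a rough illustration of what such a 'func' might look like, here is a minimal sketch that estimates a tail exponent with a least-squares line through the log-log empirical survival function. The name fit_tail_exponent and the tail_fraction cutoff are assumptions made for this example and are not part of FatTailedTools.

```python
import numpy as np

def fit_tail_exponent(sample, tail_fraction=0.1):
    """Hypothetical 'func': slope of a linear fit on the log-log survival function."""
    # Work with sorted absolute values; zeros cannot be log-transformed
    x = np.sort(np.abs(np.asarray(sample, dtype=float)))
    x = x[x > 0]
    n = len(x)

    # Empirical survival function at each sorted point (never exactly zero)
    survival = 1.0 - np.arange(1, n + 1) / (n + 1)

    # Restrict the fit to the largest values (the tail); the cutoff is an assumption
    k = max(int(n * tail_fraction), 2)
    log_x = np.log(x[-k:])
    log_s = np.log(survival[-k:])

    # For a power-law tail, log S(x) is linear in log x with slope -alpha
    slope, _intercept = np.polyfit(log_x, log_s, 1)
    return -slope

# Synthetic check on classical Pareto data with tail exponent 3
rng = np.random.default_rng(0)
sample = rng.pareto(3.0, size=20_000) + 1.0
alpha_hat = fit_tail_exponent(sample)
```

Because the helper only needs an array-like input, it can be passed as 'func' without knowing anything about time shifts or bootstrapping.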

ssadjina commented 1 year ago

Because it is not clear how this is best done, here are a few alternatives:

A routine that returns a single subsample realization. In that case the main for loop is outside the function. The advantage would be that we don't need to drag the function func into the subsampling function. The downside is that the main loop is outside, so we may give up some control over the subsampling itself. The previous draft may also be better structured and more modular (because we define stand-alone functions func that perform some calculation on a series independent of any sampling and that can then easily be passed into the subsampling routine).
A routine that creates all the subsamples and stores them in a DataFrame. The function func can then be applied to it.
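To make the two alternatives concrete, here is a minimal sketch of both, covering only the bootstrapping part (no time shifts). The names draw_subsample and build_subsample_frame, and their parameters, are hypothetical and chosen just for this example.

```python
import numpy as np
import pandas as pd

def draw_subsample(series, frac=0.9, rng=None):
    """Alternative 1 (hypothetical): return one bootstrap realization; the caller owns the loop."""
    rng = np.random.default_rng() if rng is None else rng
    # Derive an integer seed so this also works with older pandas versions
    seed = int(rng.integers(2**32))
    return series.dropna().sample(frac=frac, replace=True, random_state=seed)

def build_subsample_frame(series, n_subsamples=100, frac=0.9, seed=0):
    """Alternative 2 (hypothetical): one bootstrap realization per DataFrame column."""
    rng = np.random.default_rng(seed)
    clean = series.dropna().to_numpy()
    size = int(round(frac * len(clean)))
    draws = rng.choice(clean, size=(size, n_subsamples), replace=True)
    return pd.DataFrame(draws)

# Toy data: 100 valid values plus one missing value
series = pd.Series(np.random.default_rng(1).standard_normal(101))
series.iloc[0] = np.nan

# Alternative 1: the loop (here a single draw) lives outside the routine
single = draw_subsample(series, frac=0.5, rng=np.random.default_rng(2))

# Alternative 2: build all subsamples first, then apply func column by column
frame = build_subsample_frame(series, n_subsamples=20, frac=0.9)
means = frame.apply(lambda column: column.mean())
```

The DataFrame variant keeps func out of the sampling code as well, at the cost of holding all subsamples in memory at once.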

ssadjina commented 1 year ago

Current draft:

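def subsample(
        data,
        func,
        func_kws           = None,
        prep_func          = None,
        prep_func_kws      = None,
        n_subsamples       = 300,
        bootstrap_fraction = 0.9
):

    # Avoid mutable default arguments: create fresh dicts on each call
    func_kws      = {} if func_kws is None else func_kws
    prep_func_kws = {} if prep_func_kws is None else prep_func_kws

    # Set up storing results (keyed by subsample index)
    results = {}

    # Subsample loop
    for i in range(n_subsamples):

        # Prepare the data before the subsampling
        # (called on every iteration, so a randomized prep_func yields a new realization each time)
        if prep_func is not None:
            series = prep_func(data, **prep_func_kws)
        else:
            series = data

        # Use bootstrapping to include the uncertainty with respect to the data
        # (named 'sample' to avoid shadowing the function name)
        sample = series.dropna().sample(frac=bootstrap_fraction, replace=True)

        # Perform desired calculation
        result = func(sample, **func_kws)

        # Store results
        results[i] = result

    return results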
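A possible way to use this draft is sketched below. It includes a compact copy of the routine so that it runs on its own. The prep_func random_shift_log_returns is a hypothetical stand-in for the project's log-return helper, and the price series is synthetic.

```python
import numpy as np
import pandas as pd

def subsample(data, func, func_kws=None, prep_func=None, prep_func_kws=None,
              n_subsamples=300, bootstrap_fraction=0.9):
    """Compact copy of the draft above so that this example is self-contained."""
    func_kws = func_kws or {}
    prep_func_kws = prep_func_kws or {}
    results = {}
    for i in range(n_subsamples):
        series = prep_func(data, **prep_func_kws) if prep_func is not None else data
        draw = series.dropna().sample(frac=bootstrap_fraction, replace=True)
        results[i] = func(draw, **func_kws)
    return results

def random_shift_log_returns(prices, period_days=7):
    """Hypothetical prep_func: log returns over 'period_days' from a random origin."""
    shift = np.random.randint(period_days)
    # Take every 'period_days'-th price, starting at the random shift
    shifted = prices.iloc[shift::period_days]
    return np.log(shifted).diff()

# Synthetic daily prices (geometric random walk), for illustration only
np.random.seed(0)
dates = pd.date_range("2020-01-01", periods=1000, freq="D")
prices = pd.Series(100 * np.exp(np.cumsum(0.01 * np.random.standard_normal(1000))), index=dates)

# Example statistic: a high quantile of absolute 7-day log returns
results = subsample(
    prices,
    func=lambda s, q: s.abs().quantile(q),
    func_kws={"q": 0.95},
    prep_func=random_shift_log_returns,
    prep_func_kws={"period_days": 7},
    n_subsamples=50,
)
estimates = pd.Series(results)
```

Since prep_func is called inside the loop, every iteration draws a new time shift before bootstrapping, so the spread of estimates reflects both kinds of uncertainty.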