kasaai / quests

Adventures in research at the intersection of insurance and AI
https://quests.kasa.ai

P&C reserving tutorial in R #7

Open ryanbthomas opened 5 years ago

ryanbthomas commented 5 years ago

I want to get all of my thoughts out here. This probably spans multiple inter-related projects.

Translate typical actuarial reserving workflow to R

Steps:

Also want to highlight:

Extending the typical reserving workflow

Simulated data

Reserving Game Research Project

kevinykuo commented 5 years ago

These are definitely things we want to address, let's put this on the list and we can refine/break out projects as we go along.

A couple of other reserving data simulation sources:

A related project could be a standard benchmarking methodology: think the CIFAR-10 or MNIST of reserving.

ryanbthomas commented 5 years ago

So I think I'm seeing two projects, each with multiple phases:

  1. Reserving in R

    • Phase 1: Reserve Review in R (Data Reconciliation to Report)
    • Phase 2: Mack + Bootstrapping
    • Phase 3: Parodi + Individual claim reserving
    • Phase 4: Stochastic Reserving
  2. Creating Reserving Benchmarking Dataset

    • Phase 1: Evaluate Alternative Claim Simulators
    • Phase 2: Create multiple scenarios and make the data set available for people to validate various methodologies against.

EKtheSage commented 5 years ago

I would love to be involved in this project.

In my previous companies, we used Excel and ResQ to do reserving. It was a pain to use and people were even writing VBA in Word to generate reserving reports.

Actuaries should be able to query the underlying claims data that rolls up to the triangles, pull it into R from an external database, apply the usual triangle methods (chain ladder, BF, Benktander, Mack, ODP bootstrap, GLM, Bayesian MCMC), and summarize the results in an Rmd report with method selections, reserve point estimates, ranges of reserves, actual vs. expected, etc.
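
A minimal sketch of what that pipeline could look like, assuming a DBI/odbc connection and the ChainLadder package (the DSN, table, and column names below are made up):

library(DBI)
library(dplyr)
library(ChainLadder)

# pull claim-level data from an external database (DSN, table, and columns are placeholders)
con <- dbConnect(odbc::odbc(), dsn = "claims_warehouse")
paid_long <- tbl(con, "claim_transactions") %>%
  group_by(accident_year, dev_months) %>%
  summarise(paid = sum(paid_amount)) %>%
  collect()

# reshape to a triangle, cumulate, and run a couple of standard methods
tri <- incr2cum(as.triangle(paid_long, origin = "accident_year",
                            dev = "dev_months", value = "paid"))
mack <- MackChainLadder(tri, est.sigma = "Mack")
boot <- BootChainLadder(tri, R = 999)

summary(mack)  # ultimates, IBNR, and Mack standard errors by accident year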

There needs to be functionality to perform actuarial adjustments, such as removing outliers from the link ratios, selecting a different link ratio than the default (although I wonder whether we should steer away from this kind of actuarial adjustment), and selecting the ultimate for each AY. Should these adjustments be part of a function's arguments, or should they be incorporated into a Shiny app where people can just point and click to make them?

ryanbthomas commented 5 years ago

I think the "actuarial tinkering" part is something that will require a lot of thought to get right. My initial thought is that you are better off specifying thresholds for outliers and and algorithmic adjustments than allowing the user to tinker manually.

EKtheSage commented 5 years ago

I agree with you. Unless there is a very strong reason for picking a manual factor, I feel most of the time we just like to make changes to the default to show that we "contributed" something to the analysis.

kevinykuo commented 5 years ago

I think a first step would be to identify a standard (tidy) schema for "reserving data" that makes sense for most people. This could be something like lob,accident_period,development_period,type,value. Then we define an S3 class that inherits tibble with a nice print method. The reserving algorithms would then expect this data type, and output some object on which we can implement broom methods. E.g.

reserving_data <- raw_data_from_remote_db %>%
  # transformations to get it into the right format %>%
  collect()

# fit the model
mack_model <- mack(reserving_data)

# get LDFs
tidy(mack_model)

We can also provide some helpers to compare different methods, etc.
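
For the class itself, a rough sketch of what the constructor and print method could look like if we build on tibble (the names here are placeholders, not a settled API):

library(tibble)

# hypothetical constructor: validates the schema, then subclasses tibble
new_reserving_data <- function(x) {
  required <- c("lob", "accident_period", "development_period", "type", "value")
  stopifnot(all(required %in% names(x)))
  new_tibble(x, nrow = nrow(x), class = "reserving_data")
}

# a small print method layered on top of the usual tibble printing
print.reserving_data <- function(x, ...) {
  cat("<reserving data:", nrow(x), "records,",
      length(unique(x$lob)), "lines of business>\n")
  NextMethod()
}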

What do we do about folks who want to do old school pick factors and do paid+incurred BF? At that point a shiny app that involves writing to/reading from json/yaml might be needed (or we could support a database backend), but do we want to enable that workflow? On the one hand it may be the only way to get some people on board, on the other hand I think the old way of doing things is on the way out.

ryanbthomas commented 5 years ago

I don't want to "boil the ocean", so aggregated reserving data might be the right place to start. I do think that ultimately we want to start further upstream, with claim snapshot or transactional-level data, as that will allow us to do individual claim reserving and Parodi's methodology.

One thing to keep in mind is that there is often a separation between the cumulative data as of some valuation date and the triangle data. At a prior employer we had an annual review of development factors which we then used in our quarterly reserve analysis, interpolating as appropriate.

This is primarily a statement about the "old school" methods (paid/incurred chain ladder, BF, etc.) -- which I do think we want to enable folks to do. I think this will be important for getting buy-in, and any other methodology is going to be compared against these, so we might as well make it easy for folks. Plus, I think it illustrates the value of the SQL/R/markdown toolchain vs. spreadsheet/black-box tools (i.e. validation, reproducibility, source control).

I think you would want to treat the "old school" methods as each being another model of ultimate loss/count. I'm thinking something like recipes' or parsnip's API. We would then have an algorithm to make the final estimate -- perhaps with a Shiny app to allow users to make an override.
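
As a rough, runnable illustration of the "each method is just another model of ultimate" idea, using ChainLadder's bundled MCLpaid and MCLincurred triangles (the straight average at the end is only a stand-in for a real selection algorithm):

library(ChainLadder)
library(dplyr)

# fit the same classical model to the paid and incurred triangles
ult_paid <- summary(MackChainLadder(MCLpaid, est.sigma = "Mack"))$ByOrigin$Ultimate
ult_incd <- summary(MackChainLadder(MCLincurred, est.sigma = "Mack"))$ByOrigin$Ultimate

# each method contributes a candidate ultimate; a selection step combines them
candidates <- tibble(
  origin        = seq_len(nrow(MCLpaid)),
  paid_mack     = ult_paid,
  incurred_mack = ult_incd
) %>%
  mutate(selected = (paid_mack + incurred_mack) / 2)  # placeholder selection rule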

ryanbthomas commented 5 years ago

The reserving data schema you've mentioned is for a development triangle. I would suggest that lob should be replaced by something more general like segment, since the bucketing of data might be more or less granular than line of business.

One design decision I find myself going back and forth on is whether to treat segment as a field or as metadata (an attribute). There are other data items that fall into the same conceptual bucket for me, e.g. loss limitation, currency, and interval (annual, quarterly, etc.). Basically, these are data items that apply to the entire triangle.
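
To make the trade-off concrete, a small sketch of the two options (the values and names are invented):

library(tibble)

# Option A: triangle-level items stored as columns, repeated on every row
tri_a <- tibble(
  segment            = "commercial_auto",
  currency           = "USD",
  accident_period    = c(2016, 2016, 2017),
  development_period = c(12, 24, 12),
  type               = "paid",
  value              = c(100, 150, 110)
)

# Option B: the same items carried once, as attributes of the object
tri_b <- tibble(
  accident_period    = c(2016, 2016, 2017),
  development_period = c(12, 24, 12),
  type               = "paid",
  value              = c(100, 150, 110)
)
attr(tri_b, "segment")  <- "commercial_auto"
attr(tri_b, "currency") <- "USD"
attr(tri_b, "interval") <- "annual"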

EKtheSage commented 5 years ago

I am not too familiar with assigning these items to attributes, but wouldn't it be easier for data munging if they were all fields in a table that could be used for SQL and data manipulation?

EKtheSage commented 5 years ago

We also have to think about working with irregular-shaped triangles, for example an annual review with data at 6, 18, and 30 months, or extrapolating to a full-year ultimate loss using three quarters or 11 months of data. I think ChainLadder doesn't work well with oddly shaped triangles, so if we want to use the package in Reserving we probably need to add this functionality, either inside ChainLadder or in the tool we build.

PirateGrunt commented 5 years ago

I've taken two stabs at standardizing the structure of reserving data. The second, and probably more robust, one was in the imaginator package. This is basically a tabular representation which stores things in a very granular way. Claims have associated transactions, with slots for payments and reserves. There is a claim table which permits the addition of user-defined columns to support more or less any hierarchy: LOB, segment, territory, etc. I'd emphasize that the intent of imaginator was to simulate; a proposed data structure for granular reserving transactions comes along as a dividend.

The imaginator structure will support any sort of aggregation (though the package does not yet implement this). The role of observation/origin/accident periods and regular evaluation dates is something which is imposed on the data; the occurrences and associated transactions neither require them nor (for the most part) are influenced by them.

My first crack at a reserving data structure was in MRMR. The version on GitHub is quite different from the one on CRAN; the beta version uses S4 classes to represent aggregated data. Since then, I've been more interested in simply working with tabular data using tibbles. I find this winds up being a lot cleaner. There's potentially a path where triangle objects could leverage generic functions with a tibble under the hood.

Not sure if we've explicitly talked about how data is represented in ChainLadder. I like that package, but I'm not wild about its preference for wide data and rownames. That was a large motivation for creating MRMR.

Re: picking your own factors. As an analyst, I don't think this is a good idea unless there's heavy diagnostic support. However, I get that people are going to do it. I'd love a set of features that would generate diagnostics, like residual plots, against an arbitrary set of LDFs. I've done this ad hoc and, man, talk about bias. I reviewed a set of factors where the actuary had clearly engineered the incurred results to match the paid. The residual plot was bananas.

ryanbthomas commented 5 years ago

Re: picking your own factors. As an analyst, I don't think this is a good idea unless there's heavy diagnostic support.

Completely agree with this.

Not sure if we've explicitly talked about how data is represented in ChainLadder

ChainLadder hasn't come up in this discussion, but the triangle data structure is implemented as a matrix IIRC, which I think is a problem.

PirateGrunt commented 5 years ago

ChainLadder uses S3 classes for triangles, but the underlying data representation is (for most of the models) a matrix, and I'm not much of a fan. I get where it came from; it's largely a port of the spreadsheet implementation. A nice on-ramp for some, but not my preference.

kevinykuo commented 5 years ago

I agree that ChainLadder isn't ideal, but we can consider leveraging it for computation under the hood in the beginning if it gets us up and running.

Do we agree that the first step is figuring out the data structures? If so, would anyone like to start drafting a design doc?

PirateGrunt commented 5 years ago

As regards the data structure, I find it hard to improve on a good old data frame. Depending on the feature set, it ought to work. The only question becomes how much typical metadata (length of origin period, frequency of evaluation dates, and so on) you want every object to carry, if any. An analogue would be a grouped tibble: it's a data frame with a couple of extra items.
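
For what it's worth, a grouped tibble already shows the pattern: the grouping rides along as object metadata on top of a plain data frame (toy data below):

library(dplyr)

tri <- tibble(
  segment            = c("auto", "auto", "property"),
  accident_period    = c(2016, 2017, 2016),
  development_period = c(12, 12, 12),
  value              = c(100, 110, 90)
) %>%
  group_by(segment)

group_vars(tri)  # "segment" -- metadata carried by the object
class(tri)       # grouped_df layered on tbl_df / data.frame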

ryanbthomas commented 5 years ago

I can draft a design doc. At work I typically give thought to good names and then specify the actions I want to be able to perform with the data structure. @kevinykuo are you expecting more than this?

kevinykuo commented 5 years ago

@actuarialvoodoo that sounds about right. We should also outline alternative designs -- e.g., say we inherit from tibble: do we include segment/interval/etc. as a column (so that it's repeated for each record) or as an attribute of the object (so that each object can only contain one set of segment characteristics)? It may be worth looking at what tsibble does with time series characteristics to get ideas.
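
For reference, a quick look at how tsibble carries those characteristics, if I recall its interface correctly: the key and index live as object metadata rather than being respecified on every call.

library(tsibble)

ts <- tsibble(
  segment       = c("auto", "auto", "property", "property"),
  accident_year = c(2016L, 2017L, 2016L, 2017L),
  paid          = c(100, 110, 90, 95),
  key           = segment,
  index         = accident_year
)

key_vars(ts)  # "segment"
index(ts)     # accident_year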

kevinykuo commented 5 years ago

Started a repo https://github.com/kasaai/rsvr/issues/1 which should encompass the first two major "projects" outlined in the OP.