Automatically collect and track benchmark results over time

This has come up internally a few times, and we have already outlined the first steps. The goal is to track the performance of our binding over time in an automated and reproducible way.

As a reminder: we already have a benchmark project that tests the performance of our binding in key areas compared to typed and untyped GDScript. It already outputs a usable JSON file with key metrics and raw data.

While this is useful, it was never a clear indication of our performance: it was never run on the same machine over time, and it had to be executed manually.

Hence we decided on the following, which this issue should track:

- Set up a self-hosted Windows 11 GitHub runner. We cannot use a runner provided by GitHub, as those do not have consistent performance characteristics (for example, different CPU models with different IPC).
- Set up a GitHub workflow which runs the benchmarks after changes.
- Aggregate the resulting data automatically in spreadsheets and visualize the performance over time (here we talked about Firebase Cloud Functions and Google Sheets; more on that later).
- Make these metrics publicly available for full transparency.

The following has already been done:

- I set up a self-hosted GitHub runner at home on dedicated hardware. The sole purpose of this runner is to run these benchmarks and nothing else.
- A test spreadsheet is set up (note: the current data in this sheet is NOT representative, it is just TEST DATA!), and data is automatically aggregated with the following Apps Script (a first draft):
```
const spreadsheetId = '<sheet_id>'

function doPost(e) {
  var jsonData = JSON.parse(e.postData.contents);
  var timestamp = new Date().toLocaleString();
  var spreadsheet = SpreadsheetApp.openById(spreadsheetId);

  // Removing all existing charts in the Dashboard
  var dashboard = spreadsheet.getSheetByName("Dashboard");
  if(dashboard) {
    var charts = dashboard.getCharts();
    for(var i=0; i<charts.length; i++) {
      dashboard.removeChart(charts[i]);
    }
  }

  for(var benchmark in jsonData['data']) {
    for(var language in jsonData['data'][benchmark]) {
      // Naming the sheet as 'benchmark|language'
      var sheetName = benchmark+"|"+language;
      var sheet = spreadsheet.getSheetByName(sheetName);

      // If the sheet does not exist, create a new one
      if(!sheet) {
        sheet = spreadsheet.insertSheet(sheetName);
        var header = ['Timestamp', 'avg', 'min', 'max', 'median', 'p05', 'p95'];
        sheet.appendRow(header);
      }

      // Convert JSON data to row data
      var row = [timestamp];
      row.push(jsonData['data'][benchmark][language]['avg']);
      row.push(jsonData['data'][benchmark][language]['min']);
      row.push(jsonData['data'][benchmark][language]['max']);
      row.push(jsonData['data'][benchmark][language]['median']);
      row.push(jsonData['data'][benchmark][language]['p05']);
      row.push(jsonData['data'][benchmark][language]['p95']);

      // Append language data to sheet
      sheet.appendRow(row);
    }
  }

  // Create benchmark comparison line graphs
  var allSheets = spreadsheet.getSheets();
  if(!dashboard) {
    dashboard = spreadsheet.insertSheet("Dashboard");
  }

  var benchmarkCharts = {};
  var benchmarkChartsSeries = {};

  for(var i=0; i<allSheets.length; i++) {
    var sheetName = allSheets[i].getName();
    if(sheetName != "Dashboard"){
      var benchmarkName = sheetName.split("|")[0];
      var languageName = sheetName.split("|")[1];
      var lastRow = allSheets[i].getLastRow();

      // If it's a new benchmark, initialize a new chart builder
      if(!(benchmarkName in benchmarkCharts)) {
        benchmarkChartsSeries[benchmarkName] = [{labelInLegend: languageName}]
        benchmarkCharts[benchmarkName] = dashboard.newChart()
          .asLineChart()
          .setOption('title', 'Performance Trend for ' + benchmarkName)
          .setOption('hAxis.title', 'Time')
          .setOption('vAxis.title', 'Average Score');
      } else {
        benchmarkChartsSeries[benchmarkName].push({labelInLegend: languageName})
      }

      // Adding the range from current sheet excluding header and adding language as a series
      benchmarkCharts[benchmarkName].addRange(allSheets[i].getRange(2, 1, lastRow - 1, 2));
    }
  }

  // Build and insert all benchmark charts
  var position = 1;
  for(var benchmark in benchmarkCharts) {
    // Update position for the new chart.
    benchmarkCharts[benchmark]
      .setPosition(position * 20, 1, 0, 0)
      .setOption('series', benchmarkChartsSeries[benchmark]);
    var chart = benchmarkCharts[benchmark].build();
    dashboard.insertChart(chart);
    position++; // Update position for the next chart
  }
}
```
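
The payload shape that `doPost` expects can be read off the draft script: a top-level `data` object keyed by benchmark name, then by language, with the six metrics per entry. Below is a minimal sketch of that shape and of the row conversion as a pure function, so it can be checked outside of Apps Script. `toRows` is a hypothetical helper that is not part of the current script, and the benchmark name, language names, and numbers are made up for illustration.

```javascript
// Metric columns in the order the draft script writes them after the timestamp.
const METRICS = ['avg', 'min', 'max', 'median', 'p05', 'p95'];

// Hypothetical helper: flattens a payload into one entry per 'benchmark|language' sheet.
function toRows(payload, timestamp) {
  const rows = [];
  for (const [benchmark, languages] of Object.entries(payload.data)) {
    for (const [language, metrics] of Object.entries(languages)) {
      rows.push({
        sheetName: benchmark + '|' + language,
        row: [timestamp, ...METRICS.map((key) => metrics[key])],
      });
    }
  }
  return rows;
}

// Example payload; names and values are illustrative, not real results.
const examplePayload = {
  data: {
    exampleBenchmark: {
      kotlin: { avg: 10, min: 8, max: 14, median: 10, p05: 8.5, p95: 13 },
      gdscript: { avg: 20, min: 17, max: 25, median: 19, p05: 17.5, p95: 24 },
    },
  },
};

const exampleRows = toRows(examplePayload, '2024-01-01 12:00:00');
```

Keeping the conversion separate from the `SpreadsheetApp` calls would also make it easier to move to another backend later without touching the payload handling.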



The following needs to be done:
- [ ] Give RDP access to other maintainers for maintenance
- [ ] Add the self-hosted runner to the utopia-rise organisation
- [ ] Set up a workflow to run the benchmarks on changes
- [ ] Set up the final spreadsheet and Apps Script
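
For the workflow item, the last step would need to send the benchmark JSON to the deployed web app, since `doPost` reads the body from `e.postData.contents`. A rough sketch of such an upload step, assuming Node.js 18+ (global `fetch`) is available on the runner. `buildUploadRequest`, `postResults`, and the web app URL passed in are hypothetical; nothing like this exists in the repository yet.

```javascript
// Hypothetical helper: builds the request options for the upload.
// Parsing first makes the step fail early if the benchmark output is not valid JSON.
function buildUploadRequest(jsonText) {
  JSON.parse(jsonText); // throws a SyntaxError on invalid JSON
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: jsonText,
  };
}

// Hypothetical helper: posts the benchmark JSON to the web app URL.
// Assumes a global fetch (Node.js 18+); the URL has to be supplied by the workflow.
async function postResults(url, jsonText) {
  const response = await fetch(url, buildUploadRequest(jsonText));
  if (!response.ok) {
    throw new Error('Upload failed with HTTP status ' + response.status);
  }
  return response.status;
}
```

The workflow would call `postResults` with the contents of the benchmark output file and should treat a rejected promise as a failed step, so that a broken upload does not go unnoticed.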

The following points can be improved at a later stage:
- Possibly migrate from Apps Script and Google Sheets to Firebase Cloud Functions and Firestore (or the Supabase equivalents)
- Run the benchmarks as part of the PR pipeline to catch possible performance problems during the PR process
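
For the PR pipeline idea, one possible approach is to compare a PR run against a baseline run from the same runner and flag entries whose median got noticeably worse. A rough sketch, assuming the same `data[benchmark][language]` structure the draft script reads, and assuming that lower values are better. This issue does not state whether the metrics are times or scores, so the comparison direction may need to be flipped. `findRegressions` and the 10% default threshold are hypothetical choices.

```javascript
// Hypothetical helper: returns the benchmark/language pairs whose metric grew
// by more than `threshold` (relative) compared to the baseline run.
// Assumption: lower is better. Flip the comparison if the values are scores.
function findRegressions(baseline, candidate, threshold = 0.10, metric = 'median') {
  const regressions = [];
  for (const [benchmark, languages] of Object.entries(candidate.data)) {
    for (const [language, metrics] of Object.entries(languages)) {
      const base = baseline.data[benchmark]?.[language]?.[metric];
      // Skip entries without a usable baseline (new benchmark, missing metric, zero).
      if (typeof base !== 'number' || base === 0) continue;
      const change = (metrics[metric] - base) / base;
      if (change > threshold) {
        regressions.push({ benchmark, language, change });
      }
    }
  }
  return regressions;
}
```

A fixed threshold only makes sense because the runs would come from the same dedicated machine; the right value would have to be tuned against the run-to-run noise actually observed on that runner.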
