kurtosis-tech / kurtosis-package-indexer

Crawls Github for Kurtosis packages

Kurtosis package indexer

Kurtosis package indexer is a backend service that searches GitHub for Kurtosis packages and stores them in memory. It is currently consumed by the Kurtosis Frontend to power the Kurtosis Packages Catalog.

Implementation details

The service periodically runs a job that searches GitHub for all Kurtosis packages that currently exist there.

GitHub authentication

Searches on GitHub must be authenticated. The Kurtosis Package Indexer can authenticate itself in two ways: it first reads the GITHUB_USER_TOKEN environment variable and, if that is empty, falls back to the S3 bucket option.

Using a GitHub token via an environment variable

This is the simplest option. The indexer expects a valid GitHub token stored in the environment variable GITHUB_USER_TOKEN.

Using a file stored inside an S3 bucket

The indexer can also get the GitHub token from a file stored inside an S3 bucket. The file must be named github-user-token.txt and must contain only the GitHub token, as plain text.
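A minimal Go sketch of the two file conventions above: building the object key for github-user-token.txt, with an optional folder prefix, and checking that the file content is a single plain-text token. The helper names are hypothetical, and treating the folder as a simple key prefix is an assumption, not something this README confirms.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

const tokenFileName = "github-user-token.txt"

// tokenObjectKey builds the object key of the token file. An empty folder
// means the file sits at the root of the bucket. Hypothetical helper.
func tokenObjectKey(folder string) string {
	folder = strings.Trim(folder, "/")
	if folder == "" {
		return tokenFileName
	}
	return path.Join(folder, tokenFileName)
}

// parseTokenFile extracts the token from the file content and rejects
// files that are empty or contain anything besides a single token.
func parseTokenFile(content []byte) (string, error) {
	token := strings.TrimSpace(string(content))
	if token == "" {
		return "", errors.New("token file is empty")
	}
	if strings.ContainsAny(token, " \t\r\n") {
		return "", errors.New("token file must contain only the token")
	}
	return token, nil
}

func main() {
	fmt.Println(tokenObjectKey(""))
	fmt.Println(tokenObjectKey("some/folder"))

	token, err := parseTokenFile([]byte("placeholder-token\n"))
	fmt.Println(len(token), err)
}
```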

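To access this file, the indexer requires AWS credentials and bucket details (access key ID, secret access key, bucket region, bucket name and an optional folder) to be set through environment variables. The matching package arguments are listed under aws_env in the "Running as a Kurtosis Package" section.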

Metrics authentication

The indexer consumes some Kurtosis public metrics (only package run counts for now) in order to provide this information to indexer clients such as the package catalog.

Snowflake is currently the Kurtosis metrics storage, and the indexer uses the Go Snowflake client to execute queries on it.

A user must be validated before any query can be executed on this storage. A dedicated service account and role were created for this purpose; their details are available in the Kurtosis Snowflake account.

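The indexer requires the Snowflake connection details (account identifier, database, user, password, role and warehouse) to be set through environment variables. The matching package arguments are listed under snowflake_env in the "Running as a Kurtosis Package" section.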

Data persistence

By default, the Kurtosis package information is stored in memory. Every time the indexer is restarted, it re-runs the GitHub searches to fetch the latest information about the packages on GitHub.
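To make the in-memory behaviour concrete, here is a small Go sketch of a store that is replaced wholesale after each crawl and starts empty on every restart. The types PackageInfo and memoryStore and their fields are hypothetical and are not taken from the indexer's code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// PackageInfo is a hypothetical, simplified record for one package.
type PackageInfo struct {
	Name     string
	RunCount int
}

// memoryStore keeps the latest crawl result in memory only, so its
// content is lost whenever the process stops.
type memoryStore struct {
	mu       sync.RWMutex
	packages map[string]PackageInfo
}

func newMemoryStore() *memoryStore {
	return &memoryStore{packages: map[string]PackageInfo{}}
}

// replaceAll swaps in the result of a fresh crawl, dropping packages
// that no longer appear in it.
func (s *memoryStore) replaceAll(found []PackageInfo) {
	next := make(map[string]PackageInfo, len(found))
	for _, p := range found {
		next[p.Name] = p
	}
	s.mu.Lock()
	s.packages = next
	s.mu.Unlock()
}

// list returns a snapshot of the stored packages sorted by name.
func (s *memoryStore) list() []PackageInfo {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]PackageInfo, 0, len(s.packages))
	for _, p := range s.packages {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	store := newMemoryStore()
	store.replaceAll([]PackageInfo{
		{Name: "example.com/org/package-b", RunCount: 3},
		{Name: "example.com/org/package-a", RunCount: 7},
	})
	for _, p := range store.list() {
		fmt.Println(p.Name, p.RunCount)
	}
}
```

Building the new map before taking the write lock keeps readers blocked only for the duration of a pointer swap.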

~~There's also the option of persisting the data to a bolt key value store, so that services can be restarted keeping the data intact. To use it, the environment variable BOLT_DATABASE_FILE_PATH can be set to point to a file on disk that bolt will use to store the data. If the indexer is being run in a container, a persistent volume should be used to fully benefit from this feature.~~

~~Ultimately, to make the indexer fully stateless, data can also be stored in an external ETCD key value store. Once the ETCD cluster is up and running, the indexer can be started with the environment variable ETCD_DATABASE_URLS set to the list of ETCD nodes URLs separated by a comma: http://etcd.node.1:2379,http://etcd.node.2:2379,http://etcd.node.3:2379.~~

The bolt and etcd implementations were deprecated because they were not used in production; dropping them simplifies code maintenance.

Running as a Kurtosis Package

The following arguments can be passed to the package:

{
  // Set to false when developing locally or running in CI; metrics reporting will not be set up
  // If set to true, the snowflake fields must be set
  "is_running_in_prod": "false",

  // Token used to authenticate with GitHub
  // If empty, the AWS info will be used to retrieve the token
  "github_user_token": "",

  // Optionally, a custom version of the indexer image can be used. Useful to run a dev version, like on CI
  // If empty, will build a local image based on repo code
  "kurtosis_package_indexer_version": "0.0.32",

  // Snowflake fields for setting up metrics reporting if running in production
  "snowflake_env": {
    "kurtosis_snowflake_account_identifier": "<KURTOSIS_SNOWFLAKE_ACCOUNT_IDENTIFIER>",
    "kurtosis_snowflake_db": "<KURTOSIS_SNOWFLAKE_DB>",
    "kurtosis_snowflake_password": "<KURTOSIS_SNOWFLAKE_PASSWORD>",
    "kurtosis_snowflake_role": "<KURTOSIS_SNOWFLAKE_ROLE>",
    "kurtosis_snowflake_user": "<KURTOSIS_SNOWFLAKE_USER>",
    "kurtosis_snowflake_warehouse": "<KURTOSIS_SNOWFLAKE_WAREHOUSE>"
  },

  // If the service is expected to get the GitHub user token from an S3 bucket, set the aws fields
  // `aws_bucket_user_folder` can remain empty if the file containing the token is at the root of the bucket
  "aws_env": {
    "aws_access_key_id": "<AWS_KEY_ID_TO_AUTHENTICATE>",
    "aws_secret_access_key": "<AWS_SECRET_ACCESS_KEY_TO_AUTHENTICATE>",
    "aws_bucket_region": "<AWS_BUCKET_REGION>",
    "aws_bucket_name": "<AWS_BUCKET_NAME>",
    "aws_bucket_user_folder": "<OPTIONAL_FOLDER_IN_AWS_BUCKET>"
  }
}

Note that when this package runs on Kurtosis Cloud, it automatically uses the AWS environment variables provided to the package to fetch the GitHub token from AWS S3.
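The rule that the snowflake fields must be set when is_running_in_prod is true can be checked before launching the package. The Go sketch below is one way to do that, assuming the arguments are supplied as comment-free JSON and that is_running_in_prod is passed as a string, as in the example above. The function missingSnowflakeFields and the struct types are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

type snowflakeEnv struct {
	AccountIdentifier string `json:"kurtosis_snowflake_account_identifier"`
	DB                string `json:"kurtosis_snowflake_db"`
	Password          string `json:"kurtosis_snowflake_password"`
	Role              string `json:"kurtosis_snowflake_role"`
	User              string `json:"kurtosis_snowflake_user"`
	Warehouse         string `json:"kurtosis_snowflake_warehouse"`
}

type packageArgs struct {
	// Assumed to be a string ("true"/"false") as in the README example.
	IsRunningInProd string       `json:"is_running_in_prod"`
	Snowflake       snowflakeEnv `json:"snowflake_env"`
}

// missingSnowflakeFields returns the snowflake_env keys that are empty
// while is_running_in_prod is true. It returns nil outside production.
func missingSnowflakeFields(raw []byte) ([]string, error) {
	var args packageArgs
	if err := json.Unmarshal(raw, &args); err != nil {
		return nil, fmt.Errorf("parsing package arguments: %w", err)
	}
	if args.IsRunningInProd == "" {
		return nil, nil // treated as "false" in this sketch
	}
	inProd, err := strconv.ParseBool(args.IsRunningInProd)
	if err != nil {
		return nil, fmt.Errorf("is_running_in_prod: %w", err)
	}
	if !inProd {
		return nil, nil
	}
	fields := []struct{ key, value string }{
		{"kurtosis_snowflake_account_identifier", args.Snowflake.AccountIdentifier},
		{"kurtosis_snowflake_db", args.Snowflake.DB},
		{"kurtosis_snowflake_password", args.Snowflake.Password},
		{"kurtosis_snowflake_role", args.Snowflake.Role},
		{"kurtosis_snowflake_user", args.Snowflake.User},
		{"kurtosis_snowflake_warehouse", args.Snowflake.Warehouse},
	}
	var missing []string
	for _, f := range fields {
		if strings.TrimSpace(f.value) == "" {
			missing = append(missing, f.key)
		}
	}
	return missing, nil
}

func main() {
	raw := []byte(`{"is_running_in_prod": "true", "snowflake_env": {"kurtosis_snowflake_user": "svc"}}`)
	missing, err := missingSnowflakeFields(raw)
	fmt.Println(missing, err)
}
```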