nmdp-bioinformatics / gfe-db

Graph database representing IPD-IMGT/HLA sequence data as GFE
https://gfe-db.readthedocs.io
GNU General Public License v3.0
9 stars 15 forks source link

Handle backups, validate build output and execute validation queries in state machine #88

Closed chrisammon3000 closed 1 year ago

chrisammon3000 commented 1 year ago

Description

Toggle Automated Backups on/off

When a Step Functions execution is started, the alarm status will be ALARM for the duration of the execution until it succeeds, fails, or aborts. When the execution is finished the ALARM will be OK. When the alarm status becomes ALARM, a notification is sent to SNS which triggers the DisableBackup Lambda function to disable the Neo4j Backup Maintenance Window using the SSM API. When the alarm status changes to OK another SNS notification is sent and the DisableBackup function re-enables the maintenance window.

The logic used by the alarm evaluates several Step Functions execution metrics:

If the expression:

sum(ExecutionsStarted) - (sum(ExecutionsSucceeded, ExecutionsFailed, ExecutionsAborted)) > 0

... evaluates to true, the alarm status changes to ALARM. It will return to OK when the expression evaluates to false. This is how the alarm status is toggled based on the current number of executions in progress and finished/failed executions in the time period sampled.

Notifications are sent for each status change to the DataPipelineExecution topic and available to all subscribers.

Pre/Post Execution Backups

States are added before the build process and after the load process to run the backup script using Lambda and the Systems Manager Run Command API. The DocumentName and CommandId are returned to the state machine.

The results are stored under $.backups.pre and $.backups.post in the state machine output.

Build Output Validation

Lambda function using Polars library to run validation checks on the CSV files output in the previous build step for each release. Checks S3 assets at the file level and file contents level. If all checks pass the execution will proceed to the load stage. If any checks fail the input payload will be marked as invalid and the execution will stop.

File checks:

File contents checks:

Validation details are stored under $.validations.build in the state machine output.

Pre/Post Execution Validation Queries

Lambda function using the Neo4j Python driver to run Cypher statements against Neo4j before and after the execution. The results of these steps are meant to be used for monitoring purposes and will not affect the execution's logic.

Cypher Statements

Node counts are returned for each label in the database.

MATCH (n:GFE) RETURN count(n) as count;
MATCH (n:IPD_Accession) RETURN count(n) as count;
MATCH (n:IPD_Allele) RETURN count(n) as count;
MATCH (n:Sequence) RETURN count(n) as count;
MATCH (n:Feature) RETURN count(n) as count;
MATCH (n:Submitter) RETURN count(n) as count;

Number of distinct release versions stored on the HAS_IPD_ALLELE edge.

MATCH (:GFE)-[r:HAS_IPD_ALLELE]->(:IPD_Allele)
WITH r, apoc.coll.toSet(r.releases) as releases
UNWIND releases as release_version
RETURN DISTINCT release_version, count(release_version) as count
ORDER BY release_version;

Number of distinct release versions stored on the HAS_IPD_ACCESSION edge.

MATCH ()-[r:HAS_IPD_ACCESSION]->() RETURN DISTINCT r.release as release_version, count(r.release) as count;

Validation details are stored under $.validations.queries in the state machine output.

State Machine Output

{
  "input": [
    {
      "ALIGN": "False",
      "KIR": "False",
      "LIMIT": "1000",
      "MEM_PROFILE": "False",
      "RELEASES": "3420"
    }
  ],
  "validations": {
    "queries": {
      "pre": {
        "node_counts": [
          {
            "node": "GFE",
            "count": 1232
          },
          {
            "node": "IPD_Accession",
            "count": 1215
          },
          {
            "node": "IPD_Allele",
            "count": 1217
          },
          {
            "node": "Sequence",
            "count": 1232
          },
          {
            "node": "Feature",
            "count": 1309
          },
          {
            "node": "Submitter",
            "count": 1
          }
        ],
        "has_ipd_allele_release_counts": [
          {
            "release_version": "3430",
            "count": 1000
          },
          {
            "release_version": "3440",
            "count": 1000
          },
          {
            "release_version": "3450",
            "count": 1000
          },
          {
            "release_version": "3460",
            "count": 1000
          },
          {
            "release_version": "3470",
            "count": 1000
          },
          {
            "release_version": "3480",
            "count": 1000
          },
          {
            "release_version": "3490",
            "count": 1000
          },
          {
            "release_version": "3510",
            "count": 1000
          },
          {
            "release_version": "3520",
            "count": 1000
          }
        ],
        "ipd_accession_release_counts": [
          {
            "release_version": "3.51.0",
            "count": 1000
          },
          {
            "release_version": "3.52.0",
            "count": 8
          },
          {
            "release_version": "3.49.0",
            "count": 14
          },
          {
            "release_version": "3.48.0",
            "count": 16
          },
          {
            "release_version": "3.47.0",
            "count": 12
          },
          {
            "release_version": "3.46.0",
            "count": 52
          },
          {
            "release_version": "3.45.0",
            "count": 35
          },
          {
            "release_version": "3.44.0",
            "count": 18
          },
          {
            "release_version": "3.43.0",
            "count": 77
          }
        ]
      },
      "post": {
        "node_counts": [
          {
            "node": "GFE",
            "count": 1271
          },
          {
            "node": "IPD_Accession",
            "count": 1254
          },
          {
            "node": "IPD_Allele",
            "count": 1256
          },
          {
            "node": "Sequence",
            "count": 1271
          },
          {
            "node": "Feature",
            "count": 1334
          },
          {
            "node": "Submitter",
            "count": 1
          }
        ],
        "has_ipd_allele_release_counts": [
          {
            "release_version": "3420",
            "count": 1000
          },
          {
            "release_version": "3430",
            "count": 1000
          },
          {
            "release_version": "3440",
            "count": 1000
          },
          {
            "release_version": "3450",
            "count": 1000
          },
          {
            "release_version": "3460",
            "count": 1000
          },
          {
            "release_version": "3470",
            "count": 1000
          },
          {
            "release_version": "3480",
            "count": 1000
          },
          {
            "release_version": "3490",
            "count": 1000
          },
          {
            "release_version": "3510",
            "count": 1000
          },
          {
            "release_version": "3520",
            "count": 1000
          }
        ],
        "ipd_accession_release_counts": [
          {
            "release_version": "3.51.0",
            "count": 1000
          },
          {
            "release_version": "3.52.0",
            "count": 8
          },
          {
            "release_version": "3.49.0",
            "count": 14
          },
          {
            "release_version": "3.48.0",
            "count": 16
          },
          {
            "release_version": "3.47.0",
            "count": 12
          },
          {
            "release_version": "3.46.0",
            "count": 52
          },
          {
            "release_version": "3.45.0",
            "count": 35
          },
          {
            "release_version": "3.44.0",
            "count": 18
          },
          {
            "release_version": "3.43.0",
            "count": 77
          },
          {
            "release_version": "3.42.0",
            "count": 39
          }
        ]
      }
    },
    "build": {
      "validated": [
        {
          "ALIGN": "False",
          "KIR": "False",
          "LIMIT": "1000",
          "MEM_PROFILE": "False",
          "RELEASES": "3420"
        }
      ],
      "build_details": [
        {
          "release": "3420",
          "details": [
            {
              "schema": "all_cds",
              "release": "3420",
              "file_path": "s3://dev-gfe-db-810526023897-us-east-1/data/3420/csv/all_cds.3420.csv",
              "cols": [
                "gfe_name",
                "bp_seq_id",
                "bp_sequence",
                "aa_seq_id",
                "aa_sequence"
              ],
              "num_rows": 1000,
              "created_utc": "2023-07-31T23:48:38.000Z",
              "details": {
                "is_valid_csv_timestamp": true,
                "is_valid_csv_filename": true,
                "is_valid_csv_headers": true,
                "is_valid_csv_rows": true
              },
              "num_errors": 0,
              "is_valid_csv": true
            },
            {
              "schema": "all_features",
              "release": "3420",
              "file_path": "s3://dev-gfe-db-810526023897-us-east-1/data/3420/csv/all_features.3420.csv",
              "cols": [
                "accession",
                "hash_code",
                "locus",
                "rank",
                "sequence",
                "term",
                "gfe_name",
                "allele_id",
                "hla_name",
                "imgt_release"
              ],
              "num_rows": 11090,
              "created_utc": "2023-07-31T23:48:38.000Z",
              "details": {
                "is_valid_csv_timestamp": true,
                "is_valid_csv_filename": true,
                "is_valid_csv_headers": true,
                "is_valid_csv_rows": true
              },
              "num_errors": 0,
              "is_valid_csv": true
            },
            {
              "schema": "all_groups",
              "release": "3420",
              "file_path": "s3://dev-gfe-db-810526023897-us-east-1/data/3420/csv/all_groups.3420.csv",
              "cols": [
                "gfe_name",
                "allele_id",
                "hla_name",
                "ard_id",
                "ard_name",
                "locus",
                "imgt_release"
              ],
              "num_rows": 3000,
              "created_utc": "2023-07-31T23:48:38.000Z",
              "details": {
                "is_valid_csv_timestamp": true,
                "is_valid_csv_filename": true,
                "is_valid_csv_headers": true,
                "is_valid_csv_rows": true
              },
              "num_errors": 0,
              "is_valid_csv": true
            },
            {
              "schema": "gfe_sequences",
              "release": "3420",
              "file_path": "s3://dev-gfe-db-810526023897-us-east-1/data/3420/csv/gfe_sequences.3420.csv",
              "cols": [
                "gfe_name",
                "acc_name",
                "locus",
                "hla_name",
                "seq_id",
                "sequence",
                "length",
                "imgt_release"
              ],
              "num_rows": 1000,
              "created_utc": "2023-07-31T23:48:38.000Z",
              "details": {
                "is_valid_csv_timestamp": true,
                "is_valid_csv_filename": true,
                "is_valid_csv_headers": true,
                "is_valid_csv_rows": true
              },
              "num_errors": 0,
              "is_valid_csv": true
            }
          ],
          "errors": [],
          "expected_artifacts": [
            "all_cds.3420.csv",
            "all_features.3420.csv",
            "all_groups.3420.csv",
            "gfe_sequences.3420.csv"
          ],
          "is_valid_build": true
        }
      ],
      "has_valid_payload": true
    }
  },
  "backups": {
    "pre": {
      "document_name": "dev-gfe-db-database-Neo4jBackupDocument-nAqIhQzAdtbD",
      "command_id": "bd6b8af8-e83a-46ae-8281-197882fbd136"
    },
    "post": {
      "document_name": "dev-gfe-db-database-Neo4jBackupDocument-nAqIhQzAdtbD",
      "command_id": "645f07b0-5f1e-4b09-975f-95ca35bc0752"
    }
  }
}
chrisammon3000 commented 1 year ago

@pbashyal-nmdp conflicts resolved, this PR deploys and runs successfully

pbashyal-nmdp commented 1 year ago

Thanks @chrisammon3000 ! I'll take a look