metaspace2020 / metaspace

Cloud engine and platform for metabolite annotation for imaging mass spectrometry
https://metaspace2020.eu/
Apache License 2.0

Calculate cost for each step of the dataset #1484

Closed sergii-mamedov closed 4 months ago

sergii-mamedov commented 4 months ago

General information is available in this task.

Technical aspects

Cloudwatch

Each Lambda function invocation writes logs and metrics to Cloudwatch Logs (CW Logs). To calculate costs, we use the last line of the logs:

REPORT RequestId: aa4d6d3e-ffb2-47e0-9593-c37321c4af93  Duration: 9717.25 ms    Billed Duration: 9718 ms    Memory Size: 2048 MB    Max Memory Used: 523 MB

Multiplying the Billed Duration by the Memory Size, and knowing the price of 1 GB*sec, gives the cost of each Lambda function run. For speed, we query CW only for the dataset's processing window (start/finish from the perf_profile table). Next, we match request_id from the perf_profile_entry table against RequestId from CW Logs; this way we can calculate the cost of each step. Unfortunately, AWS flushes Lambda logs to CW with a rather long delay (1-3 minutes). This is partly masked by moving the `save_size_hash` function before the cost calculation (the MD5 hash calculation for medium and large datasets takes tens to hundreds of seconds). Even so, we often have to wait tens of seconds until all logs are available.
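The parsing and arithmetic above can be sketched roughly like this (the price constant is illustrative; the actual per-GB·sec rate depends on region and architecture, and the function name is hypothetical):

```python
import re

# Illustrative on-demand price for 1 GB*sec of Lambda compute;
# the real value depends on region and architecture.
PRICE_PER_GB_SECOND = 0.0000166667

# Matches the final REPORT line of a Lambda invocation in CW Logs.
REPORT_RE = re.compile(
    r"REPORT RequestId: (?P<request_id>[0-9a-f-]+)\s+"
    r".*?Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB"
)

def lambda_invocation_cost(report_line: str) -> tuple[str, float]:
    """Parse a REPORT log line and return (request_id, cost in USD)."""
    m = REPORT_RE.search(report_line)
    if m is None:
        raise ValueError("not a REPORT line")
    # GB*sec = billed seconds * configured memory in GB
    gb_seconds = (float(m["billed_ms"]) / 1000.0) * (int(m["memory_mb"]) / 1024.0)
    return m["request_id"], gb_seconds * PRICE_PER_GB_SECOND

line = (
    "REPORT RequestId: aa4d6d3e-ffb2-47e0-9593-c37321c4af93  "
    "Duration: 9717.25 ms    Billed Duration: 9718 ms    "
    "Memory Size: 2048 MB    Max Memory Used: 523 MB"
)
request_id, cost = lambda_invocation_cost(line)
```

The request_id returned here is what gets matched against perf_profile_entry to attribute the cost to a step.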

EC2

Information about the start and stop of an EC2 instance is available in CloudTrail. However, this information is redundant for us. In the perf_profile_entry table, we have information about the start/finish of each step, and this coincides with the data from CloudTrail.
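Since the step boundaries in perf_profile_entry match CloudTrail anyway, the EC2 cost per step reduces to duration times an hourly rate. A minimal sketch, assuming per-second billing and a hypothetical hourly rate (the actual rate depends on instance type and region):

```python
from datetime import datetime

# Hypothetical on-demand hourly rate; the real value depends on
# instance type and region.
EC2_HOURLY_RATE = 0.192

def ec2_step_cost(start: datetime, finish: datetime,
                  hourly_rate: float = EC2_HOURLY_RATE) -> float:
    """Cost of one processing step, assuming per-second billing."""
    seconds = (finish - start).total_seconds()
    return seconds / 3600.0 * hourly_rate

# E.g. a 10-minute step on the assumed instance:
cost = ec2_step_cost(
    datetime(2024, 5, 1, 12, 0, 0),
    datetime(2024, 5, 1, 12, 10, 0),
)
```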