spotify / XCMetrics

XCMetrics is the easiest way to collect Xcode build metrics and improve developer productivity.
https://xcmetrics.io

Build log greater than 1MB crash XCMetrics Backend when writing to Postgres #61

Closed ghost closed 2 years ago

ghost commented 2 years ago

I have been trying to set up XCMetrics for a large application I work on as part of my day job.

There is an issue when processing the build logs as it seems to time out the connection to Postgres, which in turn crashes the xcmetrics container.

The next line in the logs is then the xcmetrics backend rebooting (I've got restart set to always in my docker compose file).

I tried experimenting with a much smaller app to see if it was related to log size, and based on my observations the log size does appear to be the cause. The smaller app's logs always work, pretty much immediately, but the larger app's logs never finish processing.
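For anyone wanting to compare log sizes between projects in the same way, the following is a minimal sketch that lists build logs by size. It assumes the logs are the `.xcactivitylog` files Xcode writes under DerivedData; the function name `list_build_logs_by_size` is made up for illustration. Note that these files are gzip-compressed, so the on-disk size understates the amount of text that ends up being parsed.

```shell
#!/bin/sh
# Hypothetical helper: print "<bytes> <path>" for every .xcactivitylog
# found under the given directory, largest first.
list_build_logs_by_size() {
    log_dir="$1"
    find "$log_dir" -type f -name '*.xcactivitylog' | while IFS= read -r log; do
        # wc -c reports the on-disk (compressed) size in bytes
        printf '%s %s\n' "$(wc -c < "$log" | tr -d ' ')" "$log"
    done | sort -rn
}

# Example (path is Xcode's default DerivedData location):
# list_build_logs_by_size "$HOME/Library/Developer/Xcode/DerivedData"
```

Running it against the DerivedData folders of both a small and a large app should make any big difference in log size obvious before the logs are uploaded.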

Attached below is an image showing the network requests to the backend. You can see they are sent to the backend server successfully, but the issue happens during processing. The large requests never finish processing (I have left them for more than 24 hours) because of a Postgres error: "unexpected EOF on client connection with an open transaction".

[Screenshot: logs_too_big]

And here is a screenshot of the job_log_entries table in Postgres, the red ones are from the big app, the green ones from the small app.

[Screenshot: 2021-11-09 at 3:14:10 pm]

Are these logs too big for XCMetrics or is something else going on here? Do you have any benchmark for the largest parsable file you've managed to process?

Thanks for making this tool and hope we can figure this out so our team can use this tool!

ghost commented 2 years ago

Okay, figured this out: we had SwiftLint on in a build step, which was causing up to 240 MB of log output per run.

I figured this out after a colleague said they were able to get this running when they gave more resources to Docker Engine. Sure enough, I did the same and it worked (4 cores, 8 GB RAM). However, that's absurd for what this tool is doing, so it sounded like an optimisation problem. I started thinking about how I could figure out what xcmetrics/Postgres was spending its time doing. The first port of call was to look for large tables in Postgres by running:

select schemaname as table_schema,
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as data_size,
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid))
      as external_size
from pg_catalog.pg_statio_user_tables
order by pg_total_relation_size(relid) desc,
         pg_relation_size(relid) desc
limit 10;

This returned a table showing that a build_steps partition was taking up 480 MB, even though I had only run two builds. That didn't sound right. Postgres died when I tried to query this table, so I was fairly sure this was the issue.

Adding --quiet to SwiftLint then reduced the log size to about 120 KB per log file per run. I was able to drop the Docker Engine resources back down again, and all was right in the world.
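As a rough sketch of what that change can look like, here is a possible Xcode "Run Script" build phase. The wrapper function `run_swiftlint_quietly` and the warning text are illustrative assumptions, not the exact script from our project; the only part taken from the fix above is the `--quiet` flag.

```shell
#!/bin/sh
# Hypothetical "Run Script" build phase: lint quietly so SwiftLint's
# per-file status lines stay out of the build log.
run_swiftlint_quietly() {
    if command -v swiftlint >/dev/null 2>&1; then
        # --quiet suppresses status logs such as "Linting <file>";
        # violations are still reported.
        swiftlint lint --quiet
    else
        echo "warning: SwiftLint not installed, skipping lint"
    fi
}

run_swiftlint_quietly
```

The guard around the call only keeps the build from failing on machines without SwiftLint; the log-size reduction comes entirely from `--quiet`.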

Commenting in case anyone else comes across this issue :).