plazi / gg2rdf

A tool to transform GoldenGATE XML to RDF Turtle
MIT License

Speed up transformation using Caching #3

Closed nleanba closed 7 months ago

nleanba commented 1 year ago

(See https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows for documentation)

Currently, most of the time is spent fetching, cloning, and pushing the two repositories; the actual transformation in between is usually very short.

If we cached the repos and only pulled to get the current state, instead of re-cloning every time a single file changes, I think we could speed each transformation up by 5 to 10 minutes.

However, a repository may only have up to 10 GB of cached data at a time. I'm not entirely sure whether this rules out caching the repos — how big are they?

If size is a constraint, it might be sensible to cache only one of them (still a speed-up) or, if the issue is the git history, to remove the history locally after pulling, before updating the cache.

If all else fails, at the very least maybe cache the dependencies (apt packages and deno)?
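As a sketch of the dependency-caching fallback, the workflow could use the official `actions/cache` action to persist Deno's module cache between runs. The cache path and key format below are assumptions (the default `DENO_DIR` and a `deno.lock` file), not taken from the actual gg2rdf workflow:

```yaml
# Hypothetical step for the gg2rdf workflow: cache Deno's module cache.
# Assumes the default DENO_DIR (~/.cache/deno) and a deno.lock lockfile.
- name: Cache Deno dependencies
  uses: actions/cache@v4
  with:
    path: ~/.cache/deno
    key: deno-${{ runner.os }}-${{ hashFiles('**/deno.lock') }}
    restore-keys: |
      deno-${{ runner.os }}-
```

With `restore-keys`, a run without an exact lockfile match still restores the most recent cache for the OS, so only changed modules need to be re-fetched. apt packages could be cached similarly, though restoring them is fiddlier since `actions/cache` only restores files, not installed package state.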

@retog opinions?

nleanba commented 1 year ago

Cloning locally and running `du -sh` tells me that treatments-xml is 16G when shallow-cloned (`--depth=2`) and ~19G with full history.

That rules out caching the repos.

nleanba commented 7 months ago

This issue is obsolete now that gg2rdf runs on our own servers rather than as a GitHub Action.