jonchurch opened 4 years ago
So the issue was that indexing all of the repos takes a while; in my tests it is around 12 minutes. There are a ton of perf improvements (like parallelizing some of the calls) which could help, but IMO there is one easier and more important change: it just shouldn't index all projects on every run.
Things in many repos don't change that often, so if we just kept a flag for the last time something was successfully indexed, and limited the number of projects indexed on each run to something like 5, it would never hit the timeout. And since it runs every 2 hours, it would still take less than 24 hours to index each project.
Adding this would be as easy as, on completion for a repo, writing a marker to the leveldb, then using that marker to pick the oldest 5 instead of iterating the whole list.
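The selection step could be sketched like this (hypothetical names; `lastIndexedByName` stands in for whatever we read back out of leveldb). Projects with no record at all, e.g. on a fresh build or when a new repo is added to an org, sort to the front:

```javascript
// Sketch: pick the N least-recently-indexed projects. `projects` comes from
// the config file; `lastIndexedByName` maps project name -> last-index
// timestamp. A missing record defaults to 0, so never-indexed projects win.
function pickStaleProjects (projects, lastIndexedByName, n = 5) {
  return projects
    .slice() // don't mutate the caller's array
    .sort((a, b) => (lastIndexedByName[a] || 0) - (lastIndexedByName[b] || 0))
    .slice(0, n)
}

// Example: 'new-repo' has never been indexed, so it is picked first
const picked = pickStaleProjects(
  ['express', 'multer', 'new-repo', 'cors'],
  { express: 300, multer: 100, cors: 200 },
  2
)
console.log(picked) // prints [ 'new-repo', 'multer' ]
```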
AFK but been thinking about this
I haven't benchmarked anything yet, so I don't know if this would matter, but for v1 would it make sense to cut out the travis.yaml, package.json, and requests to NPM?
Assumptions here being that v1 focuses on getting issues from github to populate the statusboard.
I'm currently reading some leveldb docs and the associated code in the repo to implement the lastIndexed value for projects 👍
An aside, but I previously worked on writing GraphQL queries to get the issues and labels. I have the queries committed on a branch of my fork, but there is still a need (I'm assuming) for some repos to paginate past the 100-issue max that the GitHub GraphQL API limits you to.
cut out the travis.yaml, package.json, and requests to NPM?
Yeah it would make sense, and those are what I was talking about when I said "There are a ton of perf improvements". But in the end the github data is the longest time spent. By all means cut those calls out for now, but I am not sure that alone will get it reliably under time.
I'm currently reading some leveldb docs and the associated code in the repo to implement the lastIndexed value for projects 👍
Awesome! Let me know if you have questions.
paginate past the 100 issue max that the graphql github api limits you to
This is why I used the REST API; it was simpler and well documented. If we can use the GraphQL API then great!
The graphql API isn't super fun to use, but I was able to write some scratch code tonight that grabbed all the open issues across the 3 orgs in around 7 seconds when requesting the orgs sequentially.
There are only two repos across all the orgs that even have more than 100 open issues (multer and express). 100 is the most you can grab from a repo in a single trip.
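For the repos that do exceed 100, a cursor-based pagination loop over the GraphQL issues connection could look like this sketch (`runQuery` is a hypothetical helper that POSTs the query with the token and returns the parsed `repository.issues` connection, injected so it is easy to stub):

```javascript
// Sketch: follow pageInfo.endCursor until hasNextPage is false, collecting
// issues 100 at a time. Both OPEN and CLOSED are requested, since the
// existing implementation also tracks closed issues between indexes.
const ISSUES_QUERY = `
  query ($owner: String!, $name: String!, $after: String) {
    repository(owner: $owner, name: $name) {
      issues(first: 100, after: $after, states: [OPEN, CLOSED]) {
        pageInfo { hasNextPage endCursor }
        nodes { number title state }
      }
    }
  }
`

async function fetchAllIssues (runQuery, owner, name) {
  const issues = []
  let after = null
  do {
    const page = await runQuery(ISSUES_QUERY, { owner, name, after })
    issues.push(...page.nodes)
    after = page.pageInfo.hasNextPage ? page.pageInfo.endCursor : null
  } while (after)
  return issues
}
```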
Here's the gist, I ran this from the statusboard project (requires a GITHUB_TOKEN to be set in the env, and node-fetch).
I looked at the leveldb implementation for a while too; not ruling that out, but I wanted to revisit the GraphQL approach. I realized there would need to be some logic in place to handle a fresh build, or when a new repo is added to an org. If you're querying the db for the least recently updated project, that assumes all projects have already been indexed. So some logic would have to be written to check the DB for a project's existence, and prioritize that project if a record wasn't found.
The graphql API isn't super fun to use, but I was able to write some scratch code tonight that grabbed all the open issues across the 3 orgs in around 7 seconds when requesting the orgs sequentially.
The existing implementation also pulls closed issues, you will need to do this to track activity between indexes. An issue can be opened and resolved very quickly.
So some logic would have to be written to check the DB for a project's existence, and prioritize that one if a record wasn't found.
Yep, the config file would be the source of truth, and then projects with no last-index record would sort to the top.
I'm thinking now that the biggest perf gains would be in running some of this code concurrently.
Currently the control flow iterates over each org and repo, doing the work sequentially, essentially blocking until each repo is completed before starting over with the next org. I'm not sure how to switch to concurrency with the async generators.
I think this could build out a statusboard for all three express orgs in as little as a few minutes if we were able to run loadProjects concurrently for all repos in an org. Most repos currently take less than 30 seconds to process on their own.
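One way to get that concurrency without firing every request at once is a small bounded worker pool; this is a sketch under my own names, not the actual statusboard code (`fn` stands in for the per-repo loadProject work):

```javascript
// Sketch: run `fn` over `items` with at most `limit` tasks in flight.
// Because Node is single-threaded, reading and incrementing `next` between
// awaits is safe; results come back in input order.
async function mapWithConcurrency (items, limit, fn) {
  const results = new Array(items.length)
  let next = 0
  async function worker () {
    while (next < items.length) {
      const i = next++ // claim the next unprocessed item
      results[i] = await fn(items[i])
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  )
  await Promise.all(workers)
  return results
}
```

Bounding the pool size also matters for the per-process memory limit discussed below: the cap keeps only `limit` repos' worth of data in flight at once.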
I'm probably overthinking things, I'll come back to look at the code with fresh eyes on monday.
To benchmark loadProject for each repo I made the following additions locally in lib/db/build-index.js:
async function * iterateProjects (config) {
  // ...
    try {
      for await (let repo of github.getOrgRepos(graphQL, org.name)) {
+       const start = Date.now()
        const proj = new Project({
          repoOwner: repo.owner,
          repoName: repo.name
        })
        for await (let evt of loadProject(octokit, graphQL, proj, config, repo)) {
          yield evt
        }
+       console.log(`====FINISHED PROCESSING ${repo.name} in ${(Date.now() - start) / 1000} seconds`)
      }
    } catch (e) {
      yield projectDetail('ERROR', org, e)
    }
  }
}
So funny enough, I had to do something almost exactly like this at work, but on a much larger scale (around 8k projects), and it got me thinking about the next steps here. There is a very clear limit, the per-process memory limit, at which doing things in parallel breaks, and I think the express version of this would exceed that threshold. Up until that limit you are absolutely right, it greatly speeds things up.
FWIW, storing in leveldb is what I used to get past this limit here, and it is also what I did for my task at work. The downside in both cases is that the best way to get past the memory limit is to go multi-thread or multi-process, but leveldb is neither of these. That brings me to my next plan, which is to split into multiple leveldb instances, sharded by repo, to enable multi-process.
One other option would be to write the JSON to disk directly and use worker threads for the parallelism (with the same sharding by repo). The downside here is managing the writes to disk in a resilient and efficient way; this is more complicated than it seems, and leveldb does it well.
On a side note, check out this api for this kind of measurement: https://nodejs.org/api/perf_hooks.html
I wrote a comment last week that I never sent because it was too long, lol. Here's an abbreviated version.
I realized that the shortest path w/ biggest speed gain atm was to parallelize the network requests on a per org basis.
What I coded was ugly, IMO, but kicking off all network requests for an org got the three express orgs down to a 4-minute-and-change build time (total). I don't know how close I got to the memory limit, which is an important point.
I fetched the data for each repo, then handed it off to the generators for iteration.
That said, with the GraphQL work down to 11 minutes or so, if it doesn't time out in a GitHub Action, we're probably good to go on our MVP.
Yeah, that sounds good. The goal is to get it functional; no one cares, as long as the information is in front of people! I did a review of #10 yesterday, so once we get that merged we can update on https://github.com/expressjs/statusboard and merge that initial work PR.
@wesleytodd Per your comment https://github.com/expressjs/discussions/issues/92#issuecomment-578868525
Let's start a discussion about this here. I have cruised through this codebase and run it locally a few times, so I have some context as to how it works.
If I can't find the time to help, at least we'll have some record here of what should be done or is happening.