python / codespeed

A fork of Codespeed that includes the instances run at https://speed.python.org/ and https://speed.pypy.org

Backfill aarch64 results data for daily commits #42

Open diegorusso opened 7 months ago

diegorusso commented 7 months ago

We need to backfill aarch64 results data for daily commits spanning back at least a few months (to May 2023)

### Tasks
- [x] Get all commits in codespeed from May 2023 till today belonging to main branch
diegorusso commented 6 months ago

aarch64 results are currently being backfilled. I will update the issue once it's done.

diegorusso commented 6 months ago

And they are now backfilled. The issue can be closed.

mattip commented 6 months ago

We also need to do some of this for PyPy (speed.pypy.org) since we moved to git last month. How did you do the backfill?

diegorusso commented 6 months ago

I can share the scripts I used to do the migration and the backfill. Bear in mind this was very tailored to the case of speed.python.org. I'm not sure what the situation for pypy is.

mattip commented 6 months ago

Ahh. I assume these are DB manipulation scripts? Sure, that might be helpful to give me a hint on how to do this for us.

diegorusso commented 6 months ago

I'd commented on the issue with information that wasn't relevant, hence I deleted the comments. @mattip before giving you the right answer, can you tell me what exactly you need to do?

Questions I have:

mattip commented 6 months ago

We have results for pypy-64 and pypy-jit-64 that relate to python2.7. When we used mercurial (until Jan 6) the branch was named "default". Since then we use git, and the branch is named "main". So for this benchmark, for instance, you can see it has no results before Jan 6, since the branch for those results is "default". So my answers are

diegorusso commented 6 months ago

I was thinking along the lines of mapping commits from mercurial to git. For instance, these two commits are the same:

The above commits are the last ones the two repos have in common. Before those commits, one could try to map each mercurial commit to its git counterpart, so you wouldn't need to backfill anything: the data is already there. I'm not an expert in mercurial, though, so I don't know whether this is feasible.
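As a sketch of that idea: some repo conversion tools can emit an hg-to-git commit mapping as a side effect of the migration. Assuming you can produce a file with one `hg_sha git_sha` pair per line (the filename and format here are illustrative, not something Codespeed or mercurial provides out of the box), loading it is trivial:

```python
# Illustrative only: load an hg -> git commit mapping from a file of
# "<hg_sha> <git_sha>" lines. This file format is an assumption; your
# conversion tool may emit something different.
def load_commit_map(path):
    mapping = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            hg_sha, git_sha = line.split()
            mapping[hg_sha] = git_sha
    return mapping
```

With that dict in hand, looking up the git sha for any hg revision in the database becomes a single dictionary access.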

If it turns out you cannot convert the mercurial revisions to git revisions in the database (I think it is possible, though), then you can resort to an actual backfill. The process could be easy, depending on your machine. If you want to run the benchmarks sequentially, I guess you can stick all the commits you want at the end of `benchmark.conf` and run pyperformance with `compile_all`. The format is `sha1=main`.
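For the sequential route, appending those `sha1=branch` entries could itself be scripted. This is a sketch that assumes the revisions live in a `compile_all_revisions` section of the config, as in pyperformance's documented config format; check your own `benchmark.conf` for the exact section name:

```python
# Sketch: append backfill revisions to a pyperformance benchmark.conf.
# The section name "compile_all_revisions" is an assumption based on
# the pyperformance config format; verify it against your config.
import configparser

def add_revisions(conf_path, revisions):
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep shas and branch names case-sensitive
    parser.read(conf_path)
    if not parser.has_section("compile_all_revisions"):
        parser.add_section("compile_all_revisions")
    for sha, branch in revisions:
        # configparser writes each entry as "sha = branch"
        parser.set("compile_all_revisions", sha, branch)
    with open(conf_path, "w") as f:
        parser.write(f)
```

After running it, a single `pyperformance compile_all benchmark.conf` invocation would work through the whole list.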

If you want to run things in parallel (we did, because we have CPU isolation on the AArch64 machine), collect all the revisions you want to test in a txt file and write a simple Python script that calls pyperformance for every revision.

Our script was something like this:

from multiprocessing import Pool
from pathlib import Path
import subprocess

def get_revisions():
    # Each line of backfill_shas.txt is expected to be "sha=branch"
    revisions = []
    with open("backfill_shas.txt", "r") as f:
        for line in f:
            sha, branch = line.split("=")
            revisions.append((sha, branch.rstrip()))
    return revisions

def run_pyperformance(revision):
    sha, branch = revision
    print(f"Running run-pyperformance.sh with sha: {sha}, branch: {branch}")
    output_dir = Path("output")
    output_dir.mkdir(parents=True, exist_ok=True)
    out_file = output_dir / f"{branch}-{sha}.out"
    err_file = output_dir / f"{branch}-{sha}.err"
    with open(out_file, "w") as output, open(err_file, "w") as error:
        subprocess.run([
            "./run-pyperformance.sh",
            "-x",
            "--",
            "compile",
            "benchmark.conf",
            sha,
            branch,
        ],
        stdout=output,
        stderr=error,
        )

if __name__ == '__main__':
    # 6 worker processes; tune this to the number of isolated CPUs
    pool = Pool(6)
    try:
        # Use the blocking map() so a Ctrl-C during the runs is caught
        # here; map_async() returns immediately, and the except clause
        # would never fire while the workers are still running.
        pool.map(run_pyperformance, get_revisions())
    except KeyboardInterrupt:
        print("Caught KeyboardInterrupt, terminating workers")
        pool.terminate()
    else:
        print("Normal termination")
        pool.close()
    pool.join()

./run-pyperformance.sh is a wrapper script with some logic around running pyperformance in parallel (lock files, etc.)
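The wrapper script itself isn't shown here. As an illustration of the lock-file part only (the function and path names below are made up, not taken from the actual script), a minimal exclusive-lock guard in Python could look like:

```python
# Illustration of a lock-file guard similar in spirit to what a
# wrapper like run-pyperformance.sh needs: fcntl.flock takes an
# exclusive lock, so two concurrent runs never touch the same
# shared resource at once (POSIX only).
import fcntl

def with_lock(lock_path, fn):
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return fn()
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Each worker would wrap the critical part of its run in `with_lock(...)` so the parallel invocations serialize on the shared resource.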

Depending on how many revisions you need to backfill, the process could be lengthy. I strongly suggest the first approach, the one that remaps your existing data onto the new git revisions.

mattip commented 6 months ago

Maybe we have drifted off the original issue far enough that the title needs expanding, or I can open a new issue?

We can generate a bi-directional mapping of commits on PyPy between mercurial and git using the methodology we used to migrate the repo. From the GUI, I can drill down to a particular benchmark result for a particular revision, and I see there I can add another revision. Any idea how I would do that in a SQL query? Then I could, for each interesting hg hash, find all the results and add the corresponding git hash.

diegorusso commented 6 months ago

> Maybe we have drifted off the original issue far enough that the title needs expanding, or I can open a new issue?

I don't mind either way.

> We can generate a bi-directional mapping of commits on PyPy between mercurial and git using the methodology we used to migrate the repo. From the GUI, I can drill down to a particular benchmark result for a particular revision, and I see there I can add another revision. Any idea how I would do that in a SQL query? Then I could, for each interesting hg hash, find all the results and add the corresponding git hash.

OK, if there is a way to map mercurial revisions to git revisions, I would strongly suggest adding these revisions to Codespeed and then associating the results with the new git revisions. It will be much easier than re-running two years' worth of benchmarks.

To do this, I strongly suggest not using SQL directly but the Django ORM, as it is less error-prone and easier to read. As an example, here is the script we used to migrate data from the master branch to main:

from codespeed.models import Result, Revision, Branch, Report
from django.core.exceptions import ObjectDoesNotExist

# Get the branches
master_branch = Branch.objects.get(name="master")
main_branch = Branch.objects.get(name="main")

# Get all master Results
master_results = Result.objects.filter(revision__branch__name=master_branch.name)

# We need to iterate over the master results and change
# the branch of the revision from master to main
for result in master_results:
    revision = result.revision
    # We have two cases
    try:
        # Case 1: two revisions exist with the same commit id but different
        # branches. Get the main-branch revision and point the result at it.
        new_revision = Revision.objects.get(commitid=revision.commitid, branch=main_branch)
        result.revision = new_revision
        result.save()
    except ObjectDoesNotExist:
        # Case 2: no main-branch revision exists, so move the current one to main.
        revision.branch = main_branch
        revision.save()
    print(result)

# We need to update the reports as well.
master_reports = Report.objects.filter(revision__branch__name=master_branch.name)

for report in master_reports:
    revision = report.revision
    new_revision = Revision.objects.get(commitid=revision.commitid, branch=main_branch)
    report.revision = new_revision
    report.save()
    print(report)

I hope this helps you figure out the logic for mapping mercurial commits to git commits and then changing the results to point to the new git commits.
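Stripped of the ORM, the hg-to-git remapping described above reduces to a dictionary lookup. A plain-Python sketch of just that logic (the `commitid` field name mirrors Codespeed's Revision model; the dict-of-results shape is illustrative):

```python
# Sketch of the remapping independent of Django: each result carries
# a commit id, and we repoint hg shas to their git counterparts where
# a mapping exists, leaving unmapped results untouched.
def remap_results(results, hg_to_git):
    remapped = []
    for result in results:
        commitid = result["commitid"]
        # Fall back to the original id when there is no mapping for it.
        remapped.append(dict(result, commitid=hg_to_git.get(commitid, commitid)))
    return remapped
```

In the real migration the same lookup would drive the ORM updates, analogous to the master-to-main script above.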

mattip commented 6 months ago

Thanks, exactly what I was looking for.