
Knowledge base for The Weekly Challenge club members using Perl, Raku, Ada, APL, Awk, Bash, BASIC, Bc, Befunge-93, Bourne Shell, BQN, Brainfuck, C3, C, CESIL, C++, C#, Clojure, COBOL, Coconut, Crystal, D, Dart, Dc, Elm, Emacs Lisp, Erlang, Excel VBA, Fennel, Fish, Forth, Fortran, Gembase, GNAT, Go, Haskell, Haxe, HTML, Idris, IO, J, Janet, Java, JavaScript, Julia, Kotlin, Lisp, Lua, M4, Miranda, Modula 3, MMIX, Mumps, Myrddin, Nim, Nix, Node.js, Nuweb, OCaml, Odin, Ook, Pascal, PHP, Python, Postscript, Prolog, R, Ring, Ruby, Rust, Scala, Scheme, Sed, Smalltalk, SQL, Swift, Tcl, TypeScript, Visual BASIC, WebAssembly, Wolfram, XSLT and Zig.
https://theweeklychallenge.org

git operations take a lot of time #7358

Open · andinus opened 1 year ago

andinus commented 1 year ago

Currently there are over 70,000 files in this repository, and every week we add hundreds more (each week a directory is created for every user and the previous week's "README" is copied).

I started participating with challenge-076. According to my records I've submitted solutions for 25 challenges, so there are ~100 useless directories with my name and only a README file. With around 300 users, this adds up.
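
For a rough count of such README-only directories, something along these lines works (a sketch; assumes bash and GNU find, run from the clone root):

find . -path ./.git -prune -o -type d -print0 |
  while IFS= read -r -d '' dir; do
    entries=$(ls -A "$dir")   # a directory qualifies if its only entry is a README
    if [ "$entries" = "README" ] || [ "$entries" = "README.md" ]; then
      echo "$dir"
    fi
  done | wc -l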

My primary machine is not very fast, and it takes about 70 seconds to run git status:

andinus@ ~//perlweeklychallenge-club > time git status                
On branch master                                                                              
Your branch is up to date with 'origin/master'.                                               

It took 54.97 seconds to enumerate untracked files. 'status -uno'                             
may speed it up, but you have to be careful not to forget to add                              
new files yourself (see 'git help status').                                                   
nothing to commit, working tree clean                                                         

________________________________________________________                                      
Executed in   71.65 secs                                       

andinus@ ~//perlweeklychallenge-club > time git status -uno                 
On branch master                                                                              
Your branch is up to date with 'origin/master'.                                               

nothing to commit (use -u to show untracked files)

________________________________________________________                                      
Executed in   16.89 secs
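
Independent of any repo cleanup, git itself has knobs that speed up exactly this untracked-file enumeration; a sketch (assumes a reasonably recent git; the built-in fsmonitor is only available on some platforms):

git config core.untrackedCache true   # cache untracked-file scans between runs
git update-index --untracked-cache    # test and enable the cache for this repo
git config core.fsmonitor true        # built-in filesystem monitor (git >= 2.37)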
ealvar3z commented 1 year ago

@andinus I am assuming that you've done a shallow clone?!
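
For reference, a shallow clone is along these lines (depth 1 assumed):

git clone --depth 1 https://github.com/manwar/perlweeklychallenge-club.git

Note that a shallow clone trims history, not the working tree, so the untracked-file enumeration in git status is unlikely to benefit much from it.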

If you have and you're still seeing performance issues, I can submit a patch (PR) for this issue. What I have in mind is a simple script that runs:

git repack && git prune-packed && git reflog expire --expire=1.month.ago && git gc --aggressive

Add it to a GH workflow that crons it every week.

Thoughts, @manwar?

P.S.: @andinus, if upstream does not want the proposed PR, note that you can still apply this to your local clone.

ealvar3z commented 1 year ago

I've just seen that the scripts directory has attempted this already, so the solution may not lie upstream.

rcmlz commented 1 year ago

I am also in favour of doing some housekeeping. I use zsh with some git integration, and the (by now) ~90k files slow down the shell. Could the "historic" commits perhaps be squashed automatically, so that we have only a single commit per week on master?
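
One way such a squash could look, as a destructive sketch (the branch name squashed is an assumption, and the force-push would have to be coordinated with every fork):

git checkout --orphan squashed   # new root commit with the current tree, no history
git add -A
git commit -m "Squash full history into a single commit"
git branch -M squashed master    # replace master locally
git push --force origin master   # rewrites upstream history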

ealvar3z commented 1 year ago

@andinus I think your recommendation is the best and quickest approach (i.e. deleting stale dirs w/ README files). I ran a test locally and this is what I got:

Before I ran script/cleanup_readme_only

╔ eax@nix:test_perlweeklychallenge-club(issue/7358)
╚ λ time gs
Refresh index: 100% (88731/88731), done.
On branch issue/7358
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        script/cleanup_readme_only

nothing added to commit but untracked files present (use "git add" to track)

real    0m3.350s
user    0m1.562s
sys     0m2.123s

Running script/cleanup_readme_only

This is how long the shell script took to run. However, this may just be a one-time cost, since it deleted the entirety of the repo's history.

╔ eax@nix:test_perlweeklychallenge-club(issue/7358)
╚ λ time bash -c script/cleanup_readme_only

real    2m1.530s
user    4m33.764s
sys     3m12.803s

It got rid of 39k files (see below), but we could do better.

╔ eax@nix:test_perlweeklychallenge-club(issue/7358)
╚ λ git diff --name-only HEAD~ | wc -l
39066

Running git status after script/cleanup_readme_only

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        script/cleanup_readme_only

no changes added to commit (use "git add" and/or "git commit -a")

real    0m1.082s
user    0m0.602s
sys     0m0.817s

A great improvement, but the script is too slow (even with xargs), so I rewrote it in Go! See the speed improvement below.

╔ eax@nix:test_perlweeklychallenge-club(issue/7358)
╚ λ time bin/cleanup

real    0m2.658s
user    0m2.780s
sys     0m5.675s

Night and day!!!

@manwar: let me know if this is a desirable action, and I'll submit the PR (all the code and local tests are complete). See the GH Actions workflow below:

name: Cleanup Readmes From Repository

on:
  schedule:
    - cron:  '0 0 * * 0'  # Run at midnight every Sunday

jobs:
  cleanup:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v2

    - name: Setup Go
      uses: actions/setup-go@v2
      with:
        go-version: 1.17

    - name: Build Go Script
      run: go build -o bin/cleanup bin/main.go 

    - name: Execute Cleanup
      run: ./bin/cleanup
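
One gap in this sketch: the workflow deletes files in the runner's checkout but never commits them, so nothing would change in the repository. A final step would have to persist the deletions, roughly (bot identity and commit message are assumptions):

git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add -A
git commit -m "Remove README-only directories" || echo "nothing to remove"
git push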
andinus commented 1 year ago

I am assuming that you've done a shallow clone?!

git repack && git prune-packed && git reflog expire --expire=1.month.ago && git gc --aggressive

IIRC it was still slow even after a shallow clone and running this ^. @ealvar3z Can you share the script? I'll try running it and report back.

ealvar3z commented 1 year ago

@andinus

Please be advised that I ran this on a separate repo: cp -r perlweeklychallenge-club/ test_perlweeklychallenge-club

Here's main.go:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "runtime"
    "sync"
)

// isReadmeOnly reports whether dir contains exactly one entry named
// README or README.md.
func isReadmeOnly(dir string) bool {
    files, err := os.ReadDir(dir)
    if err != nil {
        return false
    }
    return len(files) == 1 &&
        (files[0].Name() == "README" || files[0].Name() == "README.md")
}

// cleanupReadmeOnly consumes directory paths from pathChan and removes
// the README-only ones.
func cleanupReadmeOnly(wg *sync.WaitGroup, pathChan <-chan string) {
    defer wg.Done()
    for path := range pathChan {
        if isReadmeOnly(path) {
            os.RemoveAll(path)
        }
    }
}

func main() {
    var wg sync.WaitGroup
    ncores := runtime.NumCPU()
    pathChan := make(chan string)

    // One worker per core, all draining the same channel.
    for i := 0; i < ncores; i++ {
        wg.Add(1)
        go cleanupReadmeOnly(&wg, pathChan)
    }

    err := filepath.WalkDir(".", func(path string, d os.DirEntry, err error) error {
        if err != nil {
            // A worker may have removed this entry mid-walk; skip it
            // instead of dereferencing a possibly nil DirEntry.
            return nil
        }
        if d.IsDir() {
            if d.Name() == ".git" {
                // Never descend into the repository metadata.
                return filepath.SkipDir
            }
            pathChan <- path
        }
        return nil
    })

    if err != nil {
        fmt.Println("Error:", err)
    }

    close(pathChan)
    wg.Wait()
}

And the bash script:

#!/bin/bash

# Remove every directory whose only entry is README or README.md.
cleanup_readme_only() {
  num_cores=$(nproc)
  # Skip .git, and pass each path as a positional argument so names
  # containing spaces or shell metacharacters are handled safely.
  find . -path ./.git -prune -o -type d -print0 |
    xargs -0 -P "$num_cores" -I {} bash -c \
    'if [ "$(ls -A "$1")" = "README" ] || [ "$(ls -A "$1")" = "README.md" ]; \
    then rm -rf "$1"; fi' _ {}
}

cleanup_readme_only
andinus commented 1 year ago

It does improve performance: previously these took 71 and 16 seconds; now they take about 8 and 4 seconds.

andinus@~/d/o/C/perlweeklychallenge-club (master)> time git status > /dev/null
Refresh index: 100% (93480/93480), done.

________________________________________________________
Executed in    8.44 secs    fish           external
   usr time    1.65 secs    0.00 micros    1.65 secs
   sys time   14.08 secs    0.00 micros   14.08 secs

andinus@~/d/o/C/perlweeklychallenge-club (master)> time git status -uno > /dev/null
Refresh index: 100% (93480/93480), done.

________________________________________________________
Executed in    4.34 secs    fish           external
   usr time    1.01 secs    0.00 micros    1.01 secs
   sys time   10.64 secs    0.00 micros   10.64 secs
jo-37 commented 5 months ago

Maybe this issue depends on the workflow in use. In my setup I don't experience such performance issues.

I'm operating on three branches in my fork of perlweeklychallenge-club (master, contrib, and a per-challenge ch-xxx):

I synchronize master and contrib from upstream, then create a new branch ch-xxx from contrib, build the solution there, merge ch-xxx into contrib, push to GitHub, and create a pull request from the contrib branch. I delete ch-xxx after it has been merged into master (and is finalized).

Updates are always fast-forward / incremental this way.
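
Sketched as commands, that flow might look like this (remote names and the ch-123 number are placeholders for the ch-xxx scheme):

git fetch upstream
git checkout master  && git merge --ff-only upstream/master
git checkout contrib && git merge --ff-only upstream/contrib
git checkout -b ch-123 contrib    # per-challenge working branch
# ...build the solution, commit...
git checkout contrib && git merge ch-123
git push origin contrib           # then open the pull request from contrib
git branch -d ch-123              # once merged into master and finalized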