hpjansson / fornalder

Visualize long-term trends in collections of Git repositories.
GNU General Public License v3.0

dropping the "lines changed" measurement for performance reasons? #13

Closed: gasche closed this issue 3 years ago

gasche commented 3 years ago

Thanks again for fornalder, it's a very interesting tool.

I was wondering if you had opinions about the idea of dropping the "lines changed" measurement from the tool, along with the related interface options. Two different aspects make me think this could be a good idea:

  1. It's not a very good metric: in practice I find that the "lines changed" count is extremely noisy and not useful for studying repositories over time. It gets worse when you aggregate different repositories of a community, which may have different practices. (One way to phrase this: I was never able to successfully interpret the results of --unit changes in an interesting way for the repositories I was looking at.)

  2. Performance: if we didn't have to compute this information at ingest time, we could ask git to clone the repository without downloading file contents, just the history. This can be done with partial clones; I think git clone --filter=blob:none would work. This would massively decrease the cloning time needed to start the analysis, which takes up a substantial part of the total analysis time.

Concrete example: for a long list of repositories I wanted to look at, cloning everything in parallel takes about 30 minutes, and ingesting takes another 30 minutes (presumably this could be parallelized). Doing blob:none partial clones completes in 3 minutes. (The difference could be arbitrarily large for repositories that require downloading large amounts of data.)
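Roughly, the parallel partial-clone step can look like this (repos.txt is a placeholder file with one clone URL per line, and the parallelism level is arbitrary):

# Clone eight repositories at a time; blob:none fetches commits and
# trees but skips file contents.
xargs -P 8 -n 1 git clone --bare --filter=blob:none < repos.txt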

One could of course consider adding a --no-changes option to the tool that would not collect this information and would be compatible with those partial clones, possibly even recording this fact in the database so as to fail properly later if users try to plot changes. But:

  • This could be invasive, while removing changes altogether makes the script simpler, which is good.

  • Git partial clones don't fail nicely if you ask for changes (or anything blob-related) after the fact; they try to re-download the data on demand and the performance is terrible (I've observed one request per commit when asking for git log --stat). This means that if we fail to protect some blob-dependent part of the codebase, the user experience is going to be horrible. It would be easier if we could just globally enforce that only the history, not the patches, is accessed by the fornalder ingestion logic.

hpjansson commented 3 years ago
  1. It's not a very good metric: in practice I find that the "lines changed" count is extremely noisy and not useful for studying repositories over time. It gets worse when you aggregate different repositories of a community, which may have different practices. (One way to phrase this: I was never able to successfully interpret the results of --unit changes in an interesting way for the repositories I was looking at.)

Agreed. I've used it a bit to look for outliers in the data, but in general it's not very useful.

  2. Performance: if we didn't have to compute this information at ingest time, we could ask git to clone the repository without downloading file contents, just the history. This can be done with partial clones; I think git clone --filter=blob:none would work. This would massively decrease the cloning time needed to start the analysis, which takes up a substantial part of the total analysis time.

In the manual use case, you typically clone once followed by multiple ingest/plot runs, and repositories can be updated incrementally after that, so I didn't consider it a big performance issue. Additionally, not all servers (e.g. plain HTTP transports) support --filter.

That said, maybe you're writing a wrapper script that clones, performs the analysis and then cleans up after itself -- partial clones could save a lot of resources there if the script is going to be run regularly. And on limited/metered connections it could be a life saver.
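A minimal sketch of such a wrapper, assuming an ingest invocation along those lines (fornalder's actual command-line syntax may differ):

# Clone without blobs, ingest into a database, then clean up.
tmp=$(mktemp -d)
git clone --bare --filter=blob:none https://git.example.com/project.git "$tmp/project.git"
fornalder ingest project.db "$tmp/project.git"  # assumed invocation
rm -rf "$tmp"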

One could of course consider adding a --no-changes option to the tool that would not collect this information and would be compatible with those partial clones, possibly even recording this fact in the database so as to fail properly later if users try to plot changes. But:

  • This could be invasive, while removing changes altogether makes the script simpler, which is good.

  • Git partial clones don't fail nicely if you ask for changes (or anything blob-related) after the fact; they try to re-download the data on demand and the performance is terrible (I've observed one request per commit when asking for git log --stat). This means that if we fail to protect some blob-dependent part of the codebase, the user experience is going to be horrible. It would be easier if we could just globally enforce that only the history, not the patches, is accessed by the fornalder ingestion logic.

One of the recurring comments on my GNOME project analysis is that translation commits are fairly numerous there and are seen as orthogonal to code contributions. They are also easy to separate mechanically based on file names (po/ subdir, .po extension). Similar arguments could be made for build/release engineering, documentation, icons, etc.

Keeping the --stat would allow us to filter the commits or assign cohorts based on file type (by extension or regexps over file names/paths, perhaps). It might also be useful in tracking efforts to switch to a new programming language (e.g. from C to C++ or Rust -- GCC did the former, and there are several projects experimenting with the latter).
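As a rough sketch, that kind of mechanical separation could work from the file lists alone, e.g. flagging pure-translation commits (and since --name-only only needs tree data, in principle this particular query works even in a blob:none clone):

# Print hashes of commits that touch only translation files
# (paths under a po/ directory or with a .po extension).
git log --name-only --format='@%H' | awk '
  /^@/ { if (id != "" && files > 0 && trans) print id
         id = substr($0, 2); trans = 1; files = 0; next }
  NF   { files++
         if ($0 !~ /(^|\/)po\// && $0 !~ /\.po$/) trans = 0 }
  END  { if (id != "" && files > 0 && trans) print id }'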

So... I like the --no-changes idea, even with the downsides. Maybe call it --no-blobs as eventually it'll affect more than just the change statistic.

An attractive idea would be to detect filtered repositories on ingest and suppress --stat automatically. Then we wouldn't need a flag at all. We'd probably have to warn that this was happening, and especially when finding a mix of filtered and unfiltered repositories, to avoid silently generating plausible-looking but incorrect plots down the line.

gasche commented 3 years ago

Ah, the point about commit classification is well taken. So maybe this idea of getting rid of the changes measurement is not a good move.

My main reason for not keeping my clones around is disk space. I ran the analysis on all OCaml projects available on opam, the language-specific package manager; that's 2461 projects that together consume 12 GiB of disk space, while the resulting sqlite database fits in 205 MiB.

I'm closing this issue for now: a good rationale for the current state has been proposed, and in any case I don't have any short-term plans to work on it. If there is interest again we can always reopen.

hpjansson commented 3 years ago

@gasche I went ahead and implemented the automatic --stat suppression -- please let me know if it improves things.

gasche commented 3 years ago

Thanks! That looks useful indeed. I didn't know about "promisor" remotes, so I'm glad you did the work :-)

One minor thing I wondered about when reading your code: I just tested locally, and for standard repositories (not partial clones), git config remote.origin.promisor prints no output and returns 1. I hope that no failure happens within the script in this case -- I suppose not, or it would break all the time.
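Concretely, the behavior I see is (the -C path is a placeholder):

# In a partial clone this prints "true" and exits 0; in an ordinary
# clone it prints nothing and exits with status 1.
git -C /path/to/repo config remote.origin.promisor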

If we want this feature to be useful to people other than you and me, it would be nice to have some documentation, for example in the README, on which cloning modes are supported. It could mention that bare and non-bare repositories are supported, and that --filter=blob:none can be used to massively reduce data transfer if (1) the server supports it and (2) the user does not need changed-lines measurements or --stat metadata.

gasche commented 3 years ago

Maybe a proposal (I could turn it into a PR but I think it's probably faster for you to just edit it to your liking):

Tips

We support ingestion from both bare and non-bare repositories:

git clone https://git.example.com
git clone --bare https://git.example.com

Another option that can massively reduce data transfer is --filter=blob:none, which clones the commit history and tree metadata but not the file contents themselves.

git clone --bare --filter=blob:none https://git.example.com

Note that not all git servers support this (cloning may fail), and that it prevents the use of the --unit changes option (counting the number of changed lines) and, more generally, inspecting --stat output from the commit database.

hpjansson commented 3 years ago

Looks good, I'd take a PR for that :)

gasche commented 3 years ago

I'm playing with partial clones now and I'm very happy with the result.

One aspect I had not anticipated is that ingesting data is also much faster for those repositories (possibly just due to the overall performance gain of having a smaller git repository). On one specific repository, for example, ingestion time went from 25 s to 5 s. On a much larger set of repositories (some of which come from servers that do not support --filter=blob:none), it went from 30 minutes to 8 minutes.

So thanks!