hpjansson / fornalder

Visualize long-term trends in collections of Git repositories.
GNU General Public License v3.0

a workflow to remove a repository after ingestion? #15

Open gasche opened 3 years ago

gasche commented 3 years ago

It has happened to me several times now that I ingest a large set of repositories, I look at the data, and I notice oddities caused by a repository that should not have been there in the first place.

Is there a workflow to remove a repository from the database, and rerun the plotting?

Currently I don't know of such a workflow, so I manually remove the repository, delete the database, and restart ingestion from scratch. This is ok, but it can be annoying when ingestion is slow (several minutes on large repository sets).

I thought about running sqlite on the database and doing a DELETE operation on all raw_commits coming from this directory. However, if I understand correctly, the plotting data comes from the authors table that I would need to update with new aggregates, and I don't know how to do it easily.

Assuming this does not currently exist, my proposal would be to have a command `fornalder reanalyze foo.db` that would drop the current authors table and recompute it from the raw_commits table as it currently exists.

(Another option, of course, would be to have a `fornalder repo-remove foo.db repo.git` command that removes a repository from a table, instead of adding it as `fornalder ingest foo.db repo.git` does. But that sounds like more work.)

hpjansson commented 3 years ago

The authors table gets derived from raw_commits every run, so it should be safe to poke around in the latter. See: https://github.com/hpjansson/fornalder/blob/43f3d4872e52100ade22d2cb29f744a6847c3ac4/src/commitdb.rs#L204-L233

I intended to re-run postprocess() only if something changed (e.g. store a hash of the meta file provided, clear a flag whenever a fornalder command like ingest changes the database), but it wasn't too slow in practice, so I didn't feel the need to optimize it, at least not yet. I left a reminder here:

https://github.com/hpjansson/fornalder/blob/43f3d4872e52100ade22d2cb29f744a6847c3ac4/src/main.rs#L217

Anyway, the bottom line is that manually editing raw_commits is safe, for now.

I like the idea of having a CLI for common database editing (like removing a repo, or maybe a date range). Let's keep this issue open for repo-remove (or remove-repo?).
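
Until such a command exists, the manual edit of raw_commits discussed above could be scripted roughly as follows. This is only a sketch using Python's standard sqlite3 module, not part of fornalder. The column name `repo_name` and the toy table in `demo()` are assumptions made for illustration; the real schema should be checked first, for example with the `list_columns` helper, which reads SQLite's `table_info` pragma.

```python
import sqlite3


def list_columns(conn, table):
    """Return the column names of `table` via SQLite's table_info pragma."""
    # Each pragma row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def remove_repo(conn, repo, table="raw_commits", column="repo_name"):
    """Delete every row of `table` whose `column` equals `repo`.

    `column` is an ASSUMED name, not confirmed by the thread; verify it
    with list_columns() against the real database before deleting.
    Returns the number of rows deleted.
    """
    if column not in list_columns(conn, table):
        raise ValueError(f"{table} has no column named {column!r}")
    with conn:  # commits on success, rolls back if the DELETE fails
        cur = conn.execute(f"DELETE FROM {table} WHERE {column} = ?", (repo,))
    return cur.rowcount


def demo():
    """Run remove_repo on a toy in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_commits (id TEXT, repo_name TEXT, author_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_commits VALUES (?, ?, ?)",
        [
            ("c1", "unwanted.git", "alice"),
            ("c2", "unwanted.git", "bob"),
            ("c3", "wanted.git", "alice"),
        ],
    )
    conn.commit()
    deleted = remove_repo(conn, "unwanted.git")
    remaining = conn.execute("SELECT COUNT(*) FROM raw_commits").fetchone()[0]
    return deleted, remaining
```

Against a real database one would open the file with `sqlite3.connect("foo.db")` instead of the in-memory connection, ideally on a backup copy, and then rerun the plotting. As noted above, the authors table is derived from raw_commits on every run, so it should not need to be edited by hand.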