Support mailbox import from public-inbox (lore.kernel.org mailing list archives)

bulwahn commented 6 years ago

So first, here is my expectation from a user perspective:

My linux.cfg would be defined as follows:

PROJECT_NAME = Linux
### REPOSITORIES
REPOSITORY_URL_torvalds = git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
REPOSITORY_URL_stable = git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
REPOSITORY_URL_linux-next = git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

### MAILBOXES
PUBLIC_INBOX_URL_lkml = git://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/lkml/[0-9].git
PUBLIC_INBOX_URL_selinux = git://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/selinux/0.git
...
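To make the proposed layout concrete, here is a minimal sketch of how such a file could be read. It assumes the `KEY = value` syntax and the `REPOSITORY_URL_<name>` / `PUBLIC_INBOX_URL_<name>` prefixes exactly as written above; the function name `parse_project_config` and the returned dictionary layout are hypothetical and not part of PaStA.

```python
# Sketch only: parses the config layout proposed in this comment.
# parse_project_config and the result structure are illustrative assumptions.

REPO_PREFIX = "REPOSITORY_URL_"
INBOX_PREFIX = "PUBLIC_INBOX_URL_"


def parse_project_config(text):
    """Split 'KEY = value' lines into repositories, public inboxes and other options."""
    config = {"repositories": {}, "public_inboxes": {}, "options": {}}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, '#' comment/section lines and anything without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith(REPO_PREFIX):
            config["repositories"][key[len(REPO_PREFIX):]] = value
        elif key.startswith(INBOX_PREFIX):
            config["public_inboxes"][key[len(INBOX_PREFIX):]] = value
        else:
            config["options"][key] = value
    return config


SAMPLE = """\
PROJECT_NAME = Linux
### REPOSITORIES
REPOSITORY_URL_torvalds = git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
REPOSITORY_URL_stable = git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

### MAILBOXES
PUBLIC_INBOX_URL_selinux = git://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/selinux/0.git
"""

if __name__ == "__main__":
    parsed = parse_project_config(SAMPLE)
    for name, url in parsed["repositories"].items():
        print(f"repository {name}: {url}")
    for name, url in parsed["public_inboxes"].items():
        print(f"public inbox {name}: {url}")
```

The sketch keeps a pattern such as `[0-9].git` as a plain string; how epoch patterns would be expanded is left open here.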

Now, I would expect that pasta init takes that information into account, clones all repositories, initializes all the PaSta-related caches, and does the analysis. For that purpose, it needs to understand the data in the public inbox git repositories.

And pasta update updates all repositories and re-runs the analysis on the new data in the git repository and public inbox repositories.

@rralf I am just formulating and collecting the ideas here; there is no need for you to immediately implement that. Your expert opinion is of course welcome.

@mszczepankiewicz This might be a good first task to understand the PaStA sources and implement a first valuable feature. What do you think?

A related follow-up task would be to provide a step-by-step tutorial on how to set up the system, run the initial analysis, redo the analysis when the repositories update, and access the information through the web frontend.

rralf commented 6 years ago

Yeah, I totally agree with your recommendation. I admit that Pasta's usage is anything but straightforward… Pasta was developed as a tool to quantify mainlining efforts of out-of-tree developments, such as Preempt_RT, vendor trees, … Later, I added the mailbox feature. Now it's a happy mixture of both.

However, I like the idea of specifying multiple repositories / mailboxes in the config file of a project. Currently, only one repository can be specified per project, while multiple mailboxes are already supported. I didn't know about the kernel's public inboxes, thanks for mentioning them! They have a pretty curious way of storing things, though. Supporting them should require only little effort. .oO(I could think of some helper scripts to either unroll them to a Pasta-suitable format, or the other way round: convert conventional mboxes to the git format, and support those. Heavily depends on performance.)
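The "unroll" direction could look roughly like the sketch below. It assumes the public-inbox v2 layout, in which each commit of an epoch repository (`0.git`, `1.git`, …) carries one message as a blob named `m`; the helper names `iter_raw_messages` and `write_mbox` are hypothetical, and nothing here reflects how PaStA actually imports mail.

```python
# Sketch only: unroll one public-inbox epoch repository into a conventional mbox.
# Assumes the v2 layout (one message per commit, stored as blob 'm').
import mailbox
import subprocess


def iter_raw_messages(repo_path):
    """Yield raw message bytes from an epoch repository, oldest commit first."""
    commits = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--reverse", "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    for commit in commits:
        blob = subprocess.run(
            ["git", "-C", repo_path, "show", f"{commit}:m"],
            capture_output=True,
        )
        # Commits without an 'm' blob (e.g. deletions) are skipped.
        if blob.returncode == 0:
            yield blob.stdout


def write_mbox(raw_messages, mbox_path):
    """Append raw messages to an mbox file and return how many were written."""
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for raw in raw_messages:
            box.add(raw)
            count += 1
        box.flush()
    finally:
        box.close()
    return count


if __name__ == "__main__":
    import sys
    # Usage: python unroll.py /path/to/lkml/0.git lkml-0.mbox
    print(write_mbox(iter_raw_messages(sys.argv[1]), sys.argv[2]))
```

Spawning one `git show` per commit is slow on large archives, which is the performance concern raised above; a batched reader (for example via `git cat-file --batch`) would likely be needed in practice.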

As you mentioned, a simple pasta init + pasta update would be really valuable! It should be possible to substitute the init / update with a sequence of pasta commands, so this should probably be the last step we target.

With the big picture in mind, these would be the more fine-grained steps:

1. Config files are currently called resources/project/project.cfg, which is a bit redundant. Let's rename them consistently to resources/project/config. The active configuration is currently linked to Pasta's root directory. It's probably easier to select the active configuration with a simple config file in the Pasta root directory that points to the active configuration (e.g., echo linux > config or pasta select linux). Implemented on next in 08aba0b4c17b88fec6.
2. For analysis, Pasta currently comes with a -mbox switch to turn on mailbox analysis mode. Let's get rid of this switch, and specify the mailbox mode in the global project configuration.
3. PaStA has several analysis modes (representative, successive and upstream mode). Representative mode compares representative patches of different equivalence classes (in order to merge them later), upstream mode compares representatives against upstream, and successive is a special mode that compares only 'successive' releases of patch stacks (e.g., two successive versions of Preempt_RT). I will have to rethink those modes; maybe we can somehow simplify this.
4. Here comes the fun part: Add support for multiple upstream repositories in the configuration file. For each remote, allow specifying the branches of interest. A vast majority of commits across different remotes and/or branches will be the same, so we only need to track the mapping commit -> remote/branch. Or maybe we do it on-the-fly? Don't know yet. How do we deal with remotes that are frequently rebased, reordered or restructured (e.g., linux-next or maintainer trees)? We need to discuss this. (BTW, this is the reason why I only analyse Linus' master tree, where commits are stable.)
5. Allow specifying multiple mbox dumps in the configuration. This is currently done via a command line option (pasta mbox_add). Additionally, store some meta-information on the mbox dump (checksum?) in order to automatically detect if the dump contains new emails.
6. Add support for LKML's public inboxes.
7. Rethink pasta subcommands, and add some layers of convenience, as Lukas suggested.
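Two of the ideas above, the commit -> remote/branch mapping and the checksum stored per mbox dump, can be sketched in a few lines. This is only an illustration under assumptions: the function names, the choice of SHA-256 and the in-memory data layout are all hypothetical, and the question of rebased remotes is not addressed.

```python
# Sketch only: illustrative helpers, not PaStA code.
import hashlib
from collections import defaultdict


def index_commits(commits_by_ref):
    """Invert {'remote/branch': [commit, ...]} into {commit: {'remote/branch', ...}}.

    Commits shared by several remotes or branches are stored once and
    simply accumulate more than one ref.
    """
    mapping = defaultdict(set)
    for ref, commits in commits_by_ref.items():
        for commit in commits:
            mapping[commit].add(ref)
    return mapping


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dump_changed(path, stored_checksum):
    """True if the mbox dump no longer matches the checksum recorded earlier."""
    return file_checksum(path) != stored_checksum
```

A changed checksum only says that the dump differs from the recorded state; deciding which emails are new would still require comparing message IDs.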

What do you think?

Thanks, Ralf

rralf commented 5 years ago

There's a (still WIP) implementation on next, up to step 6. Public inboxes were added as subrepositories to the pasta-resources repo. There's a new command pasta sync, which consolidates pasta cache and pasta mbox_add. If invoked with -mbox, it will automatically update the public inboxes, synchronise indexes and forward caches.

I think this is the right way to make things easier, but there's still a lot to do for an easier workflow, so I'm leaving this issue open for the moment.

Thanks, Ralf

rralf commented 5 years ago

Let's close this issue. Public inboxes are supported, and pasta sync makes updating repositories and mailboxes and creating caches easier.

lfd / PaStA

Support mailbox import from public-inbox (lore.kernel.org mailing list archives) #3