src-d / go-git

Project has been moved to: https://github.com/go-git/go-git
Apache License 2.0

monorepo / concurrent foreach / repo processing pipelines #915

Closed: roscopecoltran closed this issue 6 years ago

roscopecoltran commented 6 years ago

Hi guys,

Hope you are all well!

I just have 3 quick questions:

1. Monorepo, scalable VCS filesystem

In your view, what components are missing to build a mono-repository management system on top of go-git.v4 and go-billy.v4? A distributed filesystem (e.g. seaweedfs)? A full-text index of the source code (e.g. zoekt)?

Ref:

2. Concurrent/Thread-safe iterator over worktree

My two other questions are connected to the first one, as I would like to pre/post-process the source code of all my repos (>300 repos) with hookable plugins. This would include queues, workers, and plugins triggered by filename, path, or content matching (e.g. gitleaks). It sounds pretty much like your other awesome projects, such as borges or hercules. :-)

The pre-requisite is to have a faster and concurrent filepath Walker with a callback registry.
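To make the walker idea concrete, here is a minimal sketch of a concurrent walker with a callback registry. It uses only the Go standard library (`io/fs`, with `testing/fstest.MapFS` standing in for an in-memory filesystem) rather than go-billy's API, and `Registry`, `WalkConcurrent`, `collectGoFiles`, and `demoFS` are hypothetical names chosen for illustration, not anything that exists in go-git or go-billy.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"sort"
	"sync"
	"testing/fstest"
)

// Callback processes one file; it must be safe to call from several goroutines.
type Callback func(name string, data []byte)

// Registry (hypothetical) maps a glob pattern, matched against the base name
// with path.Match, to the callbacks registered for it.
type Registry struct {
	mu    sync.RWMutex
	hooks map[string][]Callback
}

func NewRegistry() *Registry { return &Registry{hooks: make(map[string][]Callback)} }

func (r *Registry) Register(pattern string, cb Callback) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.hooks[pattern] = append(r.hooks[pattern], cb)
}

// match returns every callback whose pattern matches the file's base name.
func (r *Registry) match(name string) []Callback {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []Callback
	for pattern, cbs := range r.hooks {
		if ok, _ := path.Match(pattern, path.Base(name)); ok {
			out = append(out, cbs...)
		}
	}
	return out
}

// WalkConcurrent walks fsys from root in one goroutine and hands each file
// path to a fixed pool of workers that read the file and run the callbacks.
func WalkConcurrent(fsys fs.FS, root string, workers int, reg *Registry) error {
	if workers < 1 {
		workers = 1
	}
	paths := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths {
				cbs := reg.match(p)
				if len(cbs) == 0 {
					continue
				}
				data, err := fs.ReadFile(fsys, p)
				if err != nil {
					continue // a real tool would report this error
				}
				for _, cb := range cbs {
					cb(p, data)
				}
			}
		}()
	}
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			paths <- p
		}
		return nil
	})
	close(paths)
	wg.Wait()
	return err
}

// demoFS returns a small read-only in-memory tree used by the example.
func demoFS() fs.FS {
	return fstest.MapFS{
		"main.go":     {Data: []byte("package main")},
		"README.md":   {Data: []byte("# demo")},
		"pkg/util.go": {Data: []byte("package pkg")},
	}
}

// collectGoFiles registers a "*.go" callback and returns the sorted matches.
func collectGoFiles(fsys fs.FS) []string {
	var mu sync.Mutex
	var seen []string
	reg := NewRegistry()
	reg.Register("*.go", func(name string, _ []byte) {
		mu.Lock()
		seen = append(seen, name)
		mu.Unlock()
	})
	if err := WalkConcurrent(fsys, ".", 4, reg); err != nil {
		return nil
	}
	sort.Strings(seen)
	return seen
}

func main() {
	fmt.Println(collectGoFiles(demoFS()))
}
```

The directory traversal itself stays sequential here; only reading and processing are spread across workers, which is usually where the time goes. Callbacks run concurrently, so any state they share needs its own locking, as `collectGoFiles` shows.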

Also, I noticed that memfs from go-billy.v4 is not thread-safe. I will push a code example, but just try running your Go script with the -race ... arg and you will see lots of data races if you use memfs. I focus on a memory-based filesystem because I want to gain performance and only index/process the content of some starred or owned repositories.
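One common way to avoid such races is to guard the shared in-memory state with a mutex. The following is only a sketch of that locking pattern: `SafeStore` and `fillConcurrently` are hypothetical names, and this is not go-billy's memfs API or a patch for it.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeStore is a hypothetical in-memory file store guarded by a RWMutex.
// It only illustrates the locking pattern; it is not go-billy's memfs.
type SafeStore struct {
	mu    sync.RWMutex
	files map[string][]byte
}

func NewSafeStore() *SafeStore { return &SafeStore{files: make(map[string][]byte)} }

// Write stores a private copy of data under name.
func (s *SafeStore) Write(name string, data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.files[name] = append([]byte(nil), data...)
}

// Read returns a copy of the stored bytes so callers cannot mutate shared state.
func (s *SafeStore) Read(name string) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	data, ok := s.files[name]
	return append([]byte(nil), data...), ok
}

// Len reports how many files are stored.
func (s *SafeStore) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.files)
}

// fillConcurrently writes n files from n goroutines into one store.
func fillConcurrently(n int) *SafeStore {
	s := NewSafeStore()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Write(fmt.Sprintf("file-%d.txt", i), []byte("content"))
		}(i)
	}
	wg.Wait()
	return s
}

func main() {
	s := fillConcurrently(100)
	fmt.Println(s.Len())
}
```

Running this with `go run -race` should report no races, whereas the same program with the mutex removed would. A single lock is the simplest choice; per-file locks would reduce contention at the cost of more bookkeeping.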

So my question is: where would it be better to add this concurrent dir walker, in go-git.v4 or in go-billy.v4? In a new go-billy plugin or in the commit_worktree.go files? And would the answer be the same if I wanted to add an fsnotify/filesystem watcher plugin on a repo opened with go-git.v4?

Can you share any recommendations on doing this properly in a forked version of the packages?

Examples:

3. Node-based processing pipeline

I would like to create a system with built-in or shared plugins that triggers a pipeline of tasks based on text-based rules while iterating over files from a repository opened/cloned/forked with go-git.v4.
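As a rough illustration of rule-triggered tasks, here is a minimal sketch in which each rule matches on path and/or content via regular expressions and runs its tasks on a match. `Rule`, `Task`, `RunPipeline`, and `demoRules` are hypothetical names, and the secret-key pattern is only an example; none of this is part of go-git or go-billy.

```go
package main

import (
	"fmt"
	"regexp"
)

// Task is a hypothetical processing step; it returns a short report line.
type Task func(path string, content []byte) string

// Rule triggers its tasks when both matchers accept the file.
type Rule struct {
	Name    string
	Path    *regexp.Regexp // nil means "any path"
	Content *regexp.Regexp // nil means "any content"
	Tasks   []Task
}

// Matches reports whether the file satisfies every non-nil matcher.
func (r Rule) Matches(path string, content []byte) bool {
	if r.Path != nil && !r.Path.MatchString(path) {
		return false
	}
	if r.Content != nil && !r.Content.Match(content) {
		return false
	}
	return true
}

// RunPipeline evaluates the rules in order and runs the tasks of each match.
func RunPipeline(rules []Rule, path string, content []byte) []string {
	var results []string
	for _, r := range rules {
		if !r.Matches(path, content) {
			continue
		}
		for _, t := range r.Tasks {
			results = append(results, r.Name+": "+t(path, content))
		}
	}
	return results
}

// demoRules builds two illustrative rules: one on content, one on path.
func demoRules() []Rule {
	return []Rule{
		{
			Name:    "secrets",
			Content: regexp.MustCompile(`AKIA[0-9A-Z]{16}`),
			Tasks: []Task{func(p string, _ []byte) string {
				return "possible key in " + p
			}},
		},
		{
			Name: "go-files",
			Path: regexp.MustCompile(`\.go$`),
			Tasks: []Task{func(p string, c []byte) string {
				return fmt.Sprintf("%s has %d bytes", p, len(c))
			}},
		},
	}
}

func main() {
	content := []byte("key = AKIAABCDEFGHIJKLMNOP")
	for _, line := range RunPipeline(demoRules(), "config.go", content) {
		fmt.Println(line)
	}
}
```

Rules are plain data here, so they could be loaded from a config file or registered by plugins; `RunPipeline` itself has no shared state and can be called from concurrent workers.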

An example of UI: Pipeline Video

NB: it could be managed by an FBP-based (flow-based programming) system like noflow or nodered. The video above is only meant to illustrate the task-flow management interface for each specific scenario.

Examples of post/pre processing tools:

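So my last questions are:

Have you already planned to create such a pipeline/dataflow manager, as you are processing lots of repositories?

Should I create a plugin/built-ins registry in go-billy?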

Thanks for any insights about the questions above ^^.

Thanks for your time guys.

Cheers, Rosco

smola commented 6 years ago

Hi @roscopecoltran!

You are welcome to join the #go-git channel on our Slack to discuss any of these issues.

I'll try to answer some:

In your view, what components are missing to build a mono-repository management system on top of go-git.v4 and go-billy.v4? A distributed filesystem (e.g. seaweedfs)? A full-text index of the source code (e.g. zoekt)?

I have no idea. We haven't thought about building a mono-repository management system, and I don't have a good understanding of the problems to be solved in that space. But if you have more concrete questions, we may have had similar use cases.

The pre-requisite is to have a faster and concurrent filepath Walker with a callback registry.

Could you elaborate a bit more?

I would like to create a system with built-in or shared plugins that triggers a pipeline of tasks based on text-based rules while iterating over files from a repository opened/cloned/forked with go-git.v4.

Sounds great, although that falls completely outside the scope of go-git. We are happy to help with whatever can be done in go-git to support this use case.

Have you already planned to create such a pipeline/dataflow manager, as you are processing lots of repositories?

We currently have no plans to do such a thing.

Should I create a plugin/built-ins registry in go-billy?

I don't know what plugins would do in go-billy, since it is not a framework. Implementations are already pluggable. But feel free to share more details about the use case and we can consider it.