Closed k1nho closed 1 year ago
One thing to consider is we need to allow https://gitlab.com/gitlab/gitlab as well as other cloneable git remote URLs.
We will can add a GitHub validator, but this will lead to the need for adding more known remote URLs.
The above can be added as needed, and does not need to be all scope into the initial validation.
Thanks for opening this feature request!
I think having a validator based on any git service provider is too much of a constraint. Think about how git
works: it can clone repos from any git service provider or even just random file storage systems out on the internet (like S3 or via ssh). Or even clone across the filesystem on locally attached disk space. What happens if I want to process some open source metrics from an open source project that we can't possibly have the foresight to know is out on the web? For example, what if I host some code at code@johncodes.com:my-awesome-opensource-code
? That would fail almost any validator we would put in place.
Additionally, in the future, we may enable local disk mounts to be used with this service. This would work really really well for a local developer flow or in an environment that does not have network access.
In my opinion, if calls to
repo, err := git.PlainOpen(g.path)
or
_, err = git.PlainClone(pathKey, false, &git.CloneOptions{
URL: key,
Tags: git.NoTags,
})
fail, much like normal git clone
or git fetch
might fail, this is validation enough.
Or in otherwords, if it can be cloned and if it can be opened/fetched in code, it's a valid git repo.
This feedback loop is very short and the service doesn't make any type of database request before it clones/opens/fetches a repo. So we shouldn't worry about some invalid data try
What happens if I want to process some open source metrics from an open source project that we can't possibly have the foresight to know is out on the web? For example, what if I host some code at code@johncodes.com:my-awesome-opensource-code? That would fail almost any validator we would put in place.
Ohhh I see I initially only though of Github xD, this makes sense, thanks!
We will can add a GitHub validator, but this will lead to the need for adding more known remote URLs.
@bdougie yep, I now see the issue with that, I think what @jpmcb proposed is the best solution.
Or in otherwords, if it can be cloned and if it can be opened/fetched in code, it's a valid git repo.
Yep, I think this is the best way for validation of the URL. Though is there a better way that does not require clone of the repo itself for remote repositories, something like git ls-remote
equivalent in the go-git libraries, the current order of DB calls in processRepository guarantee a write to the repos table even if the URL is not a valid repo, any ideas ?
The call to write to repos can also be moved after the git clone, then I think validation is not necessary since, it will return if it fails to clone.
Should I open another issue for the URL validation?
Though is there a better way that does not require clone of the repo itself for remote repositories, something like git ls-remote equivalent in the go-git libraries
I think that's a good call. Something that gives us confidence before attempting to clone that the repo is real. As long as it's not an expensive call, I think that would work.
Let's continue the conversation in https://github.com/open-sauced/pizza/issues/29 - I think the solution will be to have some sort of validator that ensures the repo is cloneable, is a git repo, and is a valid url before operations proceed.
Type of feature
🍕 Feature
Current behavior
The current implementation of handleRequest does not validate that a URL actually exists which leads to bad data being stored to the database, i.e when InsertRepository gets called.
It would be great to have a custom validator package that runs through the necessary constraints of the data before doing the database queries. Ideally, this one should be extensible to cover other data types in the future.
Suggested solution
Create a validator package that handles data validation prior to database calls, and returns formatted JSON errors encountered. It should provide the necessary utilities for evaluating validation and sending errors, if any where encountered.
Additional context
No response
Code of Conduct
Contributing Docs