Feature: Custom validator to validate data before database queries

k1nho commented 1 year ago

Type of feature

🍕 Feature

Current behavior

The current implementation of handleRequest does not validate that a URL actually exists which leads to bad data being stored to the database, i.e when InsertRepository gets called.

func (p PizzaOvenServer) handleRequest(w http.ResponseWriter, r *http.Request) {
    if r.Method != http.MethodPost {
        p.Logger.Errorf("Received request with invalid method: %v", r.Body)
        http.Error(w, "Invalid request method, expected post", http.StatusMethodNotAllowed)
        return
    }

    var data reqData
    err := json.NewDecoder(r.Body).Decode(&data)
    if err != nil {
        p.Logger.Errorf("Could not decode request json body: %v with error: %v", r.Body, err)
        http.Error(w, "Could not decode request body", http.StatusBadRequest)
        return
    }
        // no validation
    err = p.processRepository(data.URL)
    if err != nil {
        p.Logger.Errorf("Could not process repository input: %v with error: %v", r.Body, err)
        http.Error(w, "Could not process input", http.StatusInternalServerError)
        return
    }
}

It would be great to have a custom validator package that runs through the necessary constraints of the data before doing the database queries. Ideally, this one should be extensible to cover other data types in the future.

Additional context

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Contributing Docs

[X] I agree to follow this project's Contribution Docs

bdougie commented 1 year ago

One thing to consider is we need to allow https://gitlab.com/gitlab/gitlab as well as other cloneable git remote URLs.

We will can add a GitHub validator, but this will lead to the need for adding more known remote URLs.

Remote Git Provider URLs in need of matchers

github
gitlab
bitbucket
AWS (? What is that called)
githubenterprise
Probably Linux distros (not familiar with that pattern)

The above can be added as needed, and does not need to be all scope into the initial validation.

jpmcb commented 1 year ago

Thanks for opening this feature request!

I think having a validator based on any git service provider is too much of a constraint. Think about how git works: it can clone repos from any git service provider or even just random file storage systems out on the internet (like S3 or via ssh). Or even clone across the filesystem on locally attached disk space. What happens if I want to process some open source metrics from an open source project that we can't possibly have the foresight to know is out on the web? For example, what if I host some code at code@johncodes.com:my-awesome-opensource-code? That would fail almost any validator we would put in place.

Additionally, in the future, we may enable local disk mounts to be used with this service. This would work really really well for a local developer flow or in an environment that does not have network access.

In my opinion, if calls to

repo, err := git.PlainOpen(g.path)

or

_, err = git.PlainClone(pathKey, false, &git.CloneOptions{
    URL:  key,
    Tags: git.NoTags,
})

fail, much like normal git clone or git fetch might fail, this is validation enough.

Or in otherwords, if it can be cloned and if it can be opened/fetched in code, it's a valid git repo.

This feedback loop is very short and the service doesn't make any type of database request before it clones/opens/fetches a repo. So we shouldn't worry about some invalid data try

k1nho commented 1 year ago

What happens if I want to process some open source metrics from an open source project that we can't possibly have the foresight to know is out on the web? For example, what if I host some code at code@johncodes.com:my-awesome-opensource-code? That would fail almost any validator we would put in place.

Ohhh I see I initially only though of Github xD, this makes sense, thanks!

We will can add a GitHub validator, but this will lead to the need for adding more known remote URLs.

@bdougie yep, I now see the issue with that, I think what @jpmcb proposed is the best solution.

Or in otherwords, if it can be cloned and if it can be opened/fetched in code, it's a valid git repo.

Yep, I think this is the best way for validation of the URL. Though is there a better way that does not require clone of the repo itself for remote repositories, something like git ls-remote equivalent in the go-git libraries, the current order of DB calls in processRepository guarantee a write to the repos table even if the URL is not a valid repo, any ideas ?

The call to write to repos can also be moved after the git clone, then I think validation is not necessary since, it will return if it fails to clone.

Should I open another issue for the URL validation?

jpmcb commented 1 year ago

Though is there a better way that does not require clone of the repo itself for remote repositories, something like git ls-remote equivalent in the go-git libraries

I think that's a good call. Something that gives us confidence before attempting to clone that the repo is real. As long as it's not an expensive call, I think that would work.

Let's continue the conversation in https://github.com/open-sauced/pizza/issues/29 - I think the solution will be to have some sort of validator that ensures the repo is cloneable, is a git repo, and is a valid url before operations proceed.

open-sauced / pizza