italia / developers-italia-api

API for the developers.italia.it public software collection
https://api.developers.italia.it
GNU Affero General Public License v3.0

Improvements of exposed data #215

Closed claudiu-cristea closed 8 months ago

claudiu-cristea commented 10 months ago

I'm missing some information that would allow me to fully understand an entry from https://api.developers.italia.it/v1/software:

The code platform

In the case of self-hosted GitLab or Bitbucket software, it's difficult to identify the underlying technology. Is it GitLab or Bitbucket? Which API should I use if I want to fetch more info about that project? This info is not part of the publiccodeYml blob either. But https://github.com/italia/publiccode-crawler knows this information, and it would be nice to have it exposed at the same level as id, url, publiccodeYml, etc. For instance:

{
  "data": [
    {
      "id": "9c07d69d-f66e-44b7-93ff-f19509e47dcf",
      "platformType": "gitlab",
      "url": "https://riuso.comune.salerno.it/root/simel_2.git",
      ...
    },
    ...
  ],
  ...
}

Of course, platformType could be any of github, gitlab, bitbucket.

The project's full path

Let's take a hypothetical case: a self-hosted GitLab project with this URL: https://example.com/base/path/group1/group2/group3/project.git. Note that the GitLab instance is installed at https://example.com/base/path (in a subdirectory, relative to the domain).

If a consumer of the https://api.developers.italia.it/v1/software API wants to determine the project's full path (namespace and project) by extracting it from the URL, they will fail, because the URL alone is misleading. Most probably they will assume that everything that comes after the host is the project's full path, i.e. base/path/group1/group2/group3/project.

But this is wrong, as the project's namespace is group1/group2/group3. Again, this information is also missing from the publiccodeYml blob.
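A minimal Python sketch of the pitfall described above (not taken from the thread; the helper name `naive_full_path` is hypothetical). It treats everything after the host as the project path, which is exactly the assumption that breaks when GitLab is installed under a subdirectory:

```python
from urllib.parse import urlparse


def naive_full_path(repo_url: str) -> str:
    """Naive assumption: everything after the host is the project's full path."""
    path = urlparse(repo_url).path.strip("/")
    # Drop the conventional ".git" suffix, if present.
    return path[:-4] if path.endswith(".git") else path


url = "https://example.com/base/path/group1/group2/group3/project.git"
# Includes the "base/path" installation prefix, so it is not the real
# project path (group1/group2/group3/project).
print(naive_full_path(url))
```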

I think this info should be exposed. Something like:

    {
      "id": "9c07d69d-f66e-44b7-93ff-f19509e47dcf",
      "fullPath": "group1/group2/group3/project",
      "url": "https://example.com/base/path/group1/group2/group3/project.git",
      ...
    },

Moreover, as far as I can see, this info is already available and exposed by the /software/{softwareId}/logs path. With it, a consumer can derive the base URL of the self-hosted GitLab/Bitbucket platform.
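As a rough illustration of that derivation, here is a Python sketch that assumes the entry carried the proposed (and so far hypothetical) `fullPath` field next to `url`. The function name `derive_base_url` is made up for this example:

```python
def derive_base_url(repo_url: str, full_path: str) -> str:
    """Strip the project's full path from the repo URL to get the platform base URL.

    `full_path` stands in for the proposed `fullPath` field, which the API
    does not expose today.
    """
    trimmed = repo_url.rstrip("/")
    if trimmed.endswith(".git"):
        trimmed = trimmed[:-4]
    suffix = "/" + full_path.strip("/")
    if not trimmed.endswith(suffix):
        raise ValueError("full_path is not a suffix of repo_url")
    return trimmed[: -len(suffix)]


base = derive_base_url(
    "https://example.com/base/path/group1/group2/group3/project.git",
    "group1/group2/group3/project",
)
print(base)
```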

bfabio commented 10 months ago

Hey @claudiu-cristea,

First off, thanks for bringing this up! It's always great to see community members contributing ideas to improve the project. I've got a few reservations on the proposed changes:

I think it all revolves around the fact that you'd like to know the base URL and the API type in order to query it, but I believe the task of detecting the API or the underlying technology might be better suited to a dedicated library. For instance, in publiccode-crawler we use go-vcsurl for this purpose. It is limited to GitHub, GitLab (cloud or self-hosted) and Bitbucket, but can be expanded.

I think it's fair to think of developers-italia-api as being agnostic: it knows where the software is, but makes no assumptions about how you access it.

This way, instead of relying on a potentially outdated platformType, we can ensure real-time accuracy, providing insights into the hosting platform, its version, and other relevant metadata. This approach not only reduces the manual overhead, but also eliminates the need for clients to be up to date with our hypothetical symbolic names AND for them to implement the actual logic.

On top of that, most projects are on github.com or gitlab.com, so just by looking at the URL, we can tell where they are from.
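For the common case mentioned above, a host lookup is enough. The following Python sketch is only an illustration (the `KNOWN_HOSTS` table and `platform_from_host` are hypothetical names, not part of any project discussed here), and it deliberately returns nothing for self-hosted instances:

```python
from typing import Optional
from urllib.parse import urlparse

# Well-known public hosts; self-hosted instances need a different strategy.
KNOWN_HOSTS = {
    "github.com": "github",
    "gitlab.com": "gitlab",
    "bitbucket.org": "bitbucket",
}


def platform_from_host(repo_url: str) -> Optional[str]:
    """Return the platform name for well-known hosts, or None if unknown."""
    host = (urlparse(repo_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return KNOWN_HOSTS.get(host)


print(platform_from_host("https://github.com/italia/developers-italia-api"))
print(platform_from_host("https://riuso.comune.salerno.it/root/simel_2.git"))
```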

Can you maybe reuse publiccode-crawler, or adapt it for your needs?

I hope these points resonate with your thoughts. If there's anything else in the proposed change that might need attention, I'd be happy to discuss further. Let's keep the collaboration going!

claudiu-cristea commented 10 months ago

@bfabio,

Thank you for the reply. A few remarks:

I was looking at https://github.com/alranel/go-vcsurl, and I'm thinking of a very similar approach to guess the API from the URL. Given all your points, I understand that exposing the API type is not going to be supported here. I see most of them as valid points. Some may be debatable but, yes, that's it: I can live with maintaining my own API guesser.

I was looking at the code from https://github.com/alranel/go-vcsurl (though I have zero Go knowledge), but I can't find an answer to my 2nd point: detecting the code hosting platform URL. Or the other way around: detecting the project's full path from the full URL. Of course, I'm referring to the situation where a GitLab instance is located not at the host root, but in a subdirectory. It seems to me that the code assumes that the code platform is installed directly under the host (which, I agree, is the case most of the time). Maybe I'm missing something?

bfabio commented 9 months ago

> I was looking at the code from https://github.com/alranel/go-vcsurl (though I have zero Go knowledge), but I can't find an answer to my 2nd point: detecting the code hosting platform URL. Or the other way around: detecting the project's full path from the full URL. Of course, I'm referring to the situation where a GitLab instance is located not at the host root, but in a subdirectory. It seems to me that the code assumes that the code platform is installed directly under the host (which, I agree, is the case most of the time). Maybe I'm missing something?

@claudiu-cristea you're spot on about go-vcsurl and publiccode-crawler currently assuming the code platform is right under the host. This is due to historical and practical reasons: namely, we never had to deal with that scenario :)

The thought was to potentially extend go-vcsurl or a similar library to manage these cases. It doesn't have to be go-vcsurl specifically; another library could work just as well. The key is to centralize the logic for detecting platform URLs and extracting project paths, making it more adaptable to different setups.

publiccode-crawler and/or go-vcsurl need to be extended anyway if we want to support new code hosting platforms (https://github.com/italia/publiccode-crawler/issues/132) or plain git URLs (https://github.com/italia/publiccode-crawler/issues/196).

I'm not an expert with PHP, but I think there is a way to load a Go library and call it with FFI?

claudiu-cristea commented 8 months ago

@bfabio, thank you for clarifying, and sorry for the late feedback.

We took a slightly different approach because we're using code hosting platform plugins (e.g. a GitHub plugin). Each plugin knows how to determine whether it can handle a given URL. We then cache the result, so next time we know which API to use.
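The thread does not show the actual implementation, so the following Python sketch only illustrates the general shape of such a design: each plugin is a name plus a "can I handle this URL?" check, and the first match is cached. All names (`PlatformResolver`, `host_is`) are hypothetical, and the checks here are simple host comparisons rather than real API probes:

```python
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urlparse

# A plugin is modelled as (platform name, predicate deciding if it handles a URL).
Plugin = Tuple[str, Callable[[str], bool]]


class PlatformResolver:
    def __init__(self, plugins: List[Plugin]) -> None:
        self._plugins = plugins
        self._cache: Dict[str, Optional[str]] = {}
        self.checks_run = 0  # counts plugin checks, to make caching observable

    def resolve(self, repo_url: str) -> Optional[str]:
        """Return the name of the first plugin that accepts the URL, caching the answer."""
        if repo_url in self._cache:
            return self._cache[repo_url]
        result: Optional[str] = None
        for name, can_handle in self._plugins:
            self.checks_run += 1
            if can_handle(repo_url):
                result = name
                break
        self._cache[repo_url] = result
        return result


def host_is(expected: str) -> Callable[[str], bool]:
    """Build a trivial predicate that matches on the URL's host."""
    return lambda url: (urlparse(url).hostname or "").lower() == expected


resolver = PlatformResolver(
    [("github", host_is("github.com")), ("gitlab", host_is("gitlab.com"))]
)
print(resolver.resolve("https://gitlab.com/group/project"))
```

A real plugin would likely do something more involved than a host comparison, which is where caching pays off.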

We also solved the "GitLab installed under a sub-dir" case by performing some additional HTTP requests, but only in the non-standard case.

Thank you again for the support. Closing this issue.
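The comment does not say which requests were made. One possible approach, sketched in Python under that caveat, is to try each way of splitting the URL path into "installation prefix" and "project path", starting with the standard case (no prefix), and to ask the candidate instance whether the project exists. The probe is injected as a callable (`project_exists`, hypothetical) so the sketch runs without network access; a real probe could call GitLab's `GET /api/v4/projects/:id` endpoint, which accepts a URL-encoded project path:

```python
from typing import Callable, Iterator, Optional, Tuple
from urllib.parse import quote, urlparse


def candidate_splits(repo_url: str) -> Iterator[Tuple[str, str]]:
    """Yield (base_url, project_path) pairs, standard case (no prefix) first."""
    parsed = urlparse(repo_url)
    path = parsed.path.strip("/")
    if path.endswith(".git"):
        path = path[:-4]
    segments = [s for s in path.split("/") if s]
    origin = f"{parsed.scheme}://{parsed.netloc}"
    # A project path needs at least a namespace and a project name.
    for i in range(len(segments) - 1):
        yield "/".join([origin] + segments[:i]), "/".join(segments[i:])


def gitlab_project_api_url(base_url: str, project_path: str) -> str:
    """Build the GitLab v4 project endpoint for a URL-encoded project path."""
    return f"{base_url}/api/v4/projects/{quote(project_path, safe='')}"


def find_gitlab_base(
    repo_url: str, project_exists: Callable[[str, str], bool]
) -> Optional[Tuple[str, str]]:
    """Return the first (base_url, project_path) the probe confirms, else None."""
    for base_url, project_path in candidate_splits(repo_url):
        if project_exists(base_url, project_path):
            return base_url, project_path
    return None


# Fake probe for illustration: pretend GitLab lives at /base/path.
def fake_probe(base_url: str, project_path: str) -> bool:
    return base_url == "https://example.com/base/path"


found = find_gitlab_base(
    "https://example.com/base/path/group1/group2/group3/project.git", fake_probe
)
print(found)
```

Because the standard split is tried first, the extra requests only happen when the instance is not at the host root.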

bfabio commented 8 months ago

@claudiu-cristea nice to know that approach makes sense, the plugins are kinda like the scanners in publiccode-crawler