Closed claudiu-cristea closed 1 year ago
Hey @claudiu-cristea,
First off, thanks for bringing this up! It's always great to see community members contributing ideas to improve the project. I've got a few reservations on the proposed changes:
Introducing strings as symbolic names (github, gitea, etc.) for code hosting platforms might add an extra layer of complexity. It means we'd have to maintain a list of these symbolic names and ensure they're consistent across the board.
Different versions of the APIs or software behind code hosting platforms like Gitea, GitLab, etc., can pose a challenge. How do we ensure compatibility and handle discrepancies between versions? We'd need more symbolic names for each of them (e.g. gitlab-v3, gitlab-v4, etc.).
If a code hosting platform isn't supported, would we default to "other"? This seems like it would bring us back to the initial problem, where we're not providing enough clarity.
Projects sometimes migrate from one code hosting platform to another for various reasons. If a project switches its hosting, the platformType might not reflect the current state until the next crawl. This lag could lead to misinformation or confusion.
I think it all comes down to wanting to know the base URL and the API type in order to query it, but I believe the task of detecting the API or the underlying technology is better suited to a dedicated library. For instance, in publiccode-crawler we use go-vcsurl for this purpose, which is currently limited to GitHub, GitLab (cloud or self-hosted) and Bitbucket, but can be extended.
I think it's fair to think of developers-italia-api as being agnostic: it knows where the software is, but makes no assumptions about how you access it.
This way, instead of relying on a potentially outdated platformType, we can ensure real-time accuracy, providing insights into the hosting platform, its version, and other relevant metadata. This approach not only reduces the manual overhead, but also eliminates the need for clients to keep up to date with our hypothetical symbolic names AND to implement the actual logic themselves.
On top of that, most projects are on github.com or gitlab.com, so just by looking at the URL, we can tell where they are from.
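To illustrate that last point, here is a minimal sketch of a host-based lookup. It is written in Python for brevity and is not taken from go-vcsurl or publiccode-crawler; the KNOWN_HOSTS table and the guess_platform name are hypothetical. Self-hosted instances simply fall through to None, which is exactly the case that needs a smarter detector.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping for illustration only; not the go-vcsurl implementation.
KNOWN_HOSTS = {
    "github.com": "github",
    "gitlab.com": "gitlab",
    "bitbucket.org": "bitbucket",
}


def guess_platform(repo_url: str) -> Optional[str]:
    """Return a platform name for well-known cloud hosts, or None otherwise."""
    host = (urlparse(repo_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return KNOWN_HOSTS.get(host)
```

A self-hosted URL such as https://example.com/base/path/group1/project.git cannot be resolved this way, so a caller would have to treat None as "needs further probing".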
Can you maybe reuse publiccode-crawler, or adapt it for your needs?
I hope these points resonate with your thoughts. If there's anything else in the proposed change that might need attention, I'd be happy to discuss further. Let's keep the collaboration going!
@bfabio,
Thank you for the reply. A few remarks:
I was looking at https://github.com/alranel/go-vcsurl and I'm thinking of a very similar approach to guess the API from the URL. Given all your points, I understand that exposing the API type is not going to be supported here. I see most of them as valid points. Maybe some are debatable but, yes, that's it, I can live with maintaining my own API guesser.
I was looking at the code from https://github.com/alranel/go-vcsurl (though I have zero Go knowledge) but I can't find an answer to my 2nd point: detecting the code hosting platform URL. Or the other way around: detecting the project's full path from the full URL. Of course, I'm referring to the situation where a GitLab instance is located not at the host root, but in a subdirectory. It seems to me that the code assumes that the code platform is installed directly under the host (which, I agree, is the case most of the time). Maybe I'm missing something?
@claudiu-cristea you're spot on about go-vcsurl and publiccode-crawler currently assuming the code platform is right under the host. This is due to historical and practical reasons: namely, we never had to deal with that scenario :)
The thought was to potentially extend go-vcsurl or a similar library to manage these cases. It doesn't have to be go-vcsurl specifically; another library could work just as well. The key is to centralize the logic for detecting platform URLs and extracting project paths, making it more adaptable to different setups.
publiccode-crawler and/or go-vcsurl have to be extended regardless if we want new code hosting platforms (https://github.com/italia/publiccode-crawler/issues/132) or plain git URLs (https://github.com/italia/publiccode-crawler/issues/196)
I'm not an expert with PHP, but I think there is a way to load a Go library and call it with FFI?
@bfabio, thank you for clarifying, and sorry for the late feedback.
We took a slightly different approach because we're using code hosting platform plugins (e.g. a GitHub plugin). Each plugin knows how to determine whether it is responsible for handling a given URL. We then cache the result, so next time we know which API to use.
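The plugin-plus-cache idea can be sketched roughly as follows. This is an illustrative Python sketch, not the actual implementation: the HostingPlugin, GitHubPlugin and PluginResolver names are hypothetical, and caching per URL (rather than per host) and not caching misses are assumptions.

```python
from typing import Dict, Iterable, Optional, Protocol
from urllib.parse import urlparse


class HostingPlugin(Protocol):
    """Hypothetical plugin interface: each plugin decides if a URL is its own."""

    name: str

    def handles(self, repo_url: str) -> bool: ...


class GitHubPlugin:
    name = "github"

    def handles(self, repo_url: str) -> bool:
        return urlparse(repo_url).hostname == "github.com"


class PluginResolver:
    """Asks each plugin in turn and remembers the first one that claims the URL."""

    def __init__(self, plugins: Iterable[HostingPlugin]) -> None:
        self._plugins = list(plugins)
        self._cache: Dict[str, str] = {}  # assumption: keyed by full URL

    def resolve(self, repo_url: str) -> Optional[str]:
        if repo_url in self._cache:
            return self._cache[repo_url]
        for plugin in self._plugins:
            if plugin.handles(repo_url):
                self._cache[repo_url] = plugin.name
                return plugin.name
        return None  # assumption: misses are not cached
```

With this shape, a plugin whose handles() check is expensive (for example, one that performs HTTP requests) is only consulted once per URL.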
We also solved the "GitLab installed under a sub-dir" case by performing some additional HTTP requests, but only when we hit the non-standard case.
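One way such a fallback could look, as a hedged Python sketch: try the host root first, then progressively longer path prefixes, and stop at the first candidate that a probe accepts. The find_base_url function and the probe callable are hypothetical; a real probe would presumably issue an HTTP request against an API path under the candidate, and which endpoint to use is left open here.

```python
from typing import Callable, Optional
from urllib.parse import urlparse


def find_base_url(repo_url: str, probe: Callable[[str], bool]) -> Optional[str]:
    """Return the shortest URL prefix accepted by probe, or None if none is.

    probe is a hypothetical callable that reports whether a code hosting
    platform answers at the given base URL.
    """
    parts = urlparse(repo_url)
    root = f"{parts.scheme}://{parts.netloc}"
    segments = [s for s in parts.path.split("/") if s]
    # Keep at least two trailing segments (namespace and project) out of the
    # base URL; depth 0 is the host root, i.e. the standard case.
    for depth in range(len(segments) - 1):
        candidate = "/".join([root] + segments[:depth])
        if probe(candidate):
            return candidate
    return None
```

In the standard case the very first probe succeeds, so extra requests are only spent on instances installed under a subdirectory.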
Thank you again for the support. Closing this issue.
@claudiu-cristea nice to know that approach makes sense, the plugins are kinda like the scanners in publiccode-crawler
I'm missing some information that would allow me to fully understand an entry from https://api.developers.italia.it/v1/software:
The code platform
In the case of self-hosted GitLab or Bitbucket software it's difficult to understand the underlying technology. Is it GitLab or Bitbucket? What API should I use if I want to fetch more info about that project? This info is not part of the publiccodeYml blob either. But https://github.com/italia/publiccode-crawler knows this information and it would be nice to have it exposed on the same level as id, url, publiccodeYml, etc. For instance:
Of course, platformType could be any of github, gitlab, bitbucket.
The project's full path
Let's take a hypothetical case, a self-hosted GitLab project with this URL: https://example.com/base/path/group1/group2/group3/project.git. Note that the GitLab instance is installed at https://example.com/base/path (in a subdirectory, relative to the domain).
If a consumer of the https://api.developers.italia.it/v1/software API wants to determine the project's full path (namespace and project) by extracting it from the URL, they will fail, because the URL alone is misleading. Most probably they will assume that everything that comes after the host is the project's full path:
base/path/group1/group2/group3
project
But this is wrong, as the project's namespace is group1/group2/group3. Again, this information is also missing from the publiccodeYml blob.
I think this info should be exposed, something like:
Moreover, this info is already available and exposed, as I see, by the /software/{softwareId}/logs path. In this way, a consumer understands how to derive the base URL of the GitLab/Bitbucket self-hosted platform.
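To make the difference concrete, here is a minimal Python sketch of how a consumer could split the example URL once the platform's base URL is known. The split_project_path helper is hypothetical and only illustrates the arithmetic on the paths; it is not part of any existing API.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_project_path(repo_url: str, base_url: str) -> Tuple[str, str]:
    """Split repo_url into (namespace, project), relative to base_url."""
    repo_path = urlparse(repo_url).path
    base_path = urlparse(base_url).path.rstrip("/")
    if base_path and not repo_path.startswith(base_path + "/"):
        raise ValueError(f"{repo_url} is not located under {base_url}")
    relative = repo_path[len(base_path):].strip("/")
    if relative.endswith(".git"):
        relative = relative[: -len(".git")]
    namespace, _, project = relative.rpartition("/")
    return namespace, project


url = "https://example.com/base/path/group1/group2/group3/project.git"
# Naive assumption: the platform lives at the host root.
print(split_project_path(url, "https://example.com"))
# With the real base URL of the GitLab instance.
print(split_project_path(url, "https://example.com/base/path"))
```

The first call yields the misleading namespace base/path/group1/group2/group3, while the second yields group1/group2/group3, which is why the base URL (or the full path itself) needs to be known.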