Closed lemeurherve closed 1 day ago
c.c. @gounthar @Vandit1604 @krisstern
Putting the new data in reports.jenkins.io sounds like a good idea to me.
I was thinking about putting this data in reports.jenkins.io, do you have other suggestions?
Excellent idea. LGTM!
On the infrastructure side, note that the report will have to be generated from trusted.ci. I don't know whether it has everything required to publish to reports.jenkins.io (I would expect so, but it is worth checking before doing any work). The reason is that we do not want to store the SSH credentials in infra.ci.jenkins.io at all (security concern).
trusted.ci.jenkins.io can publish on reports.jenkins.io:
publishReports shared pipeline library function: https://github.com/jenkins-infra/pipeline-library/blob/161c28976a3a09f449bd92bb753bd06e3a8ad640/vars/publishReports.groovy#L11-L13
publishReports: https://github.com/jenkins-infra/pipeline-library/blob/161c28976a3a09f449bd92bb753bd06e3a8ad640/vars/publishReports.groovy#L15
I was wrong in my analysis of how it works. I think this issue can be closed, as all the data is stored in the gh-pages branch of the infra-statistics repository (https://github.com/jenkins-infra/infra-statistics/tree/gh-pages) and is thus already publicly available.
Ex: https://github.com/jenkins-infra/infra-statistics/commit/e5551a16c7c4054b523fe6578b878387654979ee
Sure, I think we have been fetching data from https://github.com/jenkins-infra/infra-statistics/tree/gh-pages/jenkins-stats previously.
Reopening without milestone to discuss how to fetch this data efficiently.
From @shlomomdahan in the GSoC Slack channel:
Is the most efficient way to do this by doing Axios requests to the raw GH page? I noticed it becomes quite slow when fetching data for 1000+ plugins
I suggested that he fetch this data at build time (CI) instead of at run time (client side).
Note about the data currently in infra-statistics used for https://stats.jenkins.io:
As explained in https://github.com/jenkins-infra/helpdesk/issues/4132#issuecomment-2168541420, we can't assess the current traffic of https://stats.jenkins.io, as it is hosted on GitHub Pages.
If this suggestion is pursued, the infra-statistics repository could be added as a git submodule of stats.jenkins.io. The stats.jenkins.io pipeline checkout could then be configured to retrieve its content (and thus the data) alongside the stats.jenkins.io repository content.
This has the advantage of requiring no new infrastructure and only minimal changes to the existing pipeline.
After some tests, and realizing that we can't use the -b (branch) argument in combination with --depth 1 for git submodule add (https://git-scm.com/docs/git-submodule), I propose to change the default branch of infra-statistics from main to gh-pages (with an indication to consult the main branch for the complete README) to avoid having to clone the entire infra-statistics repository as a submodule.
Something like https://github.com/lemeurherve/infra-statistics
WDYT?
Current:
$ git submodule add -b gh-pages -- https://github.com/jenkins-infra/infra-statistics.git
Cloning into '/Users/veve/j-infra/_stats/infra-statistics'...
remote: Enumerating objects: 351731, done.
remote: Counting objects: 100% (64308/64308), done.
remote: Compressing objects: 100% (10302/10302), done.
remote: Total 351731 (delta 54007), reused 64304 (delta 54006), pack-reused 287423
Receiving objects: 100% (351731/351731), 783.99 MiB | 23.88 MiB/s, done.
Resolving deltas: 100% (310805/310805), done.
With my proposal:
$ git submodule add --depth 1 -- https://github.com/lemeurherve/infra-statistics.git
Cloning into '/Users/veve/j-infra/_stats/infra-statistics'...
remote: Enumerating objects: 7996, done.
remote: Counting objects: 100% (7996/7996), done.
remote: Compressing objects: 100% (4307/4307), done.
remote: Total 7996 (delta 4921), reused 5623 (delta 3683), pack-reused 0
Receiving objects: 100% (7996/7996), 31.14 MiB | 16.33 MiB/s, done.
Resolving deltas: 100% (4921/4921), done.
-b & --depth 1 incompatibility (expected in retrospect):
$ git submodule add -b gh-pages --depth 1 -- https://github.com/jenkins-infra/infra-statistics.git
Cloning into '/Users/veve/j-infra/_stats/infra-statistics'...
remote: Enumerating objects: 26, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 26 (delta 1), reused 9 (delta 0), pack-reused 0
Receiving objects: 100% (26/26), 28.75 MiB | 27.93 MiB/s, done.
Resolving deltas: 100% (1/1), done.
fatal: 'origin/gh-pages' is not a commit and a branch 'gh-pages' cannot be created from it
fatal: unable to checkout submodule 'infra-statistics'
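The error above appears consistent with how shallow clones behave: --depth implies --single-branch, so only the remote's default branch is fetched and origin/gh-pages does not exist when the submodule checkout runs. For comparison, a plain (non-submodule) clone can combine both options, because git clone --branch selects which single branch is fetched. A minimal sketch follows; the shallow_clone_branch helper name is hypothetical and this is not what was run in the thread:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): shallow-clone a single branch.
# Usage: shallow_clone_branch <repo-url> <branch> <destination>
shallow_clone_branch() {
    # --depth 1 implies --single-branch; --branch picks which branch that is.
    git clone --quiet --depth 1 --branch "$2" -- "$1" "$3"
}

# Example (fetches only the tip commit of gh-pages):
# shallow_clone_branch https://github.com/jenkins-infra/infra-statistics.git gh-pages infra-statistics
```

This only illustrates the plain-clone case; it does not register the clone as a submodule of the parent repository.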
(example of https://github.com/lemeurherve/infra-statistics moved to the comment above)
I propose to change the default branch of infra-statistics from main to gh-pages (with an indication to consult the main branch for the complete README) to avoid having to clone the entire infra-statistics repository as a submodule.
I have no objection to this. But we will need to make a note in the "README.md" file to make things clear.
If this suggestion is pursued, the infra-statistics repository could be added as a git submodule of stats.jenkins.io. The stats.jenkins.io pipeline checkout could then be configured to retrieve its content (and thus the data) alongside the stats.jenkins.io repository content.
This has the advantage of requiring no new infrastructure and only minimal changes to the existing pipeline.
Good idea to use a submodule!
Given how "annoying" submodules seem to be, git subtree might be a good tool for this: https://www.atlassian.com/git/tutorials/git-subtree
@lemeurherve Do you intend to open a PR to add the submodule shortly? Or are you waiting for us to open the PR, with you as a reviewer? I believe this is currently a blocker for the project, and it may be getting in the way of further progress at this critical stage.
@krisstern I intend to open a PR to deal with that. Should be today or tomorrow.
After discussing the subject with @dduportal & @smerle33 we found an even simpler solution than git submodule or git subtree: retrieving the archive of the desired branch from https://github.com/jenkins-infra/infra-statistics/archive/refs/heads/gh-pages.zip
Preparing the PR.
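For illustration, the archive approach could look roughly like the sketch below. The function names and the destination directory are hypothetical and are not taken from the PR; the sketch also assumes curl and unzip are available on the build agent.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the PR).

# Print the GitHub archive URL for a branch of a repository.
# Usage: branch_archive_url <owner> <repo> <branch>
branch_archive_url() {
    printf 'https://github.com/%s/%s/archive/refs/heads/%s.zip\n' "$1" "$2" "$3"
}

# Download the branch archive and unpack it into a destination directory.
# Usage: fetch_branch_archive <owner> <repo> <branch> <destination>
fetch_branch_archive() {
    url=$(branch_archive_url "$1" "$2" "$3") || return 1
    archive=$(mktemp) || return 1
    # --fail makes curl return non-zero on HTTP errors; --location follows redirects.
    curl --fail --silent --show-error --location --output "$archive" "$url" &&
        unzip -q -o "$archive" -d "$4"
    status=$?
    rm -f "$archive"
    return "$status"
}

# Example:
# fetch_branch_archive jenkins-infra infra-statistics gh-pages ./data
```

GitHub branch archives normally unpack into a single top-level directory named after the repository and branch, so the build would then read the JSON files from inside that directory rather than from the destination root.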
Thanks @lemeurherve! Appreciate it
Thanks once again @lemeurherve!
@krisstern @shlomomdahan it looks to me like this help desk issue can be closed. OK for you?
Yup, okay for me to close this issue.
Service(s)
stats.jenkins.io
Summary
To build its content, the new GSoC project https://github.com/jenkins-infra/stats.jenkins.io needs to retrieve the data generated when https://stats.jenkins.io is built on trusted.ci.jenkins.io (pipeline defined in https://github.com/jenkins-infra/infra-statistics).
How infra-statistics currently works:
To allow https://github.com/jenkins-infra/stats.jenkins.io to fetch this data, we need to add a step to publish these generated JSON files in a public location.
Note: this data is already public and published at https://stats.jenkins.io.
I was thinking about putting this data in reports.jenkins.io, do you have other suggestions?
Ref:
Reproduction steps
No response