Store jenkins-infra/infra-statistics data in a public location

lemeurherve commented 1 week ago

Service(s)

stats.jenkins.io

Summary

To build its content, the new GSoC project https://github.com/jenkins-infra/stats.jenkins.io needs to retrieve the data generated when https://stats.jenkins.io is built on trusted.ci.jenkins.io (pipeline defined in https://github.com/jenkins-infra/infra-statistics).

How infra-statistics currently works:

It retrieves data from two VMs via rsync
- https://github.com/jenkins-infra/infra-statistics/blob/8209b583ee5be409d2dc8cdf780bd4358579b8de/Jenkinsfile#L50-L53
It ingests this data in a local MongoDB database
- https://github.com/jenkins-infra/infra-statistics/blob/8209b583ee5be409d2dc8cdf780bd4358579b8de/Jenkinsfile#L62-L66
Several Groovy scripts transform this data into the desired format(s) and generate JSON files
- https://github.com/jenkins-infra/infra-statistics/blob/8209b583ee5be409d2dc8cdf780bd4358579b8de/Jenkinsfile#L75-L86
These JSON files are used to generate the static HTML content of https://stats.jenkins.io

To allow https://github.com/jenkins-infra/stats.jenkins.io to fetch this data, we need to add a step that publishes these generated JSON files to a public location.

Note: this data is already public and published at https://stats.jenkins.io.

I was thinking about putting this data in reports.jenkins.io, do you have other suggestions?

Ref:

https://github.com/jenkins-infra/helpdesk/issues/4132#issuecomment-2191538603

Reproduction steps

No response

krisstern commented 1 week ago

c.c. @gounthar @Vandit1604 @krisstern

krisstern commented 1 week ago

Putting the new data in reports.jenkins.io sounds like a good idea to me

dduportal commented 1 week ago

I was thinking about putting this data in reports.jenkins.io, do you have other suggestions?

Excellent idea. LGTM!

On the infrastructure side, note that the report will have to be generated from trusted.ci: I don't know if it has everything required to publish to reports.jenkins.io (I would expect so, but worth checking before doing any work). The reason is that we do not want to store the SSH accesses in infra.ci.jenkins.io at all (security concern).

lemeurherve commented 1 week ago

trusted.ci.jenkins.io can publish on reports.jenkins.io:

It is one of the authorized instances of the publishReports shared pipeline library function: https://github.com/jenkins-infra/pipeline-library/blob/161c28976a3a09f449bd92bb753bd06e3a8ad640/vars/publishReports.groovy#L11-L13
It has the credentials used by publishReports: https://github.com/jenkins-infra/pipeline-library/blob/161c28976a3a09f449bd92bb753bd06e3a8ad640/vars/publishReports.groovy#L15

lemeurherve commented 1 week ago

I was wrong in my analysis of how it works. I think this issue can be closed, as all the data is stored in the gh-pages branch of the infra-statistics repository (https://github.com/jenkins-infra/infra-statistics/tree/gh-pages) and is thus already publicly available.

Ex: https://github.com/jenkins-infra/infra-statistics/commit/e5551a16c7c4054b523fe6578b878387654979ee

krisstern commented 1 week ago

Sure, I think we have been fetching data from https://github.com/jenkins-infra/infra-statistics/tree/gh-pages/jenkins-stats previously.

lemeurherve commented 6 days ago

Reopening without milestone to discuss how to fetch this data efficiently.

lemeurherve commented 6 days ago

From @shlomomdahan in the GSoC Slack channel:

Is the most efficient way to do this by doing Axios requests to the raw GH page? I noticed it becomes quite slow when fetching data for 1000+ plugins

lemeurherve commented 6 days ago

I suggested that he fetch this data at build time (CI) instead of at run time (client side).
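
To illustrate the build-time idea, here is a minimal shell sketch, not the project's actual build step. It assumes the per-plugin JSON files are already present in the CI workspace and merges them into a single JSON object keyed by file name, so the client could load one file instead of issuing 1000+ requests. The function name `bundle_json` and the paths in the usage line are hypothetical, and the sketch assumes that file names need no JSON escaping and that each file holds valid JSON.

```shell
# Hypothetical helper: merge every *.json file of a directory into one
# JSON object on stdout, using each file name (without .json) as the key.
bundle_json() {
  dir=$1
  first=1
  printf '{'
  for f in "$dir"/*.json; do
    # Skip the unexpanded pattern when the directory has no JSON file.
    [ -e "$f" ] || continue
    name=$(basename "$f" .json)
    # Separate entries with a comma, except before the first one.
    [ "$first" -eq 1 ] || printf ','
    first=0
    printf '"%s":' "$name"
    cat "$f"
  done
  printf '}\n'
}
```

A CI step could then run something like `bundle_json data/plugins > public/plugins.json` (both paths are placeholders) before the site is built.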

Note about the data currently in infra-statistics used for https://stats.jenkins.io:

8400 files
200 MB
Manually updated about once per month

As explained in https://github.com/jenkins-infra/helpdesk/issues/4132#issuecomment-2168541420, we can't assess the current traffic of https://stats.jenkins.io, as it's hosted on GitHub Pages.

lemeurherve commented 6 days ago

If this suggestion is pursued, the infra-statistics repository could be added as a git submodule of stats.jenkins.io, then the stats.jenkins.io pipeline checkout could be configured to retrieve its content (and thus the data) alongside the stats.jenkins.io repository content.

This has the advantage of requiring no new infrastructure, and minimal change to the existing pipeline.

lemeurherve commented 5 days ago

After some tests and realizing that we can't use the -b (branch) argument in combination with --depth 1 for git submodule add (https://git-scm.com/docs/git-submodule), I propose to change its default branch from main to gh-pages (with an indication to consult the main branch for the complete README) to avoid having to clone the entire infra-statistics repository as submodule.

Something like https://github.com/lemeurherve/infra-statistics

WDYT?

Comparison

Current:

$ git submodule add -b gh-pages -- https://github.com/jenkins-infra/infra-statistics.git
Cloning into '/Users/veve/j-infra/_stats/infra-statistics'...
remote: Enumerating objects: 351731, done.
remote: Counting objects: 100% (64308/64308), done.
remote: Compressing objects: 100% (10302/10302), done.
remote: Total 351731 (delta 54007), reused 64304 (delta 54006), pack-reused 287423
Receiving objects: 100% (351731/351731), 783.99 MiB | 23.88 MiB/s, done.
Resolving deltas: 100% (310805/310805), done.

With my proposal:

$ git submodule add --depth 1 -- https://github.com/lemeurherve/infra-statistics.git
Cloning into '/Users/veve/j-infra/_stats/infra-statistics'...
remote: Enumerating objects: 7996, done.
remote: Counting objects: 100% (7996/7996), done.
remote: Compressing objects: 100% (4307/4307), done.
remote: Total 7996 (delta 4921), reused 5623 (delta 3683), pack-reused 0
Receiving objects: 100% (7996/7996), 31.14 MiB | 16.33 MiB/s, done.
Resolving deltas: 100% (4921/4921), done.

-b & --depth 1 incompatibility (expected in retrospect):

$ git submodule add -b gh-pages --depth 1 -- https://github.com/jenkins-infra/infra-statistics.git
Cloning into '/Users/veve/j-infra/_stats/infra-statistics'...
remote: Enumerating objects: 26, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 26 (delta 1), reused 9 (delta 0), pack-reused 0
Receiving objects: 100% (26/26), 28.75 MiB | 27.93 MiB/s, done.
Resolving deltas: 100% (1/1), done.
fatal: 'origin/gh-pages' is not a commit and a branch 'gh-pages' cannot be created from it
fatal: unable to checkout submodule 'infra-statistics'
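
A likely explanation for this last failure: according to the git-clone documentation, `--depth` implies `--single-branch`, so the shallow clone only fetches the default branch and `origin/gh-pages` never exists locally. Outside of submodules, a plain clone can request the branch directly. The following is a minimal sketch of that alternative, with a hypothetical helper name. It is not what the submodule command does.

```shell
# Hypothetical helper: shallow-clone only the given branch of a repository.
# $1: repository URL, $2: branch name, $3: destination directory
shallow_clone_branch() {
  repo_url=$1
  branch=$2
  dest=$3
  # --branch selects the branch to check out; --depth 1 keeps only its
  # latest commit; --single-branch (already implied by --depth) is spelled
  # out here for clarity.
  git clone --depth 1 --single-branch --branch "$branch" -- "$repo_url" "$dest"
}
```

For example, `shallow_clone_branch https://github.com/jenkins-infra/infra-statistics.git gh-pages infra-statistics` would fetch only the tip of gh-pages, without changing the repository's default branch.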

lemeurherve commented 5 days ago

(example of https://github.com/lemeurherve/infra-statistics moved to the comment above)

krisstern commented 5 days ago

I propose to change its default branch from main to gh-pages (with an indication to consult the main branch for the complete README) to avoid having to clone the entire infra-statistics repository as submodule.

I have no objection to this. But we will need to make a note in the "README.md" file to make things clear.

dduportal commented 5 days ago

If this suggestion is pursued, the infra-statistics repository could be added as a git submodule of stats.jenkins.io, then the stats.jenkins.io pipeline checkout could be configured to retrieve its content (and thus the data) alongside the stats.jenkins.io repository content.

This has the advantage of requiring no new infrastructure, and minimal change to the existing pipeline.

Good idea to use a submodule!

Given how "annoying" submodules seem to be, git subtree might be a good tool for this: https://www.atlassian.com/git/tutorials/git-subtree

krisstern commented 3 days ago

@lemeurherve Do you intend to open a PR for adding the submodule shortly? Or are you waiting for us to take the initiative to open a PR for this and to have you as a reviewer? I believe this is currently a blocker for the project, and it may be getting in the way of further progress at this critical stage.

lemeurherve commented 2 days ago

@krisstern I intend to open a PR to deal with that. Should be today or tomorrow.

lemeurherve commented 2 days ago

After discussing the subject with @dduportal & @smerle33 we found an even simpler solution than git submodule or git subtree: retrieving the archive of the desired branch from https://github.com/jenkins-infra/infra-statistics/archive/refs/heads/gh-pages.zip
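
As a rough illustration of the archive approach (a sketch only, not necessarily what the PR does), the following shell functions build the archive URL and download and unpack it. The function names are hypothetical, and the sketch assumes `curl` and `unzip` are available on the build agent.

```shell
# Hypothetical helper: print the GitHub archive URL of a branch.
# $1: owner/repository, $2: branch name
archive_url() {
  printf 'https://github.com/%s/archive/refs/heads/%s.zip\n' "$1" "$2"
}

# Hypothetical helper: download a branch archive and unpack it.
# $1: owner/repository, $2: branch name, $3: destination directory
fetch_branch_archive() {
  repo=$1
  branch=$2
  dest=$3
  tmp=$(mktemp) || return 1
  # --fail turns HTTP errors into a non-zero exit status;
  # --location follows the redirect GitHub issues for archive URLs.
  if curl --fail --silent --show-error --location \
       --output "$tmp" "$(archive_url "$repo" "$branch")" &&
     mkdir -p "$dest" &&
     unzip -q "$tmp" -d "$dest"; then
    status=0
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}
```

A pipeline step could call `fetch_branch_archive jenkins-infra/infra-statistics gh-pages data` (the `data` directory is a placeholder). GitHub archives usually unpack into a single top-level directory named after the repository and branch, so the build would need to read the files from there.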

Preparing the PR.

krisstern commented 2 days ago

Thanks @lemeurherve! Appreciate it

lemeurherve commented 2 days ago

PR opened: https://github.com/jenkins-infra/stats.jenkins.io/pull/67

krisstern commented 2 days ago

Thanks once again @lemeurherve!

lemeurherve commented 1 day ago

@krisstern @shlomomdahan it looks to me like this help desk issue can be closed. OK for you?

krisstern commented 1 day ago

Yup, okay for me to close this issue.
