open-services-group / community

This repository handles a few common things, it is mainly used by our bots...
GNU General Public License v3.0

[5pt] As a Thoth Guidance Service user, I want to figure out which repo/org I want to create the service for, and to collect GitHub data for the respective repos. #173

Closed suppathak closed 2 years ago

suppathak commented 2 years ago

Persona / User

Thoth Guidance Service User

Reason

Related to #147

Define Done

suppathak commented 2 years ago

/assign @oindrillac
/assign @chauhankaranraj
/assign @suppathak

oindrillac commented 2 years ago

The thoth-station/support repo doesn't have enough PRs for us to train this model on.

Thus, the two options we are considering for retraining the model are:

  1. Aggregating data from all repos in the thoth-station org and training the model on that. (Con: the org has a lot of repos, some of which may not be meaningful for this use case.)
  2. Finding a subset of repos from the org that will be meaningful to train on.

@Gkrumbach07 can you suggest a list of repos from the thoth-station org which could be representative of the contributor or maintainer behaviors of the thoth-station/support repo? We can think of excluding some repos like:

Another question we had: since this service is trying to help Thoth Guidance Service users, should we exclude bot PRs from the training data that we feed into the model?

Gkrumbach07 commented 2 years ago

Currently the support repo isn't used that often, and the issues that do get resolved don't always have an attached PR. So instead of time to merge a PR, we could also model time to close an issue. We can use the lifecycle labels to create a more accurate timeline too.
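A minimal sketch of the time-to-close target mentioned above, assuming issue records shaped like the GitHub REST API response, where `created_at` and `closed_at` are ISO 8601 UTC strings. The helper names `parse_ts` and `time_to_close_hours` are hypothetical, and lifecycle labels are not modeled here.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_ts(ts: str) -> datetime:
    """Parse a GitHub-style UTC timestamp such as '2022-05-31T14:03:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def time_to_close_hours(issue: dict) -> Optional[float]:
    """Return hours between creation and close, or None if the issue is still open."""
    closed_at = issue.get("closed_at")
    if not closed_at:
        # Open issues have no target value yet, so they are skipped.
        return None
    delta = parse_ts(closed_at) - parse_ts(issue["created_at"])
    return delta.total_seconds() / 3600.0


# Hypothetical example record (not real data from any repo).
example_issue = {
    "created_at": "2022-01-01T00:00:00Z",
    "closed_at": "2022-01-02T12:00:00Z",
}
hours = time_to_close_hours(example_issue)
```

For the example record, `hours` is `36.0`. An issue with `closed_at` set to `None` yields `None`, so it can be dropped before training.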

I agree that we can exclude bot-made PRs and issues, so we would keep only issues that don't have a bot label.

As for which repos to train on, there is no definitive list of the repos that external users use to get support. Often they open an issue in whichever repo is related to their problem. I would choose the repos that have been updated in the last year, maybe, or filter by number of stars, or by number of issues created by non-Thoth-org accounts.

oindrillac commented 2 years ago

Thanks for the information and the suggestion @Gkrumbach07.

We can explore a Time to Close an Issue model as a spike and see if that's something we can deliver. The approach would be similar to the Time to Merge model, but we would need to perform EDA, engineer features, and train a model specific to this use case. Opening an issue for this.

> I agree that we can exclude bot-made PRs and issues, so issues that don't have a bot label.

We will exclude these from our model development. As for the repos, we will start by filtering a list of repos using the criteria you mentioned earlier.
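A rough sketch of the bot exclusion, assuming issue/PR records shaped like the GitHub REST API response (a `labels` list of objects with a `name`, and a `user` object with a `type`). The label name `"bot"` and the function names are assumptions; the thread does not state the exact label used.

```python
from typing import Iterable, List


def is_bot_item(item: dict, bot_labels: frozenset = frozenset({"bot"})) -> bool:
    """Return True if the issue/PR carries a bot label or was opened by a Bot account."""
    label_names = {label["name"].lower() for label in item.get("labels", [])}
    author_type = (item.get("user") or {}).get("type")
    return bool(label_names & bot_labels) or author_type == "Bot"


def drop_bot_items(items: Iterable[dict]) -> List[dict]:
    """Keep only the items that do not look bot-made."""
    return [item for item in items if not is_bot_item(item)]


# Hypothetical records for illustration only.
items = [
    {"number": 1, "labels": [{"name": "bot"}], "user": {"login": "a", "type": "User"}},
    {"number": 2, "labels": [{"name": "kind/bug"}], "user": {"login": "b", "type": "User"}},
    {"number": 3, "labels": [], "user": {"login": "some-app[bot]", "type": "Bot"}},
]
kept_numbers = [item["number"] for item in drop_bot_items(items)]
```

Checking the author type in addition to the label is a design choice of this sketch: it catches bot-opened items that were never labeled.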

suppathak commented 2 years ago

Related https://github.com/aicoe-aiops/ocp-ci-analysis/issues/489

oindrillac commented 2 years ago

> I would choose the repos that have been updated in the last year maybe, or by number of stars, or by number of issues created by non Thoth org account.

We used all 3 suggested filtering criteria (https://github.com/aicoe-aiops/ocp-ci-analysis/pull/490). Here is the list of 105 repos, out of all the repos in the thoth-station org, that we will include in our training data.

{'thoth-station/.github', 
 'thoth-station/adviser', 
 'thoth-station/aicoe-ci-pulp-upload-example', 
 'thoth-station/amun-api', 
 'thoth-station/amun-client', 
 'thoth-station/analyzer', 
 'thoth-station/ansible-role-argo-workflows', 
 'thoth-station/build-watcher', 
 'thoth-station/buildlog-parser', 
 'thoth-station/cleanup-job', 
 'thoth-station/cli-examples', 
 'thoth-station/common', 
 'thoth-station/core', 
 'thoth-station/cve-update-job', 
 'thoth-station/datasets', 
 'thoth-station/dependency-monkey', 
 'thoth-station/dependency-monkey-zoo', 
 'thoth-station/document-sync-job', 
 'thoth-station/fext', 
 'thoth-station/glyph', 
 'thoth-station/graph-backup-job', 
 'thoth-station/graph-refresh-job', 
 'thoth-station/graph-sync-job', 
 'thoth-station/help', 
 'thoth-station/httpd-aicoe-container', 
 'thoth-station/image-pusher', 
 'thoth-station/init-job', 
 'thoth-station/integration-tests', 
 'thoth-station/invectio', 
 'thoth-station/investigator', 
 'thoth-station/jupyter-nbrequirements', 
 'thoth-station/jupyterlab-requirements', 
 'thoth-station/jupyternb-build-pipeline', 
 'thoth-station/kebechet', 
 'thoth-station/lab', 
 'thoth-station/license-solver', 
 'thoth-station/management-api', 
 'thoth-station/messaging', 
 'thoth-station/metrics-exporter', 
 'thoth-station/mi', 
 'thoth-station/mi-scheduler', 
 'thoth-station/micropipenv', 
 'thoth-station/moldavite-api', 
 'thoth-station/notebooks', 
 'thoth-station/osiris', 
 'thoth-station/osiris-build-observer', 
 'thoth-station/package-analyzer', 
 'thoth-station/package-extract', 
 'thoth-station/package-releases-job', 
 'thoth-station/package-update-job', 
 'thoth-station/prescriptions', 
 'thoth-station/ps-cv', 
 'thoth-station/ps-ip', 
 'thoth-station/ps-nlp', 
 'thoth-station/pulp-metrics-exporter', 
 'thoth-station/pulp-operate-first-web', 
 'thoth-station/python', 
 'thoth-station/python-ssdeep', 
 'thoth-station/qeb-hwt', 
 'thoth-station/ray-ml-notebook', 
 'thoth-station/ray-ml-worker', 
 'thoth-station/ray-operator', 
 'thoth-station/report-processing', 
 'thoth-station/reporter', 
 'thoth-station/revsolver', 
 'thoth-station/s2i', 
 'thoth-station/s2i-generic-data-science-notebook', 
 'thoth-station/s2i-minimal-notebook', 
 'thoth-station/s2i-pytorch-notebook', 
 'thoth-station/s2i-scipy-notebook', 
 'thoth-station/s2i-tensorflow-gpu-notebook', 
 'thoth-station/s2i-tensorflow-notebook', 
 'thoth-station/s2i-thoth', 
 'thoth-station/search', 
 'thoth-station/selinon-api', 
 'thoth-station/selinon-worker', 
 'thoth-station/si-aggregator', 
 'thoth-station/si-bandit', 
 'thoth-station/slo-reporter', 
 'thoth-station/solver', 
 'thoth-station/solver-error-classfier', 
 'thoth-station/solver-errors-reporter', 
 'thoth-station/solver-project-url-job', 
 'thoth-station/source-management', 
 'thoth-station/srcops-testing', 
 'thoth-station/storages', 
 'thoth-station/support', 
 'thoth-station/sync-job', 
 'thoth-station/template-project', 
 'thoth-station/tensorflow-build-s2i', 
 'thoth-station/tensorflow-release-api', 
 'thoth-station/tensorflow-release-job', 
 'thoth-station/tensorflow-serving-build', 
 'thoth-station/thamos', 
 'thoth-station/thoth', 
 'thoth-station/thoth-application', 
 'thoth-station/thoth-github-action', 
 'thoth-station/thoth-ops-infra', 
 'thoth-station/thoth-pybench', 
 'thoth-station/thoth-station.github.io', 
 'thoth-station/user-api', 
 'thoth-station/website', 
 'thoth-station/workflow-helpers', 
 'thoth-station/workflows', 
 'thoth-station/zuul-config'}
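
A sketch of how a repo list like the one above could be produced from the three criteria (recent activity, stars, issues opened by accounts outside the org). This is not the code from the linked PR. The `RepoStats` type, the thresholds, and the `require_all` switch are assumptions, since the thread does not say whether the criteria were combined with AND or OR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Set


@dataclass
class RepoStats:
    """Per-repo summary; field names are hypothetical."""
    full_name: str
    pushed_at: datetime          # last update time (timezone-aware)
    stars: int
    external_issue_count: int    # issues opened by non-org accounts


def select_repos(
    repos: Iterable[RepoStats],
    now: datetime,
    max_age_days: int = 365,       # placeholder threshold
    min_stars: int = 1,            # placeholder threshold
    min_external_issues: int = 1,  # placeholder threshold
    require_all: bool = False,
) -> Set[str]:
    """Return the names of repos that pass any (or all) of the three checks."""
    cutoff = now - timedelta(days=max_age_days)
    selected = set()
    for repo in repos:
        checks = (
            repo.pushed_at >= cutoff,
            repo.stars >= min_stars,
            repo.external_issue_count >= min_external_issues,
        )
        if all(checks) if require_all else any(checks):
            selected.add(repo.full_name)
    return selected


# Hypothetical repos for illustration only.
utc = timezone.utc
now = datetime(2022, 6, 1, tzinfo=utc)
repos = [
    RepoStats("example-org/active", datetime(2022, 5, 1, tzinfo=utc), 0, 0),
    RepoStats("example-org/stale", datetime(2019, 1, 1, tzinfo=utc), 0, 0),
    RepoStats("example-org/popular-stale", datetime(2019, 1, 1, tzinfo=utc), 50, 0),
]
chosen = select_repos(repos, now)
```

With the default `require_all=False`, `chosen` contains `example-org/active` and `example-org/popular-stale`; the stale repo with no stars and no external issues is dropped.
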
suppathak commented 2 years ago

Related pull request: https://github.com/aicoe-aiops/ocp-ci-analysis/pull/495

PR data is uploaded to: "bucketname": "opf-datacatalog-morty"

Gkrumbach07 commented 2 years ago

/close

sesheta commented 2 years ago

@Gkrumbach07: Closing this issue.

In response to [this](https://github.com/open-services-group/community/issues/173#issuecomment-1142199000):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.