mozilla / participation-metrics-org

Participation metrics planning repository

Send list of available indexes to A&T #193

Open canasdiaz opened 5 years ago

canasdiaz commented 5 years ago

Our A&T folks need a list of the endpoints (credentials have already been sent) and the indexes they have to query. To make their lives easier, Bitergia will provide a short explanation of each index and links to documentation where available.

alpgarcia commented 5 years ago

The available aliases are listed below. Documentation is only partially available, so I'll add a brief description of each index to give you a quick idea of what it contains. Please ask us about anything else you may need. Aliases with more complete documentation link to those docs.

Some general notes:

GET _cat/aliases/affiliations?v
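
If you want to process that listing in a script rather than read it in the console, a minimal sketch could look like the one below. It assumes the same request is sent with `format=json` (a standard option of the Elasticsearch cat APIs), so each row arrives as an object with `alias` and `index` keys. The helper name `indexes_by_alias` and the index names in the sample are made up for illustration.

```python
from collections import defaultdict


def indexes_by_alias(cat_rows):
    """Group index names by alias from `_cat/aliases?format=json` rows."""
    grouped = defaultdict(list)
    for row in cat_rows:
        grouped[row["alias"]].append(row["index"])
    # Sort the index names so the result is stable across calls.
    return {alias: sorted(names) for alias, names in grouped.items()}


# Hypothetical rows: these index names are NOT taken from the real server.
sample_rows = [
    {"alias": "affiliations", "index": "git_example_1"},
    {"alias": "affiliations", "index": "bugzilla_example_1"},
]

print(indexes_by_alias(sample_rows))
```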



| alias | description |
|---|---|
| [git](https://github.com/chaoss/grimoirelab-elk/blob/master/schema/git.csv) | Each document corresponds to a commit. |
| [affiliations](https://github.com/chaoss/grimoirelab-elk/blob/master/schema/affiliations.csv) | Groups indexes corresponding to single data sources to allow building visualizations in Kibana that contain info from all of them, see [Affiliations panel](https://sotrar.mozilla.community/app/kibana#/dashboard/Affiliations) under Community entry in the top menu. |
| [all_onion](https://github.com/chaoss/grimoirelab-elk/blob/master/schema/onion.csv) | Groups all indexes containing pre-computed data based on onion analysis, see [Community -> Overall panel](https://sotrar.mozilla.community/app/kibana#/dashboard/Overall-Community-Structure). See also [panel documentation](https://chaoss.github.io/grimoirelab-sigils/panels/overall-community-structure/) to know more about it. |
| bugzilla | Each document corresponds to an issue in Bugzilla, with info about the creator and the assignee. |
| demographics | Groups indexes that contain information for building the [demographics panel](https://sotrar.mozilla.community/app/kibana#/dashboard/15e4b020-d075-11e8-8aac-ef7fd4d8cbad). See the [panel documentation](https://chaoss.github.io/grimoirelab-sigils/panels/demographics/) for details. For Mozilla it points only to the Git index, which includes two fields not listed in its CSV ([pull request sent](https://github.com/chaoss/grimoirelab-elk/pull/580)): `demography_min_date` and `demography_max_date`, which store the dates of the first and last contributions made by the corresponding author. |
| discourse | Contains questions, answers and accepted answers as separate elements. Explore [Discover tab](https://sotrar.mozilla.community/goto/a6640ddcbe0573e46c06a6c8ebc141a8) to understand a bit better how it is built. |
| [git-emergintech](https://github.com/chaoss/grimoirelab-elk/blob/master/schema/git.csv) | Same as `git` but filtered for emerging tech repositories only. |
| [git_areas_of_code](https://github.com/chaoss/grimoirelab-elk/blob/master/schema/areas_of_code.csv) | Built from the same RAW information as Git, but at the file level; see the [Areas of Code panel](https://sotrar.mozilla.community/app/kibana#/dashboard/Git-Areas-of-Code) to get an idea of the info we have there. Each document is a file rather than a commit, as it is in the Git index. |
| [github_issues](https://github.com/chaoss/grimoirelab-elk/blob/master/schema/github_issues.csv) | Documents can be either `issues` or `pull requests`. Use boolean field `pull_request` for filtering. |
| github-emergingtech | Same as `github_issues` but filtered for emerging tech repositories only. |
| meetup | Documents can be meetups, comments or rsvps. There are three fields named `is_meetup_<event_type>` of type long, whose value is 1 for the corresponding `<event_type> = ( meetup OR comment OR rsvp )`. There is also a `type` String field containing this information. See the Meetup panels in the Dashboard for examples of how we use these fields. For instance, see the [meetups table](https://sotrar.mozilla.community/goto/f719b7e88b1c75c6f5a0a07544da54d1), built on top of a search that keeps only meetups. |
| meetup-emergingtech | Same as `meetup` but filtered for emerging tech meetups only. |
| remo-activities | Contains information about Mozilla activities. See [Activities panel](https://sotrar.mozilla.community/app/kibana#/dashboard/Reps-Activities) to get an idea of what kind of information it contains. In short, it contains information about participants and mentors from the point of view of the activities. |
| remo-events | Contains information about Mozilla events. Each document corresponds to an event. See [Mozilla Reps, Events panel](https://sotrar.mozilla.community/app/kibana#/dashboard/Reps-Events) to have a look at the information we store in the index. |
| stackoverflow | Documents are questions and answers. Some info is shared by questions and their corresponding answers so that the dashboard can filter correctly. Questions also contain fields like `answer_count`, which stores the number of answers so it can be shown easily in the panels. Accepted answers are marked using the `is_accepted` boolean field and the `is_accepted_answer` long field (used for counts). |
| stackoverflow-emergingtech | Same as `stackoverflow` but filtered for emerging tech. |
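
As an illustration of the `pull_request` note in the `github_issues` row, here is a minimal sketch of a search body that keeps only pull requests or only issues. It assumes `pull_request` is indexed as a boolean, as described in the table; the helper name `github_items_query` and the `size` default are made up for the example.

```python
import json


def github_items_query(pull_requests, size=10):
    """Build a search body that keeps only pull requests (True) or only issues (False)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # A filter clause does not affect scoring, it only narrows the results.
                "filter": [{"term": {"pull_request": pull_requests}}]
            }
        },
    }


# Print the request in the form used in the Kibana console.
print("GET github_issues/_search")
print(json.dumps(github_items_query(True), indent=2))
```

The same body should work against `github-emergingtech`, since that alias is described as a filtered view of the same data.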

I'll keep this open to let you use this issue to ask us whatever else you may need.

havardl commented 5 years ago

Hey @alpgarcia, thank you for this!

One quick question about the github issue index. Is it possible to find a thread of replies to an issue or how is that data structured?

alpgarcia commented 5 years ago

@havardl

> One quick question about the github issue index. Is it possible to find a thread of replies to an issue or how is that data structured?

It is not in the enriched index, but in the RAW index that is also available in the server. You can query it as follows:

GET github_mozilla_180322/_search
{
  "query": {
    "match_all": {}
  }
}

And you'll get a JSON with all the info, even if we are not currently using that info in the enriched indexes. In this case you'll see something like:

"hits": [
      {
        "_index": "github_mozilla_180322",
        "_type": "items",
        "_id": "1a50caa1b3f70e638cc78b8ce9f5605462c396dc",
        "_score": 1,
        "_source": {
          "origin": "https://github.com/rust-lang/rust-www",
          "uuid": "1a50caa1b3f70e638cc78b8ce9f5605462c396dc",
          ...
          "data": {
            "user_data": {
             ...
            },
            "body": "",
            "milestone": null,
            ...
            "comments_data": [
              {
                "issue_url": "https://api.github.com/repos/rust-lang/rust-www/issues/319",
                "body": "Oh oops, thanks!\n",
                "created_at": "2016-03-11T20:12:19Z",
               ...

From here:
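
A minimal sketch of how the reply thread could be pulled out of one raw hit, assuming the layout shown in the excerpt above (`_source.data.comments_data`, with `created_at` and `body` on each comment). The helper name `issue_comments` is made up, and the sample hit is trimmed down to the fields visible in the excerpt.

```python
def issue_comments(raw_hit):
    """Return (created_at, body) pairs for one raw GitHub hit, oldest first."""
    data = raw_hit.get("_source", {}).get("data", {})
    # comments_data may be missing or null when an issue has no replies.
    comments = data.get("comments_data") or []
    pairs = [(c.get("created_at", ""), c.get("body", "")) for c in comments]
    # Timestamps in this ISO 8601 form sort chronologically as plain strings.
    return sorted(pairs)


# Sample hit reduced to the fields shown in the excerpt.
sample_hit = {
    "_index": "github_mozilla_180322",
    "_source": {
        "origin": "https://github.com/rust-lang/rust-www",
        "data": {
            "body": "",
            "comments_data": [
                {
                    "issue_url": "https://api.github.com/repos/rust-lang/rust-www/issues/319",
                    "body": "Oh oops, thanks!\n",
                    "created_at": "2016-03-11T20:12:19Z",
                }
            ],
        },
    },
}

for created_at, body in issue_comments(sample_hit):
    print(created_at, body.strip())
```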

edin-ogtal commented 5 years ago

@alpgarcia: According to this documentation above the git index should contain an "author_min_date" and "author_max_date" field. I don't seem to be able to find this when using the Discover tab on Kibana to view what fields are available. Can you help me figure out whether or not these fields are (or will be) available for querying?

canasdiaz commented 5 years ago

> @alpgarcia: According to this documentation above the git index should contain an "author_min_date" and "author_max_date" field. I don't seem to be able to find this when using the Discover tab on Kibana to view what fields are available. Can you help me figure out whether or not these fields are (or will be) available for querying?

My proposal, @alpgarcia, is to create a different ticket for this. Your guess is right, @edin-ogtal: these fields must be present in the index. If you can't find them, then we (Bitergia) have to fix it.

alpgarcia commented 5 years ago

@sanacl, @edin-ogtal my bad, I forgot to remove those fields. They are the old versions of demography_min_date and demography_max_date.

Anyway, @sanacl is right: it seems to be a problem with the process in charge of adding those fields, which should run once the enriched index is complete. I'm not sure whether the Git index is still incomplete, so the process simply has not started yet, or whether it is due to an error (in that case, I agree we should open a separate ticket to track it).

edin-ogtal commented 5 years ago

@alpgarcia I have a question regarding the grimoire_creation_date field, which seems to be present in all indices. In the git index it corresponds to the commit date, but I am unsure what this field corresponds to in, say, bugzilla, discourse, etc. Can you help clarify?

alpgarcia commented 5 years ago

Yep, @edin-ogtal,

at some point we decided to have a single field representing the creation date of items in their corresponding sources across all the different indexes. We decided to use grimoire as the prefix because it is the name of the project. That means that for a ticket it should store the date the ticket was created in the original ticketing system, for a post its creation date in the corresponding forum, etc.
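
Because the field has the same name everywhere, one date filter can be reused across aliases. A minimal sketch, assuming `grimoire_creation_date` is mapped as a date in each index; the helper name `created_between` and the dates are made up for illustration.

```python
import json


def created_between(start, end):
    """Range filter on grimoire_creation_date; start inclusive, end exclusive."""
    return {
        "query": {
            "range": {
                "grimoire_creation_date": {"gte": start, "lt": end}
            }
        }
    }


# Aliases taken from the table earlier in this issue.
aliases = ["git", "bugzilla", "discourse"]

# Elasticsearch accepts a comma-separated list of indexes or aliases in one search.
print("GET " + ",".join(aliases) + "/_search")
print(json.dumps(created_between("2018-01-01", "2019-01-01"), indent=2))
```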

alpgarcia commented 5 years ago

@havardl, @edin-ogtal

do you think we can close this task?

(of course you can ask us directly or open new issues if you need more information on a particular topic, no matter the state of this particular one)

havardl commented 5 years ago

Of course 👍

havardl commented 5 years ago

Hey @alpgarcia, one question about the Bugzilla index. There doesn't seem to be any information about who closed the bug report in the enriched index. Is this information available in the raw index?

alpgarcia commented 5 years ago

Hey @havardl that's a tricky one, good job :)

The short answer is yes. The long story is you have something like:

"history": [
...
{
                "who": "netscape@seawood.org",
                "changes": [
                  {
                    "removed": "NEW",
                    "added": "RESOLVED",
                    "field_name": "status"
                  },
                  {
                    "removed": "",
                    "added": "FIXED",
                    "field_name": "resolution"
                  },
                  {
                    "removed": "",
                    "added": "2001-12-02 14:22:06",
                    "field_name": "cf_last_resolved"
                  }
                ],
                "when": "2001-12-02T22:22:06Z"
              },
...

So you need to parse the history list and look for those changes you are interested in.
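
A minimal sketch of that parsing, assuming you have already pulled the `history` list out of a raw Bugzilla document and that its entries are in chronological order. The helper name `find_resolver` is made up. Treating a `status` change to `RESOLVED` as "closed" is also an assumption, so adjust the condition if you care about other statuses.

```python
def find_resolver(history):
    """Return (who, when) for the latest change that set status to RESOLVED, or None."""
    result = None
    for entry in history:
        for change in entry.get("changes", []):
            if change.get("field_name") == "status" and change.get("added") == "RESOLVED":
                # Keep overwriting so that a bug reopened and resolved again
                # reports the last person who resolved it.
                result = (entry.get("who"), entry.get("when"))
    return result


# The history entry shown above, trimmed to the first two changes.
sample_history = [
    {
        "who": "netscape@seawood.org",
        "changes": [
            {"removed": "NEW", "added": "RESOLVED", "field_name": "status"},
            {"removed": "", "added": "FIXED", "field_name": "resolution"},
        ],
        "when": "2001-12-02T22:22:06Z",
    }
]

print(find_resolver(sample_history))
```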

Hope it helps!

havardl commented 5 years ago

Thanks for the clarification @alpgarcia! Along the same lines, I'm wondering where we can find information about who either rejected or approved a pull request? I'm guessing we have to look in one of the raw indexes?

alpgarcia commented 5 years ago

@havardl,

we don't have that information in the current raw index. We are checking whether it would be possible to add that info to the indexes. We also have a new version of the GitHub backend, so we are also checking if it would retrieve the information you need.

Once confirmed, having new indexes with that info should take us 2-3 days. Would that work for you?

One final note: RAW indexes do not contain SortingHat info, so you would need to query SortingHat with the email or GitHub handle to get the information of the corresponding identity.

havardl commented 5 years ago

Hi @alpgarcia, your proposed timeline works for us and thanks for the heads up on identification across indexes. Do you want to create a separate issue for that, and then we can close this?

alpgarcia commented 5 years ago

Yes @havardl, it is probably better to have that separate. Would you mind creating the issue?

I'll update it as soon as our engineering team confirms what information we can retrieve and in which specific index.

alpgarcia commented 5 years ago

@havardl, quick update: we don't need a separate issue, because the team has confirmed that information about the person who closed the issue is not retrieved from GitHub in any of our indexes :(

canasdiaz commented 5 years ago

Can we close this ticket, @havardl?

havardl commented 5 years ago

👍