oss-aspen / 8Knot

Dash app in development to serve open source community visualizations using GitHub data from Augur. Hosted app: https://eightknot.osci.io
MIT License
47 stars 59 forks source link

Contributor Issue assignee Graph #442

Closed cdolfi closed 12 months ago

cdolfi commented 1 year ago

Things to note during review:

cdolfi commented 1 year ago

@JamesKunstle lmk what you think, ill wait to do any changes on these two until we come to a consensus

JamesKunstle commented 1 year ago

@cdolfi Here's a unified response to your responses:

I've become aware of the subtlety of this visualization that I wasn't aware of last night, even looking at the processing steps: this shows new assignment events per contributor per bin, rather than number of issues contributor is assigned to per bin. Is that correct?

If that's the case, then I'd suggest a corresponding title like "Issue Assignment Events" or something - the subtlety has to be squashed because the analysis being done is more interesting than a naive count of something like "In march, Cali had 4 issues assigned to her, and in April those issues were still assigned, so the count is still 4."

I'd also suggest an Info along the lines of:

"Visualizes the number of times an issue was assigned to a contributor, per contributor, per time-bin."

This would clarify that what's being counted isn't the number of issues assigned to each contributor but rather the direct action of an issue being assigned. To me, that isn't immediately obvious when looking at the data without further context."

It's a comms problem way more than an analysis problem, and as always, we're trying to pack all the context into like, 4 words and 20 words for title and info, respectively haha.

cdolfi commented 1 year ago

@JamesKunstle So this visualization actually does show all issues that are currently assigned to them during that time interval. With the processing steps, it first takes out all the issues created after the time interval and closed before had, therefore only leaving the issues currently open during that time interval (and that has ever been assigned to them in the first place but.. ill get to this later in this comment). Then it filters out the assignments or unassignments that occurred after the time interval. Then counts the assignments and unassignments to get the delta even if they were assigned before that time interval.

This comment though did point out to me that I have not taken into account the issues that have never been assigned. This doesnt impact this visualization, but it greatly impacts the other one. Im going to change the query to work for both cases, so this has been extremely important logic for me revisit

cdolfi commented 1 year ago

@JamesKunstle Updated the query. With the left join (which is necessary to get all the issue values), the query is much slower which makes sense. Once we are locked in on this query and the pr equivalent ill get a materialized view made of it

JamesKunstle commented 12 months ago

@cdolfi I need to spend a bit more time with your PR now that I've read your latest comments- I don't think I intuited the nuance of the data initially. Will have all PR reviews for these four PRs in by end of my work day, which will be past EOB (will certainly be available at the start of your workday tomorrow.)

cdolfi commented 12 months ago

@JamesKunstle For some reason it is not letting me comment directly on your last one so Ill put it here. The count will never be negative because it is counting all assignments and un assignments up until the end date.

Lines 314-320 take out the issue assignments on issues that where opened after the end date, and the issues closed before the start date. This leaves the df_in_range to only be the issue assignments on issues that are open during the time interval.

cdolfi commented 12 months ago

@JamesKunstle ready for review based on comments

JamesKunstle commented 12 months ago

@JamesKunstle For some reason it is not letting me comment directly on your last one so Ill put it here. The count will never be negative because it is counting all assignments and un assignments up until the end date.

Lines 314-320 take out the issue assignments on issues that where opened after the end date, and the issues closed before the start date. This leaves the df_in_range to only be the issue assignments on issues that are open during the time interval.

  • lines 258-261: Takes all issue events that are "unassigned" and open during the time interval. Then cuts off any unassignments that have happened after the end date. Then NOTE: it does not cut off the assignments that happened before the time interval
  • lines 258-261: Takes all issue events that are "unassigned" and open during the time interval. Then cuts off any unassignments that have happened after the end date. NOTE: it does not cut off the assignments that happened before the time interval

It can be though of as a counter. For each assignment there is one more assigned to the contributor, and then for each un-assignment there is one less, so subtracting those two values gives you the number of assignments in that time interval. And this is only including assignments to that specific contributor (line 311). @cdolfi

What happens if an issue, which is open across two time-windows, is assigned to a contributor in the first window and unassigned in the next window?

e.g. For January, issue 1 is opened and assigned to contrib 'A', then in Febuary it's unassigned to 'A' but isn't assigned to anyone else. It also isn't closed in Feb. Wouldn't this lead to a negative count for contributor 'A'? They'd have a single 'unassignment' event for Febuary.

Admittedly, this is a corner case, and perhaps I'm missing something obvious in the data or analysis, but it seems like this is possible.

-- Once this nuance is solved, I want to turn attention back to the Info section, but I want to understand the above situation from your perspective first. I fear that I might come across as too picky with the 'Info' section on some visualizations, but that's only because I'm frequently frustrated with other people's descriptions in similar settings and want our to be very clearly better.

cdolfi commented 12 months ago

@JamesKunstle For some reason it is not letting me comment directly on your last one so Ill put it here. The count will never be negative because it is counting all assignments and un assignments up until the end date. Lines 314-320 take out the issue assignments on issues that where opened after the end date, and the issues closed before the start date. This leaves the df_in_range to only be the issue assignments on issues that are open during the time interval.

  • lines 258-261: Takes all issue events that are "unassigned" and open during the time interval. Then cuts off any unassignments that have happened after the end date. Then NOTE: it does not cut off the assignments that happened before the time interval

^ see note

  • lines 258-261: Takes all issue events that are "unassigned" and open during the time interval. Then cuts off any unassignments that have happened after the end date. NOTE: it does not cut off the assignments that happened before the time interval

^ see note

It can be though of as a counter. For each assignment there is one more assigned to the contributor, and then for each un-assignment there is one less, so subtracting those two values gives you the number of assignments in that time interval. And this is only including assignments to that specific contributor (line 311). @cdolfi

What happens if an issue, which is open across two time-windows, is assigned to a contributor in the first window and unassigned in the next window?

e.g. For January, issue 1 is opened and assigned to contrib 'A', then in Febuary it's unassigned to 'A' but isn't assigned to anyone else. It also isn't closed in Feb. Wouldn't this lead to a negative count for contributor 'A'? They'd have a single 'unassignment' event for Febuary.

No, see above. Let me know if you want to hop on a call if it would be easier for me to explain the logic verbally

Admittedly, this is a corner case, and perhaps I'm missing something obvious in the data or analysis, but it seems like this is possible.

-- Once this nuance is solved, I want to turn attention back to the Info section, but I want to understand the above situation from your perspective first. I fear that I might come across as too picky with the 'Info' section on some visualizations, but that's only because I'm frequently frustrated with other people's descriptions in similar settings and want our to be very clearly better.

JamesKunstle commented 12 months ago

@cdolfi Okay take-two:

I understand now- the missing piece was not catching the detail "...does not cut off the assignments that happened before the time interval"

I see now that for each successive time-bin, unassignments are matching previously counted assignments, and new assignments without paired unassignments increase the count for that bin.

JamesKunstle commented 12 months ago

@cdolfi would you mind please manually merging?