Show summary stats of users/participants on Experimenter

kimberscott commented 6 years ago

Pain point: Upon launch, we will need to work to increase the Lookit userbase via outreach and advertising, but we don't currently have a way to evaluate such efforts (e.g., to see how many people registered in the past week).

Acceptance criteria: The family outreach specialist can easily monitor and evaluate advertising efforts by answering questions like the following, using the Lookit admin or experimenter interface without doing any programming:

How many families are registered? How many registered in the last week? Between dates X and Y? (Ideally a graph would be helpful)
How many families have participated in studies? Between dates X and Y? What is the distribution of number of unique studies families participate in?
How many of those have logged in in the past 6 months?
What's the age distribution of children of active users?
How many children have each of the diagnoses listed? (dependent on #141)
How many people registered (or participated in studies) each day over the last 3 months? etc.
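
For concreteness, here is a rough sketch of the arithmetic behind the registration questions above (count between dates X and Y, and count per day over a window). It is plain Python over a list of registration dates, not the actual Lookit models or queries; the function names and sample data are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def registrations_between(registration_dates, start, end):
    """Count registrations with start <= date <= end (both ends inclusive)."""
    return sum(1 for d in registration_dates if start <= d <= end)


def registrations_per_day(registration_dates, end, days):
    """Return {day: count} for the `days` days ending on `end`.

    Days with no registrations are included with a count of 0 so that a
    graph of the result has no gaps.
    """
    counts = Counter(registration_dates)
    window = [end - timedelta(days=offset) for offset in range(days - 1, -1, -1)]
    return {day: counts.get(day, 0) for day in window}


# Hypothetical sample data: one registration date per family account.
sample = [date(2019, 4, 1), date(2019, 4, 1), date(2019, 4, 3), date(2019, 4, 9)]

last_week = registrations_between(sample, date(2019, 4, 3), date(2019, 4, 9))
daily = registrations_per_day(sample, end=date(2019, 4, 3), days=3)
```

In a Django view the same counts would more likely come from a date-filtered queryset, but the inclusive-range and zero-filling decisions would be the same.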

Implementation notes/Suggestions: This can possibly be part of either the experimenter or the admin apps in Django. It seems like it might build on existing functionality in admin, except that we don't want usage to be limited to people who are actually admins (able to see/manipulate all data).

We've discussed building a dashboard and essentially fetching a bunch of data, then allowing filtering down from that using sliders/etc. (e.g. for age range, demographics). It could show things like new participants registered per week, a bar chart of the age distribution, tables of demographic form responses, and a plot of # unique study participants / week (one line for total unique participants, lines for individual studies).
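
A minimal sketch of the "# unique study participants / week" series described above, with one total line plus one line per study. It assumes responses are available as simple (study name, child ID, date) tuples; that shape, the `"all"` key, and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def unique_participants_per_week(responses):
    """Count unique children per week, overall and per study.

    `responses` is an iterable of (study_name, child_id, response_date).
    Returns {week_start: {"all": n, study_name: n, ...}}. The "all" key is a
    hypothetical label and would collide with a study literally named "all".
    """
    seen = defaultdict(lambda: defaultdict(set))
    for study, child_id, day in responses:
        week = week_start(day)
        seen[week]["all"].add(child_id)
        seen[week][study].add(child_id)
    return {
        week: {key: len(child_ids) for key, child_ids in by_key.items()}
        for week, by_key in seen.items()
    }


# Hypothetical sample data.
sample = [
    ("study-a", "c1", date(2019, 4, 1)),
    ("study-a", "c1", date(2019, 4, 2)),  # same child, same week: counted once
    ("study-b", "c1", date(2019, 4, 3)),
    ("study-b", "c2", date(2019, 4, 4)),
    ("study-a", "c3", date(2019, 4, 8)),
]
weekly = unique_participants_per_week(sample)
```

Using sets of child IDs rather than response counts is what keeps a child who participates several times in one week from inflating the line.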

It might turn out that there's nothing preventing us from allowing all researchers to use this from an ethics/privacy standpoint (if there's no way for them to get identifying info, just composite stats we could share with them anyway), which would be great someday, BUT for the purposes of deciding whether we need to engineer database access for many users, the primary intended users are still a couple of people at MIT.

kimberscott commented 5 years ago

Minor addition that would be helpful under same goal: allow experimenters to see "date created" of account model in addition to "last active," so that they can at least crudely tell the difference between existing participants and people who created accounts in response to their recruitment efforts

kimberscott commented 5 years ago

Rough sketch of what this might look like. This is essentially a big wish list that I expect to be pared down!

[Image: Participant Stats]

kimberscott commented 5 years ago

Permissions, per discussion just now:

In general: child and demographic info only from children associated with responses with consent approved (note there can be sensitive info included in free-response; aligning permissions with permissions for all other access keeps it straightforward to make sure we're not sharing info we shouldn't be sharing)
If superuser: child and demographic info from all children associated with responses for selected study (or any study, depending on selection)
- Eventually move this to separate permission so e.g. Kamaria can use this feature w/o being a superuser
Add text specifying which set is being used - e.g. in the 'define subset' portion just add 'and had consent video approved' after 'who have participated in [any study/specific study]' if user is not superuser
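
To pin the rules above down, here is a rough sketch of the selection logic in plain Python. The `Response` class, field names, and function names are hypothetical stand-ins, not the real models, and restricting to studies the user has access to is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """Hypothetical stand-in for a study response record."""
    child_id: str
    study: str
    consent_approved: bool


def visible_child_ids(responses, is_superuser, study=None):
    """Return IDs of children whose child/demographic info may be shown.

    `study=None` means "any study". Non-superusers only get children with at
    least one matching response where consent was approved; superusers get
    every child with a matching response.
    """
    selected = set()
    for response in responses:
        if study is not None and response.study != study:
            continue
        if is_superuser or response.consent_approved:
            selected.add(response.child_id)
    return selected


def subset_description(is_superuser, study=None):
    """Build the text for the 'define subset' portion of the page."""
    target = study if study is not None else "any study"
    text = f"who have participated in {target}"
    if not is_superuser:
        text += " and had consent video approved"
    return text


# Hypothetical sample data.
responses = [
    Response("c1", "Study A", True),
    Response("c2", "Study A", False),
    Response("c3", "Study B", True),
]
```

If this moves to a separate permission later, the `is_superuser` flag here would be replaced by whatever check that permission uses.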

kimberscott commented 5 years ago

This is an amazing tool and I'm very excited about how powerful the pivot table approach is! I'm eager to get this through to production because I actually want to use it, having never had the time to go through this sort of info in much detail.

Nitpicking as always:

[x] Can we move the registration, participation overviews to the top (above pivot tables)?
[x] I'm not seeing any data in "Cumulative Registrations" locally - is that just me? Can you show how it looks with more data?
[x] In the Registrations / Participation graphs can we allow selection of the time window to look at? (For recruitment purposes we may sometimes want to look quite broadly and other times at, say, the last month.)
[x] In the Participation graph can "all studies" (meaning all that you have access to) be an option? Also can the text Study_study__name be changed to something like "Include children who have participated in..."
[x] Can we also allow looking at either cumulative or just new data - so that e.g. it's easy to answer the question "how many families registered each day of the last week" versus "how many families were registered last April vs now"
[x] Somewhere on/near the Registration graph can we show the total number of families registered
[x] Somewhere on/near the Participation graph can we show the total number of unique children & total number of sessions
[x] Moving to the pivot table section: child age in months & child age in years need to be rounded (floored) so we can bin by month/year instead of just transforming the days variable :)
[x] Along with the multi-value field breakdowns, we need to have some way to display all values for the fields "Lookit referer" & "Other information", probably ideally with some way to toggle showing/hiding & only show non-empty values (but show count of how many empty)
[x] Is it possible to allow filtering of responses in pivot table / multi-value field breakdowns - e.g., only include people who have participated in X study / from users who have logged in within past year / who registered in certain time window?
[x] Point of curiosity only: what do the little arrows next to the dependent variable dropdown (e.g. "Total # responses") do?
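
To illustrate the cumulative vs. new request in the list above: the two views are the same data, one being the running total of the other. A minimal sketch in plain Python (the function names and `starting_total` parameter are made up for illustration):

```python
def cumulative_counts(new_counts, starting_total=0):
    """Turn per-day new registrations into running totals.

    `starting_total` is the number already registered before the window, so
    the cumulative line starts from the real total rather than from zero.
    """
    totals = []
    running = starting_total
    for count in new_counts:
        running += count
        totals.append(running)
    return totals


def new_counts_from_cumulative(totals, starting_total=0):
    """Inverse of cumulative_counts: recover per-day new registrations."""
    new_counts = []
    previous = starting_total
    for total in totals:
        new_counts.append(total - previous)
        previous = total
    return new_counts


# Hypothetical sample data: new registrations on four consecutive days,
# with 100 families already registered before the window.
new_per_day = [3, 0, 2, 1]
cumulative = cumulative_counts(new_per_day, starting_total=100)
```

"How many families registered each day of the last week" reads off `new_per_day`; "how many were registered last April vs now" compares two points on `cumulative`.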

Various requests for clarifying text:

[x] Indicate somewhere which families/children/responses are being included
[x] Need a very small amount of explanatory text about the pivot table, even just that this is how you can generate various summaries of participant characteristics, broken down by fields you can choose.
[x] Maybe remove header "multi-value field breakdowns" (doesn't make sense to someone who isn't thinking in terms of how this data is modeled, and doesn't apply to 'studies') - originally I was going to suggest changing it but I'm not sure what to or that it needs it
[x] "Pivot Presets" -> "Pivot Table Examples" (these aren't themselves especially important tables/graphs, just useful to see how things work, so I suspect presets may be misleading)
[x] I should provide some text to go at the top about the purpose of this view & some cautions about how the data can be used/shared (e.g. a reminder that demographic data may not be shared in a way that allows linking it to video, and an example of how that could happen without publishing an ID)

Datamance commented 5 years ago

Going to leave comments as I work through these:

For item, "I'm not seeing any data in "Cumulative Registrations" locally - is that just me? Can you show how it looks with more data?"

That's what mine looks like - yours could be not rendering because you don't have multiple users on your local instance.

kimberscott commented 5 years ago

Hmm, it looks like I have 6 users, 8 kids registered on my local instance but only one that's really participated. Is the number of registrations pulled from the same set of responses as everything else? Is there any way to get actual total registrations here?

Datamance commented 5 years ago

Real quick just going to address this:

> Is it possible to allow filtering of responses in pivot table / multi-value field breakdowns - e.g., only include people who have participated in X study / from users who have logged in within past year / who registered in certain time window?

Using the arrows on the sidebar variables, you can restrict values. Time filtering would require a little ~more work~ hacking to wire the date filter (or a new one) to the underlying data set.

kimberscott commented 5 years ago

Does this end up restricting values in the multi-value field breakdowns as well? If not it might be worth doing that and possibly doing the hacking to use the date filter - the use case I'm thinking of is looking at the "how did you hear about Lookit" field for active or recently-recruited participants.

Datamance commented 5 years ago

> Does this end up restricting values in the multi-value field breakdowns as well? If not it might be worth doing that and possibly doing the hacking to use the date filter - the use case I'm thinking of is looking at the "how did you hear about Lookit" field for active or recently-recruited participants.

It doesn't - the multi-value field breakdowns are predicated on the children (and probably should be called as such - something like "Child characteristics"). Right now, what I have served up into the template context is this (skipping a few intermediate lines of code):

        # Children with at least one response in the annotated response set.
        children_queryset = Child.objects.filter(
            id__in=annotated_responses.values_list("child", flat=True).distinct()
        )
        children_pivot_data = unstack_children(children_queryset, studies_for_child)

        ...

        # Three counters, converted to plain dicts of counts for the template.
        ctx["studies"], ctx["languages"], ctx["characteristics"] = [
            dict(counter) for counter in children_pivot_data
        ]

So the data structures are pretty much hardcoded to give a count for all responses.

I could change this much in the way that we've been changing everything else in this view - basically defer the calculations to the browser, and instead of passing three dictionaries of counts into the template context, pass in a JSON blob of children (keys would be child UUIDs and values would be JS objects containing the lists of characteristics/languages/studies for those children).

This way, whenever we produce a new set of timeseries data and unique child IDs as a byproduct, we can use those IDs to key into that "child info" object and do the counts on the fly.
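
Roughly what I have in mind, sketched in Python for illustration (in practice the counting half would run in the browser in JS). The blob layout, UUIDs, and function name here are assumptions, not what's in the view today:

```python
import json
from collections import Counter

# Hypothetical "child info" object: keys are child UUIDs, values hold the
# lists of studies/languages/characteristics for that child.
child_info = {
    "uuid-1": {"studies": ["Study A", "Study B"], "languages": ["en"], "characteristics": ["multiple_birth"]},
    "uuid-2": {"studies": ["Study A"], "languages": ["en", "es"], "characteristics": []},
    "uuid-3": {"studies": ["Study B"], "languages": ["es"], "characteristics": []},
}

# This is what would be passed into the template context instead of three
# precomputed dictionaries of counts.
blob = json.dumps(child_info)


def count_child_fields(child_info, child_ids):
    """Count studies/languages/characteristics for just the given children.

    `child_ids` would be the unique child IDs that fall out of whatever
    timeseries filter is currently applied. Unknown IDs are skipped.
    """
    counts = {"studies": Counter(), "languages": Counter(), "characteristics": Counter()}
    for child_id in child_ids:
        info = child_info.get(child_id)
        if info is None:
            continue
        for field, counter in counts.items():
            counter.update(info.get(field, []))
    return {field: dict(counter) for field, counter in counts.items()}


# Children left after some hypothetical date filter.
filtered = count_child_fields(json.loads(blob), ["uuid-1", "uuid-3"])
```

The tradeoff is a bigger payload up front in exchange for not round-tripping to the server every time a filter changes.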

kimberscott commented 5 years ago

Hmm I guess I was conflating the multi-value field breakdowns and the free-response answer displays. I'm ok leaving the multi-value field breakdowns as is (as long as they're labeled so we know what's being counted where) but would like to be able to filter the "additional info" & "how did you hear about Lookit" fields.
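
To make the free-response request concrete, here is a rough sketch of the behavior I'm picturing: apply a filter, show only non-empty answers, and report how many were empty. The dict keys, dates, and function name are made up for illustration and don't correspond to real field names.

```python
from datetime import date


def free_response_summary(accounts, field, include=lambda account: True):
    """Return (non_empty_values, empty_count) for a free-response field.

    `accounts` is a list of dicts; `include` is a predicate selecting which
    accounts to look at (e.g. recently active ones). Missing, None, and
    whitespace-only answers all count as empty.
    """
    values = []
    empty_count = 0
    for account in accounts:
        if not include(account):
            continue
        text = (account.get(field) or "").strip()
        if text:
            values.append(text)
        else:
            empty_count += 1
    return values, empty_count


# Hypothetical sample data.
accounts = [
    {"referrer": "Facebook ad", "last_participated": date(2019, 5, 1)},
    {"referrer": "", "last_participated": date(2019, 5, 10)},
    {"referrer": "Friend", "last_participated": date(2017, 1, 1)},
    {"referrer": None, "last_participated": None},
]


def participated_recently(account):
    """True if the account has a participation date on or after 2018-06-01."""
    last = account["last_participated"]
    return last is not None and last >= date(2018, 6, 1)


recent_values, recent_empty = free_response_summary(accounts, "referrer", participated_recently)
all_values, all_empty = free_response_summary(accounts, "referrer")
```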

Datamance commented 5 years ago

> Point of curiosity only: what do the little arrows next to the dependent variable dropdown (e.g. "Total # responses") do?

¯\(ツ)/¯ it looks like they change the order?

Datamance commented 5 years ago

> Indicate somewhere which families/children/responses are being included

Can you clarify this a bit? Do you mean including User and Child UUIDs somewhere?

> Need a very small amount of explanatory text about the pivot table, even just that this is how you can generate various summaries of participant characteristics, broken down by fields you can choose.

Do we need another docs page for the pivot table?

> I should provide some text to go at the top about the purpose of this view & some cautions about how the data can be used/shared (e.g. a reminder that demographic data may not be shared in a way that allows linking it to video, and an example of how that could happen without publishing an ID)

Sure thing - let me know when you have the language ready and I'll paste it in.

kimberscott commented 5 years ago

> > Indicate somewhere which families/children/responses are being included
>
> Can you clarify this a bit? Do you mean including User and Child UUIDs somewhere?

Ah, sorry, I mean how are they being selected. So e.g. in the pivot table - only responses from studies you have read access to and where consent has been confirmed and children/accounts associated with those responses. (This may vary by superuser status?)

> > Need a very small amount of explanatory text about the pivot table, even just that this is how you can generate various summaries of participant characteristics, broken down by fields you can choose.
>
> Do we need another docs page for the pivot table?

That's a good idea at some point, but for now really just 1-2 sentence statement of what this thing is.

> Sure thing - let me know when you have the language ready and I'll paste it in.

Will do - can have tomorrow!

kimberscott commented 5 years ago

Proposed language for at the top:

The information on this page is provided primarily for the purposes of evaluating your recruitment efforts: how well do various approaches work? What populations do they reach? You may also find it helpful for reporting aggregate characteristics of your participants. Please note that demographic survey data may never be published such that it could be linked to an individual participant's video (see Terms of Use). Before sharing any demographic data, consider whether it might be possible to link to individual participants: e.g., because a child's name is mentioned in a comment or because only one family speaks a particular language and that language is used in their video.
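
As an illustration of the "only one family speaks a particular language" caution in the proposed text, here is a rough sketch of a check someone could run before sharing a breakdown: list the values held by very few families. This is only one narrow check under assumed data shapes (the dict keys and function name are hypothetical), not a complete test for whether data could be linked to a participant.

```python
from collections import defaultdict


def values_held_by_few_families(children, field, max_families=1):
    """Return values of a multi-value field that appear in few families.

    `children` is a list of dicts with a "family_id" key and a list under
    `field`. A value is flagged when the number of distinct families having
    it is <= `max_families`. Siblings in one family count once.
    """
    families_by_value = defaultdict(set)
    for child in children:
        for value in child[field]:
            families_by_value[value].add(child["family_id"])
    return sorted(
        value
        for value, families in families_by_value.items()
        if len(families) <= max_families
    )


# Hypothetical sample data: only family f1 has a child who speaks "yo".
children = [
    {"family_id": "f1", "languages": ["en"]},
    {"family_id": "f1", "languages": ["en", "yo"]},
    {"family_id": "f2", "languages": ["en"]},
    {"family_id": "f3", "languages": ["es"]},
    {"family_id": "f4", "languages": ["es"]},
]
flagged = values_held_by_few_families(children, "languages")
```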

kimberscott commented 5 years ago

This is so cool I hate to do more nitpicking because really I just want it up on prod, but... here are my nitpickings from looking at it on staging:

Explanatory text:

[x] Let's move "The data that you see in this page corresponds to responses from studies that you have read access to where consent has been confirmed, along with children/accounts associated with those responses." from the participation/registration section down to summary stats, since participation/registration does include non-confirmed consent
[x] But add info inside "recruitment data plots" saying what data is included in "participation" and "registration". (Participation - check if this is correct: all responses to studies you have read access to, regardless of consent confirmation status. a response is created each time a participant completes the consent frame of your study. Registration: is this all registered Lookit participants?)
[x] Add info inside "global data filters" indicating what that applies to. (Participation, but not registrations; pivot table? child characteristics?)
[x] Clarify in "global data filters" that date range is of participation
~Inside global data filters (if it applies to the pivot table/child characteristics) or inside the pivot table, can we have some way to select "active users" - e.g. people who have logged in w/i past year? (We have last login timestamp on the account.) Would be helpful for evaluating overall characteristics of participant userbase - right now we're mixing in a lot of accounts moved over from the old platform where people have never logged in and likely never will.~
[x] Does being a superuser affect what data is included in participation, registration, pivot table, and/or child characteristics? If so include note to that effect somewhere. (Can be shown only if user is superuser, or shown regardless.)

Pivot table:

[x] The charts for "unique families", "unique children," and "child age..." aggregators would be more helpful with y axis labels
[x] Age in months/years still needs to be rounded (floored) so that aggregation works, unless I'm missing something - here's an example of what I see:
[x] Let's remove the "birth day of week", "birth month", and "birth year" - no reason anyone needs these for recruitment or summary stats, and some risk of accidentally-identifying info
[x] Let's add in spouse educational attainment since it's in the demographic survey
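
To spell out the age rounding item above: if age in months/years is just age in days divided by a constant, nearly every child gets a distinct fractional value and nothing aggregates. Flooring first gives usable bins. A minimal sketch, assuming ages are derived from days using average month/year lengths (that conversion and the function names are assumptions, not necessarily what the view does):

```python
import math
from collections import Counter

DAYS_PER_YEAR = 365.25              # average year length, an assumption
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # average month length


def age_in_months(age_days):
    """Whole months completed, floored so children bin by month."""
    return math.floor(age_days / DAYS_PER_MONTH)


def age_in_years(age_days):
    """Whole years completed, floored so children bin by year."""
    return math.floor(age_days / DAYS_PER_YEAR)


# Hypothetical sample data: child ages in days.
ages_in_days = [364, 400, 500, 800]

# Without flooring these would be four distinct fractional values;
# with flooring, the two one-year-olds share a bin.
children_per_year = Counter(age_in_years(days) for days in ages_in_days)
```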

Datamance commented 5 years ago

> does being a superuser affect what data is included in participation, registration, pivot table, and/or child characteristics? If so include note to that effect somewhere. (Can be shown only if user is superuser, or shown regardless.)

It does, I just had the wrong text in the wrong conditional block. You'll see that you get data for all responses (with the added restriction of consented only for the pivot table) when you're a superuser.

Datamance commented 5 years ago

> The charts for "unique families", "unique children," and "child age..." aggregators would be more helpful with y axis labels

I'm seeing them on mine -

Are they just not rendering on yours?

Datamance commented 5 years ago

> Inside global data filters (if it applies to the pivot table/child characteristics) or inside the pivot table, can we have some way to select "active users" - e.g. people who have logged in w/i past year? (We have last login timestamp on the account.) Would be helpful for evaluating overall characteristics of participant userbase - right now we're mixing in a lot of accounts moved over from the old platform where people have never logged in and likely never will.

I'm a little confused - if a participant has had a session in some time window, then they have also logged in, no? Put in terms of the contrapositive: If a user hasn't logged in, they haven't had a session, which means they wouldn't be in the dataset to begin with. Unless I'm missing something?

kimberscott commented 5 years ago

Ohhh that's fair, sorry. It's possible to have logged in but not participated, but probably uncommon and the distinction isn't too important - it would indeed work just as well to select for having participated in a study in the past year. So never mind!

kimberscott commented 5 years ago

Sorry, I meant y axis tick labels - numbers for the horizontal lines. I do see the labels like "Unique families."

Datamance commented 5 years ago

It looks like this is a bug with Google Charts: https://github.com/google/google-visualization-issues/issues/2693

I'm going to look for workarounds and let you know what I find.

Datamance commented 5 years ago

It does look like we'll have to revert to an old version (45) in order to have this work properly https://github.com/nicolaskruchten/pivottable/issues/1082

Let's hope it doesn't affect too much else!

Datamance commented 5 years ago

Closing this beast.

lookit / lookit-api

Show summary stats of users/participants on Experimenter #134