openedx / axim-engineering

GitHub Issue repository for the Axim engineering team
https://openedx.atlassian.net/wiki/spaces/COMM/pages/3241640370/Axim+Collaborative+Engineering+Team

Discovery: Script solution for Instance survey #62

Closed jmakowski1123 closed 2 years ago

jmakowski1123 commented 2 years ago

Context

tCRIL is generating an Impact Report that quantifies the landscape of Open edX Instances globally. A large portion of this data will be elicited directly from Providers, within the boundaries of the standard Provider contract. Draft survey questions are here. To facilitate survey uptake, we'd like to automate the process of answering some or all of the questions. The results of the survey will be analyzed and summarized in aggregate, and the anonymized results shared publicly. We will present the results at the Open edX conference in April.

Acceptance Criteria:

The Provider is given a quick and seamless method by which to autogenerate data to answer the following questions for each of their Instances:

The end-result data is captured in .csv (or similar), with a clear connection between each Instance URL and its corresponding data listed above.

Approach:

The purpose of this ticket is to explore solutions for the method by which to autogenerate data, and to propose a recommended method/approach. Based on a brainstorming session during the January 5 Standup, one highly viable approach is to write a script that Providers can embed into each of their Instances.

ormsbee commented 2 years ago

Some questions:

  1. What level of accuracy do we care about? So for instance, if the # of learners or enrollments is off by 5%, do we care? (Some things are faster/simpler to query if we accept less accuracy.)
  2. Do we want to capture anything more granular about enrollments/completions for some particular window of time? Next year, would we want to ask the question: “how many certificates were granted this year?” Or is it enough that we capture the snapshots of all of these and we can extrapolate the delta year to year?
  3. If it was not a burden on the site operator, would we want updates at some higher frequency (quarterly, monthly)?

The purpose of this ticket is to explore solutions for the method by which to autogenerate data, and to propose a recommended method/approach. Based on a brainstorming session during the January 5 Standup, one highly viable approach is to write a script that Providers can embed into each of their Instances.

Getting people to install something is a high barrier to entry for a survey. It might make sense to start with a Google Forms sort of approach for the first iteration of this, and to have a bundled app that operators could choose to opt into starting with the Nutmeg release.

ormsbee commented 2 years ago

That app could later be used for other really useful pulse-of-the-community things like determining which sites use which feature flags, and other things that would be useful to know for support and deprecation purposes. Some sites might not want to give up their enrollment numbers, but they might be at least willing to share which features they're using if it can be collected automatically.

jmakowski1123 commented 2 years ago

Some questions:

  1. What level of accuracy do we care about? So for instance, if the # of learners or enrollments is off by 5%, do we care? (Some things are faster/simpler to query if we accept less accuracy.)

Given that the nature of the project (quantifying current Open edX Instances) is fairly opaque to begin with, my initial thought is that we can err on the side of fair/reasonable accuracy. But I'm very curious to hear @e0d's thoughts.

For a framework, I'd say strike a balance between reasonable accuracy and the realistic timeline for this project, which is to gather, analyze and present the data at the April conference (sorry...that's very un-Agile-like, with such a hard deadline!)

  2. Do we want to capture anything more granular about enrollments/completions for some particular window of time? Next year, would we want to ask the question: “how many certificates were granted this year?” Or is it enough that we capture the snapshots of all of these and we can extrapolate the delta year to year?

For the purposes of the first survey, I think it's enough to be able to say the current number of learners, the current number of enrollments, and perhaps the number of certificates/credentials granted to date. Assuming we run the survey annually, next year we could put a time frame around it (i.e. "in CY 2022"). Again, curious for @e0d's thoughts.

  3. If it was not a burden on the site operator, would we want updates at some higher frequency (quarterly, monthly)?

At the moment, I think once a year is realistic in terms of what our goals are (an annual impact report), but I also hope this project expands with community involvement, and I can see scenarios where more frequent updates could be of interest to the Marketing WG, for example. So if it's not a burden to site operators, perhaps biannual or quarterly updates as a start?

The purpose of this ticket is to explore solutions for the method by which to autogenerate data, and to propose a recommended method/approach. Based on a brainstorming session during the January 5 Standup, one highly viable approach is to write a script that Providers can embed into each of their Instances.

Getting people to install something is a high barrier to entry for a survey. It might make sense to start with a Google Forms sort of approach for the first iteration of this, and to have a bundled app that operators could choose to opt into starting with the Nutmeg release.

Would the Google Form then be filled out manually for each Instance? I can see that also being a barrier to operators who are running many Instances? Even if we only got a ~10% rate of install in the first go-round, that's still more data than we have now, and would be the bar to raise next year. Maybe there's a hybrid approach where we can give folks the option, either an install or the Google form? And I like the bundled app idea with Nutmeg as a long-term sustainable solution.

ormsbee commented 2 years ago

The general theme with the technical discovery is that we can get rough numbers in a relatively straightforward manner, but that true accuracy involves accounting for a number of edge cases that I don't think are worth it for the first pass at this problem.

  1. Number of unique courses

The fastest and most reliable way to get this is a count on CourseOverview. There are a few caveats here. Just because a course exists doesn't mean that anyone can see it or use it. There are a few fields that can help guide us (start, end, and self_paced), but sometimes courses are created as scratch spaces and might not represent something that's ever seen by students.

Recommended approach: Simple count of CourseOverview rows, and ignore any subtleties about scheduling or enrollments.
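As a rough sketch of what that could look like (run from a Django shell on the LMS; the CourseOverview import path below is my assumption and may vary between releases):

```python
from openedx.core.djangoapps.content.course_overviews.models import CourseOverview

# Simple count of all known courses, deliberately ignoring the scheduling,
# visibility, and scratch-space caveats noted above.
course_count = CourseOverview.objects.count()
print(f"Unique courses: {course_count}")
```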

  2. Total number of learners using the site

This would require a count on the User table. This can also be distorted by banned users (spam accounts), or from dummy-users created for the purposes of an LTI launch where Open edX is an LTI provider. Banned users are an obscure edge case though.

Recommended approach: Simple count of the User model, minus a simple count of the LtiUser model.
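A minimal sketch of that subtraction, assuming the usual edx-platform model locations (paths may differ by release):

```python
from django.contrib.auth import get_user_model
from lms.djangoapps.lti_provider.models import LtiUser  # assumed import path

User = get_user_model()
# Total accounts minus the dummy accounts created for LTI provider launches.
# Banned/spam accounts are not excluded, per the simplification above.
learner_count = User.objects.count() - LtiUser.objects.count()
print(f"Learners: {learner_count}")
```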

  3. Total number of enrollments for all courses

@jmakowski1123: This could be a count of all currently active enrollments, or all enrollments that were ever made. The latter would mean that we'd still count an enrollment if someone enrolled in a course and then unenrolled some time later. When counting all enrollments ever made, we wouldn't double-count re-enrollments–i.e. if someone enrolled in a course, unenrolled, and re-enrolled, that would still count as only one enrollment.

Getting all enrollments that were ever made is slightly cheaper, but both are relatively straightforward to get–it's just a matter of filtering on the is_active field. Please let me know which one you'd like (or if you'd like both).
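Both variants reduce to a one-line ORM query; a sketch, assuming CourseEnrollment lives in the usual student app:

```python
from common.djangoapps.student.models import CourseEnrollment  # assumed path

# All enrollments ever made (unenrolling only flips is_active on the existing
# row, so re-enrollments are not double-counted):
total_enrollments = CourseEnrollment.objects.count()

# Only currently active enrollments:
active_enrollments = CourseEnrollment.objects.filter(is_active=True).count()
```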

  4. Total number of course completions/certificates granted

We can get this from the GeneratedCertificate model, but it's honestly kind of a mess in terms of ensuring accuracy when these are generated. We also have many different "modes" that a certificate can be granted in (e.g. "verified", "masters", "credit"). So it's probably best to get a simple count that is equivalent to "this person passed a course", and not try to dig too far into the types of certificates, the significance of which likely varies from site to site.
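A hedged sketch of that simple count; filtering on the "downloadable" status is my assumption about how issued certificates are typically marked:

```python
from lms.djangoapps.certificates.models import GeneratedCertificate  # assumed path

# "This person passed a course" approximation: count issued certificates
# across all modes, without distinguishing verified/masters/credit/etc.
cert_count = GeneratedCertificate.objects.filter(status="downloadable").count()
print(f"Certificates granted: {cert_count}")
```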

  5. Primary language of instruction

We can get a count of courses by language, but this might be pretty messy and unreliable data. This can be queried using the language field in CourseOverview.

  6. Other languages of instruction

Same approach and caveats as (5).
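For both (5) and (6), one grouped query over CourseOverview.language gives the full breakdown; a sketch, with the caveat that the field is often unset or stale:

```python
from django.db.models import Count
from openedx.core.djangoapps.content.course_overviews.models import CourseOverview

# Courses grouped by declared language; expect NULL/blank and inaccurate values.
language_counts = (
    CourseOverview.objects
    .values("language")
    .annotate(num_courses=Count("id"))
    .order_by("-num_courses")
)
for row in language_counts:
    print(row["language"], row["num_courses"])
```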

ormsbee commented 2 years ago

If we want to do this as a survey app in the Django Admin (accessible by site operators), we'd need the following:

Installation Options

There are two main ways I could see us going with this:

  1. An installable plugin app.
  2. Build it into edx-platform itself.

I actually prefer building this into edx-platform because it is so tightly coupled with that repository (at least for the data being collected here). It needs to directly query a number of edx-platform data models, and we'd want those tests to run during CI to make sure nothing breaks from release to release. It would also be really convenient if, whenever you're looking to deprecate a feature flag, you could add it to the list of things that the survey app scans for. However, doing so would put us in a situation where we wouldn't be getting results back until people started running Nutmeg in the middle of this year (and long after the conference).

An alternative is to initially develop it as a plugin app, but fold it into edx-platform in time for Nutmeg. I really don't think we're going to get many people to install it this way though.
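For reference, the plugin route would lean on edx-platform's plugin-app entry points; a minimal sketch of what a hypothetical "instance_survey" plugin's setup.py might declare (all names here are illustrative):

```python
from setuptools import setup

setup(
    name="instance-survey",          # hypothetical package
    version="0.1.0",
    packages=["instance_survey"],
    entry_points={
        # edx-platform auto-discovers plugin Django apps via this entry point group.
        "lms.djangoapp": [
            "instance_survey = instance_survey.apps:InstanceSurveyConfig",
        ],
    },
)
```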

Options to consider

There are at least two high-level goals for such a script:

  1. Estimate impact (the origin of this story)
  2. Sample the options/configurations being used (useful for DEPR).

I suspect that more people will be willing to give (2) than (1), so it might be worth giving an option to separate the two. I am assuming that this will be strictly opt-in.

ormsbee commented 2 years ago

Would the Google Form then be filled out manually for each Instance? I can see that also being a barrier to operators who are running many Instances? Even if we only got a ~10% rate of install in the first go-round, that's still more data than we have now, and would be the bar to raise next year. Maybe there's a hybrid approach where we can give folks the option, either an install or the Google form? And I like the bundled app idea with Nutmeg as a long-term sustainable solution.

Yes, it would be in this case. But so would the Admin option for sending the data. I suppose we could make a setting that says, "Just always send this information every X months if you haven't already," and default that to False? So most people wouldn't use it, but those that have a hundred sites and want to opt in could.
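A sketch of how that opt-in could look as settings; these names are purely illustrative and don't exist anywhere yet:

```python
# Hypothetical settings; defaults keep the behavior off for everyone.
SURVEY_AUTO_SUBMIT = False             # large multi-site operators could flip this on
SURVEY_AUTO_SUBMIT_INTERVAL_DAYS = 90  # "every X months" if enabled
```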

ormsbee commented 2 years ago

@jmakowski1123: FWIW, I think that we should send this year's survey out via Google Form and have folks fill it in as before, and then target doing this in the Django Admin for Nutmeg. I really can't see folks installing this as a separate plugin in useful numbers–it's just going to be so much faster for them to fill in a form.

My best guess at this is a couple of weeks of work if there's a really bare-bones UI and not counting any analysis work we'd do on the other end. Most of the effort is in the admin interface and making sure we don't bring sites down when running these large queries–though we should probably go through group estimation.

e0d commented 2 years ago

1) I agree that targeting Nutmeg for a better solution makes sense.

2) I think a form for the first take is workable, I'd prefer to use FormAssembly over a Google Form. Better capabilities, also integrates with Google Sheets.

The draft form was built in FormAssembly.

3) Is there a form of technical documentation that we would provide with the form to help people successfully fill it out? How do we help people do this, for example:

Recommended approach: Simple count of the User model, minus a simple count of the LtiUser model.

4) I think we should have a simple "power user" option where they can submit a Google Sheet with the same rows and columns as our authoritative sheet. This would allow, say, eduNEXT to dump all the tenant sites into a single sheet rather than filling out the form 1000 times. This approach increases our work only the tiniest bit.

ormsbee commented 2 years ago
  2. I think a form for the first take is workable, I'd prefer to use FormAssembly over a Google Form. Better capabilities, also integrates with Google Sheets.

Works for me. I default to Google Forms because that's the only thing I've used. Happy to defer to those who have used other products in this area.

  3. Is there a form of technical documentation that we would provide with the form to help people successfully fill it out? How do we help people do this, for example:

Sure, I can give some queries for them to run. It'd be nice if edX could run them on their read replica to test early, but it's not absolutely required. That's probably only a couple hours of actual work with the caveats I put in the recommended queries above. Might take more calendar time if someone at edX is testing and we get weird results that we need to debug.
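Something like the following could be what gets handed to operators: one paste-into-the-Django-shell snippet that prints a single CSV row per site, combining the queries sketched earlier (model paths, the status filter, and the exact columns are all assumptions):

```python
import csv
import sys

from django.conf import settings
from django.contrib.auth import get_user_model

from common.djangoapps.student.models import CourseEnrollment
from lms.djangoapps.certificates.models import GeneratedCertificate
from lms.djangoapps.lti_provider.models import LtiUser
from openedx.core.djangoapps.content.course_overviews.models import CourseOverview

# One row of survey data for this instance, keyed by its URL.
row = {
    "instance_url": settings.LMS_ROOT_URL,
    "courses": CourseOverview.objects.count(),
    "learners": get_user_model().objects.count() - LtiUser.objects.count(),
    "enrollments": CourseEnrollment.objects.count(),
    "certificates": GeneratedCertificate.objects.filter(status="downloadable").count(),
}

writer = csv.DictWriter(sys.stdout, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
```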

@jmakowski1123: Assigning this to you for you to weigh in on. Please feel free to move to "Done" if you're okay with the conclusions here, or assign it back to me if you have feedback, questions, or other areas you feel need further investigation.

Thank you.

jmakowski1123 commented 2 years ago
  2. I think a form for the first take is workable, I'd prefer to use FormAssembly over a Google Form. Better capabilities, also integrates with Google Sheets.

Works for me. I default to Google Forms because that's the only thing I've used. Happy to defer to those who have used other products in this area.

This makes sense to me. I suggest we prune the number and types of questions we ask in the form, in order to make this as easy and quick as possible. Maybe we even limit it to query-based questions for now. Then we can focus on a more well-rounded question set that aligns with the long-term Nutmeg install option.