dzhuang opened this issue 8 years ago
If the process needs to run for 4 seconds (below the default timeout), it's still a nightmare to render the page in place.
> My best idea regarding page rendering time is to …
LGTM. However, I wish to add an option so that we can store the result elsewhere (in MongoDB) besides the cache, and let the instance maintainer decide whether to enable that option. As for my instance, I expect some instructors (who are basically Python beginners) will use it in the upcoming semesters, and I won't be surprised if their code contains `while True:` without a `break` :-) (Of course that should fail validation; this is just an example of how inefficient their code may possibly be.)
As for the `TIMEOUT` of the cache in Django, it means "expire" (per the docs). By default, the TTL is 300 seconds:

> The number of seconds before a cache entry is considered stale. If the value of this setting is None, cache entries will not expire.
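For reference, this is the setting in question (a minimal sketch; the backend and location are placeholders): `TIMEOUT` is the TTL in seconds, 300 by default, and `None` disables expiry entirely.

```python
# settings.py sketch -- backend/location are placeholders, not a recommendation
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        "LOCATION": "relate-page-data",
        "TIMEOUT": None,  # None: cache entries never expire
    }
}
```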
> LGTM
It doesn't yet LGTM.
I'm still uncomfortable with the whole-flow YAML re-parse, though I would like measurement data on how expensive that really is. Right now, we "cache" parsed YAML as JSON, which seems a bit silly... assuming that YAML is not that much slower to parse than JSON. With basic Python-based PyYAML, that may make sense, but as soon as PyYAML starts using libyaml, that extra caching step probably stops making sense. We might want to warn about that.
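To get that measurement data, something along these lines would do (a rough sketch; the sample document is synthetic, and a real flow file should be substituted):

```python
# Check whether PyYAML is backed by libyaml, then roughly compare
# YAML vs. JSON parse times on an equivalent document.
import json
import timeit
import yaml

try:
    from yaml import CSafeLoader as Loader  # present when libyaml is compiled in
    print("PyYAML is using libyaml")
except ImportError:
    from yaml import SafeLoader as Loader   # pure-Python fallback
    print("PyYAML is pure Python")

# Synthetic stand-in for a flow document; substitute real repo content.
sample = {"groups": [{"id": "lp_solve",
                      "pages": [{"type": "TextQuestion", "id": f"q{i}", "value": 5}
                                for i in range(200)]}]}
yaml_text = yaml.dump(sample)
json_text = json.dumps(sample)

print("yaml:", timeit.timeit(lambda: yaml.load(yaml_text, Loader=Loader), number=20))
print("json:", timeit.timeit(lambda: json.loads(json_text), number=20))
```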
Another issue I see with the whole-flow approach is per-page "seed derivation". Suppose we have a per-flow seed. Deriving a seed by page number (as I suggested above) is actually broken: Suppose you add a page, all your page seeds change--that makes no sense. Doing this through the page ID makes more sense. But since we can't parse the YAML yet (at that stage), we'd have to supply the ID to, say, a Jinja macro. This could then expand to something containing the ID, so at least there wouldn't be redundancy, but the result would still not be very transparent to look at.
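For concreteness, a sketch (not RELATE code) of what ID-based derivation could look like; `derive_page_seed` is a hypothetical helper:

```python
# Derive a stable per-page seed from a per-flow seed and the page ID,
# so inserting or removing pages does not shift other pages' seeds.
import hashlib

def derive_page_seed(flow_seed: int, page_id: str) -> int:
    digest = hashlib.sha256(f"{flow_seed}:{page_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Same ID -> same seed, regardless of where the page sits in the flow.
assert derive_page_seed(42, "lp_2_iter") == derive_page_seed(42, "lp_2_iter")
assert derive_page_seed(42, "lp_2_iter") != derive_page_seed(42, "lp_3_iter")
```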
> I wish to add an option that we can store the result elsewhere (in mongodb) besides cache
Wouldn't regular Django caching with a long/"infinite" timeout value do the same thing? The Django cache is just an abstraction; you can have a Mongo backend if you like. I'd prefer sticking with the abstraction, to lessen the first-time installation burden.
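For instance, assuming a configured Django project (the key and value here are hypothetical), a per-entry `timeout=None` already gives non-expiring storage through the abstraction, whatever backend is plugged in:

```python
from django.core.cache import cache

# Hypothetical key and value for a generated page.
generated_data = {"lp_problem_math": r"\max 3x + 2y"}
cache.set("page_data:lp_2_iter", generated_data, timeout=None)  # never expires
assert cache.get("page_data:lp_2_iter") == generated_data
```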
> Wouldn't regular Django caching with a long/"infinite" timeout value do the same thing?
In my current realization, MongoDB is used not only as a cache, but also to store the mapping of `commit_sha` to what I called `context_hash` above (not just key-value storage), so as to identify the need for re-computation (by querying whether that `context_hash` is still the same for the current `commit_sha` after a course revision). Now that you've proposed the "persistent generation server" idea, that might reduce the need to do so.
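To illustrate what I mean (a sketch only; the names and the plain-dict store are mine, and any key-value storage, including the Django cache, would do):

```python
# Recompute only when the hash of the generation inputs changed
# for the current commit.
import hashlib
import json

def compute_context_hash(generation_inputs: dict) -> str:
    canonical = json.dumps(generation_inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def needs_recomputation(store: dict, commit_sha: str,
                        generation_inputs: dict) -> bool:
    new_hash = compute_context_hash(generation_inputs)
    if store.get(commit_sha) == new_hash:
        return False  # same inputs at this revision: reuse stored result
    store[commit_sha] = new_hash
    return True
```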
Another problem I see: if we feed data into the page template at the flow level (whole-flow approach), then we are not allowed to have a flow containing pages like the following:
```yaml
groups:
-
    id: lp_solve
    shuffle: True
    pages:
    -
        generate_page_data:
            files:
                - lp_gen.py
                - my_lp.tpl
            random_data_set_file: lp_with_2_iterations.bin
            imports:
                - lpmodel.py
                - linprog.py
        ---
        type: TextQuestion
        id: lp_2_iter
        value: 5
        prompt: |
            # Solve the following linear programming problem
            {{ lp_problem_math }}
            The maximum value is:
        answers: {{ yaml_answers }}
    -
        generate_page_data:
            files:
                - lp_gen.py
                - my_lp.tpl
            random_data_set_file: lp_with_3_iterations.bin
            imports:
                - lpmodel.py
                - linprog.py
        ---
        type: TextQuestion
        id: lp_3_iter
        value: 5
        prompt: |
            # Solve the following linear programming problem
            {{ lp_problem_math }}
            The maximum value is:
        answers: {{ yaml_answers }}
```
If I'm right, we are not allowed to use the same variable name (which is expected to have a different value per page) across pages, and the consequence is that files which are supposed to be re-used across pages would need to be customized flow-wise. (See the small demonstration below.)
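A demonstration of the collision, using plain jinja2 rather than RELATE: with whole-flow expansion there is a single template context, so `lp_problem_math` can only carry one value for the entire flow, even though `lp_2_iter` and `lp_3_iter` need different problems.

```python
from jinja2 import Template

# Both pages' prompts live in one flow-level template, sharing one context.
flow_template = Template("""
lp_2_iter prompt: {{ lp_problem_math }}
lp_3_iter prompt: {{ lp_problem_math }}
""")

# Only one binding is possible per expansion; both pages get "problem A".
print(flow_template.render(lp_problem_math="problem A"))
```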
> Another issue I see with the whole-flow approach is per-page "seed derivation"...
In that regard, the issue I mentioned above doesn't exist.
After diving into the code, I gradually understand your intention. The best place to inject the calculation result is during the first-round YAML expansion. However, there's no request and no page instance at that stage. If that is not done, the injection would have to happen in the second-round expansion, with those variables (Jinja literals) wrapped in `{% raw %}` and `{% endraw %}`, and with the `page_id` unknown (so the "pure" data has nowhere to be saved, if we want to save it), since the page YAML is not parsed yet. I can now better understand the difficulty you were concerned about.
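To illustrate the two-round problem with plain jinja2 (a sketch, outside of RELATE): variables meant for the second round must survive the first round wrapped in `{% raw %}`/`{% endraw %}`.

```python
from jinja2 import Template

source = """
prompt: |
    # Solve the following linear programming problem
    {% raw %}{{ lp_problem_math }}{% endraw %}
"""

# First-round expansion: {% raw %} keeps {{ lp_problem_math }} intact.
first_round = Template(source).render()
assert "{{ lp_problem_math }}" in first_round

# Second-round expansion: the generated data can be substituted in now,
# but at this point the page YAML is not parsed yet, so page_id (and
# with it, a place to persist the "pure" data) is still unknown.
second_round = Template(first_round).render(lp_problem_math=r"\max 3x + 2y")
print(second_round)
```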
There's also another difficulty in terms of course revision: if the "pure" question data is not saved during `flow_start` for the page, I have no idea how to get that (previously used) data back for re-computation. If we want to save it in the page's `FlowPageData` instance, where and how?
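One conceivable place would be an extra key in the page's JSON `data` field, assuming `FlowPageData` tolerates that without breaking existing pages (a sketch only, not a tested change):

```python
def save_pure_data(page_data, pure_data):
    # page_data: a FlowPageData instance; pure_data: the generated inputs.
    # "_generated_pure_data" is a hypothetical key of my own choosing.
    stored = page_data.data or {}
    stored["_generated_pure_data"] = pure_data
    page_data.data = stored
    page_data.save()
```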
I think the idea of injecting the generated data into the YAML template (at the Jinja expansion stages) is almost impossible in terms of page revision: if we change the generation code (i.e., how the data injected into the template is generated) of an existing page, we have no way to know what specific "pure" data was used by the page.
Is there any progress we can make (or any new agreement)?
Hi @mwest1066, out of curiosity: how do you guys separate privileges between the app server and instructor code in PL? Something like this?
We used to run instructor code in-process, which was fast and easy but we had a few cases where someone wrote some bad code with the expected results :-)
Now we run instructor code in separate worker processes that we explicitly launch and control. We don't rely on language sandboxing (tried it, but it was pretty messy and a bit fragile), and instead use the usual OS process-level protections. The worker processes run on the same machines as the main server processes and communicate with the server processes via a simple JSON message protocol. Our security model for this is protecting against accidental instructor code errors. We are not attempting to protect against arbitrary attacks from instructor code.
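For the curious, the general shape of that pattern looks something like this (a minimal sketch, not PrairieLearn's actual code): a worker process speaking newline-delimited JSON over stdin/stdout, so instructor code runs under OS process-level protections rather than in the server process.

```python
import json
import subprocess
import sys

def run_worker():
    # Worker loop: one JSON request per line, one JSON response per line.
    for line in sys.stdin:
        request = json.loads(line)
        result = {"echo": request}  # instructor code would run here
        print(json.dumps(result), flush=True)

def call_worker(payload: dict, timeout: float = 4.0) -> dict:
    # Server side: launch a worker, send one request, enforce a timeout.
    proc = subprocess.Popen(
        [sys.executable, __file__, "--worker"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(json.dumps(payload) + "\n", timeout=timeout)
    finally:
        proc.kill()
    return json.loads(out)

if __name__ == "__main__":
    if "--worker" in sys.argv:
        run_worker()
    else:
        print(call_worker({"task": "grade", "answer": "42"}))
```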
For student-provided code we go to a higher isolation level and only run that in a very restricted docker container on a separate fleet of machines.
I'd be happy to discuss details of what we do and our experiences of the tradeoffs involved if you are interested. Maybe over coffee sometime next week?
I've cooked up some customized pages with randomly selected questions from a repo data file. Currently, it works well for human-graded questions, and only for that kind of question.

What should I do if I want to create automatically graded pages, for example a `TextQuestion`, with randomly selected questions? I can manage to generate the correct answers for all randomly selected questions in that repo file, just as I generate the question body. But it seems that, currently, the flow page creation process and the grading process use `page.page_desc`, which passed validation and which is hard-coded in the YAML files. One more problem: all my questions and answers are generated depending on `page_context` and `page_data.data`, and it seems that all validation is currently done without/before those contexts are created. Is it possible to extend the current process without affecting existing pages? Or what can I do to change my pages to fit the current process? I need your suggestions.