dzhuang opened this issue 8 years ago
If the process needs to run for 4 seconds (below the default timeout), it's still a nightmare to render the page in place.
> My best idea regarding page rendering time is to …
LGTM. However, I wish to add an option so that we can store the result elsewhere (in MongoDB) besides the cache, and let the instance maintainer decide whether to enable that option. As for my instance, I expect some instructors (who are basically Python beginners) will use it in the upcoming semesters, and I won't be surprised if their code contains `while True:` without a `break` :-) (Of course that should fail validation; this is just an example of how inefficient their code may possibly be.)
As for the `TIMEOUT` of the cache in Django, it means "expire" (per the docs). By default, the TTL is 300 seconds:

> The number of seconds before a cache entry is considered stale. If the value of this setting is None, cache entries will not expire.
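For reference, this is the setting in question (a minimal sketch; the backend and location are placeholders): `TIMEOUT` is the TTL in seconds, 300 by default, and `None` disables expiry entirely.

```python
# settings.py sketch -- backend/location are placeholders, not a recommendation
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
        "LOCATION": "relate-page-data",
        "TIMEOUT": None,  # None: cache entries never expire
    }
}
```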
> LGTM
It doesn't yet LGTM.
I'm still uncomfortable with the whole-flow YAML re-parse, though I would like measurement data on how expensive that really is. Right now, we "cache" parsed YAML as JSON, which seems a bit silly... assuming that YAML is not that much slower to parse than JSON. With basic Python-based PyYAML, that may make sense, but as soon as PyYAML starts using libyaml, that extra caching step probably stops making sense. We might want to warn about that.
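To get that measurement data, something along these lines would do (a rough sketch; the sample document is synthetic, and a real flow file should be substituted):

```python
# Check whether PyYAML is backed by libyaml, then roughly compare
# YAML vs. JSON parse times on an equivalent document.
import json
import timeit
import yaml

try:
    from yaml import CSafeLoader as Loader  # present when libyaml is compiled in
    print("PyYAML is using libyaml")
except ImportError:
    from yaml import SafeLoader as Loader   # pure-Python fallback
    print("PyYAML is pure Python")

# Synthetic stand-in for a flow document; substitute real repo content.
sample = {"groups": [{"id": "lp_solve",
                      "pages": [{"type": "TextQuestion", "id": f"q{i}", "value": 5}
                                for i in range(200)]}]}
yaml_text = yaml.dump(sample)
json_text = json.dumps(sample)

print("yaml:", timeit.timeit(lambda: yaml.load(yaml_text, Loader=Loader), number=20))
print("json:", timeit.timeit(lambda: json.loads(json_text), number=20))
```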
Another issue I see with the whole-flow approach is per-page "seed derivation". Suppose we have a per-flow seed. Deriving a seed by page number (as I suggested above) is actually broken: Suppose you add a page, all your page seeds change--that makes no sense. Doing this through the page ID makes more sense. But since we can't parse the YAML yet (at that stage), we'd have to supply the ID to, say, a Jinja macro. This could then expand to something containing the ID, so at least there wouldn't be redundancy, but the result would still not be very transparent to look at.
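For concreteness, a sketch (not RELATE code) of what ID-based derivation could look like; `derive_page_seed` is a hypothetical helper:

```python
# Derive a stable per-page seed from a per-flow seed and the page ID,
# so inserting or removing pages does not shift other pages' seeds.
import hashlib

def derive_page_seed(flow_seed: int, page_id: str) -> int:
    digest = hashlib.sha256(f"{flow_seed}:{page_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Same ID -> same seed, regardless of where the page sits in the flow.
assert derive_page_seed(42, "lp_2_iter") == derive_page_seed(42, "lp_2_iter")
assert derive_page_seed(42, "lp_2_iter") != derive_page_seed(42, "lp_3_iter")
```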
> I wish to add an option that we can store the result elsewhere (in mongodb) besides cache
Wouldn't regular Django caching with a long/"infinite" timeout value do the same thing? The Django cache is just an abstraction; you can have a Mongo backend if you like. I'd prefer sticking with the abstraction, to lessen the first-time installation burden.
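For instance, assuming a configured Django project (the key and value here are hypothetical), a per-entry `timeout=None` already gives non-expiring storage through the abstraction, whatever backend is plugged in:

```python
from django.core.cache import cache

# Hypothetical key and value for a generated page.
generated_data = {"lp_problem_math": r"\max 3x + 2y"}
cache.set("page_data:lp_2_iter", generated_data, timeout=None)  # never expires
assert cache.get("page_data:lp_2_iter") == generated_data
```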
> Wouldn't regular Django caching with a long/"infinite" timeout value do the same thing?
In my current realization, MongoDB is used not only as a cache, but also to store the mapping of `commit_sha` to what I called `context_hash` above (not just key-value storage), so as to identify the need for re-computation (by querying whether that `context_hash` is still the same for the current `commit_sha` after a course revision). Now that you've proposed the "persistent generation server" idea, that might reduce the need to do so.
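To illustrate what I mean (a sketch only; the names and the plain-dict store are mine, and any key-value storage, including the Django cache, would do):

```python
# Recompute only when the hash of the generation inputs changed
# for the current commit.
import hashlib
import json

def compute_context_hash(generation_inputs: dict) -> str:
    canonical = json.dumps(generation_inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def needs_recomputation(store: dict, commit_sha: str,
                        generation_inputs: dict) -> bool:
    new_hash = compute_context_hash(generation_inputs)
    if store.get(commit_sha) == new_hash:
        return False  # same inputs at this revision: reuse stored result
    store[commit_sha] = new_hash
    return True
```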
Another problem I see: if we feed data into the page template at the flow level (whole-flow approach), then we are not allowed to have a flow containing pages like the following:
```yaml
groups:
-
    id: lp_solve
    shuffle: True
    pages:
    -
        generate_page_data:
            files:
                - lp_gen.py
                - my_lp.tpl
            random_data_set_file: lp_with_2_iterations.bin
            imports:
                - lpmodel.py
                - linprog.py
        ---
        type: TextQuestion
        id: lp_2_iter
        value: 5
        prompt: |
            # Solve the following linear programming problem
            {{ lp_problem_math }}
            The maximum value is:
        answers: {{ yaml_answers }}
    -
        generate_page_data:
            files:
                - lp_gen.py
                - my_lp.tpl
            random_data_set_file: lp_with_3_iterations.bin
            imports:
                - lpmodel.py
                - linprog.py
        ---
        type: TextQuestion
        id: lp_3_iter
        value: 5
        prompt: |
            # Solve the following linear programming problem
            {{ lp_problem_math }}
            The maximum value is:
        answers: {{ yaml_answers }}
```
If I'm right, we are not allowed to use the same variable name (which is expected to have a different value per page) across pages, and the consequence is that files which are supposed to be re-used across pages would need to be customized flow-wise. (See the small demonstration below.)
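A demonstration of the collision, using plain jinja2 rather than RELATE: with whole-flow expansion there is a single template context, so `lp_problem_math` can only carry one value for the entire flow, even though `lp_2_iter` and `lp_3_iter` need different problems.

```python
from jinja2 import Template

# Both pages' prompts live in one flow-level template, sharing one context.
flow_template = Template("""
lp_2_iter prompt: {{ lp_problem_math }}
lp_3_iter prompt: {{ lp_problem_math }}
""")

# Only one binding is possible per expansion; both pages get "problem A".
print(flow_template.render(lp_problem_math="problem A"))
```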
> Another issue I see with the whole-flow approach is per-page "seed derivation"...
In that regard, the issue I mentioned above doesn't exist.
After diving into the code, I gradually understand your intention. The best place to inject the calculation result is during the first-round YAML expansion. However, there's no request and no page instance at that stage. If that is not done, the injection would have to happen in the second-round expansion, with those variables (Jinja literals) wrapped in `{% raw %}` and `{% endraw %}`, and with the `page_id` unknown (so the "pure" data has nowhere to be saved, if we want to save it), since the page YAML is not parsed yet. I can now better understand the difficulty you were concerned about.
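To illustrate the two-round problem with plain jinja2 (a sketch, outside of RELATE): variables meant for the second round must survive the first round wrapped in `{% raw %}`/`{% endraw %}`.

```python
from jinja2 import Template

source = """
prompt: |
    # Solve the following linear programming problem
    {% raw %}{{ lp_problem_math }}{% endraw %}
"""

# First-round expansion: {% raw %} keeps {{ lp_problem_math }} intact.
first_round = Template(source).render()
assert "{{ lp_problem_math }}" in first_round

# Second-round expansion: the generated data can be substituted in now,
# but at this point the page YAML is not parsed yet, so page_id (and
# with it, a place to persist the "pure" data) is still unknown.
second_round = Template(first_round).render(lp_problem_math=r"\max 3x + 2y")
print(second_round)
```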
There's also another difficulty in terms of course revision: if the "pure" question data is not saved during `flow_start` for the page, I have no idea how to get that (previously used) data back for re-computation. If we want to save it in the page's `FlowPageData` instance, where and how?
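One conceivable place would be an extra key in the page's JSON `data` field, assuming `FlowPageData` tolerates that without breaking existing pages (a sketch only, not a tested change):

```python
def save_pure_data(page_data, pure_data):
    # page_data: a FlowPageData instance; pure_data: the generated inputs.
    # "_generated_pure_data" is a hypothetical key of my own choosing.
    stored = page_data.data or {}
    stored["_generated_pure_data"] = pure_data
    page_data.data = stored
    page_data.save()
```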
I think the idea of injecting the generated data into the YAML template (at the Jinja expansion stages) is almost impossible in terms of page revision: if we change the generation code (i.e., how the data injected into the template is generated) of an existing page, we have no way to know what specific "pure" data was used by the page.
Is there any progress we can make (or any new agreement)?
Hi @mwest1066, out of curiosity: how do you guys separate privileges between the app server and instructor code in PL? Something like this?
We used to run instructor code in-process, which was fast and easy but we had a few cases where someone wrote some bad code with the expected results :-)
Now we run instructor code in separate worker processes that we explicitly launch and control. We don't rely on language sandboxing (tried it, but it was pretty messy and a bit fragile), and instead use the usual OS process-level protections. The worker processes run on the same machines as the main server processes and communicate with the server processes via a simple JSON message protocol. Our security model for this is protecting against accidental instructor code errors. We are not attempting to protect against arbitrary attacks from instructor code.
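For the curious, the general shape of that pattern looks something like this (a minimal sketch, not PrairieLearn's actual code): a worker process speaking newline-delimited JSON over stdin/stdout, so instructor code runs under OS process-level protections rather than in the server process.

```python
import json
import subprocess
import sys

def run_worker():
    # Worker loop: one JSON request per line, one JSON response per line.
    for line in sys.stdin:
        request = json.loads(line)
        result = {"echo": request}  # instructor code would run here
        print(json.dumps(result), flush=True)

def call_worker(payload: dict, timeout: float = 4.0) -> dict:
    # Server side: launch a worker, send one request, enforce a timeout.
    proc = subprocess.Popen(
        [sys.executable, __file__, "--worker"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(json.dumps(payload) + "\n", timeout=timeout)
    finally:
        proc.kill()
    return json.loads(out)

if __name__ == "__main__":
    if "--worker" in sys.argv:
        run_worker()
    else:
        print(call_worker({"task": "grade", "answer": "42"}))
```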
For student-provided code we go to a higher isolation level and only run that in a very restricted docker container on a separate fleet of machines.
I'd be happy to discuss details of what we do and our experiences of the tradeoffs involved if you are interested. Maybe over coffee sometime next week?
I've cooked up some customized pages with randomly selected questions from a repo data file. Currently, it works well for human-graded questions, and only for that kind of question.

What should I do if I want to create automatically graded pages, for example a `TextQuestion`, with randomly selected questions? I can manage to generate the correct answers for all randomly selected questions in that repo file, just as I generate the question body. But it seems that, currently, the flow page creation process and the grading process use `page.page_desc`, which passed validation and which is hard-coded in the YAML files. One more problem: all my questions and answers are generated depending on `page_context` and `page_data.data`, and it seems that all validation is currently done without/before those contexts are created. Is it possible to extend the current process without affecting existing pages? Or what can I do to change my pages to fit the current process? I need your suggestions.