inducer / relate

RELATE is an Environment for Learning And TEaching
http://documen.tician.de/relate

Automatic generation of page variants #243

Open dzhuang opened 8 years ago

dzhuang commented 8 years ago

I've cooked up some customized pages with randomly selected questions from a repo data file. Currently it works well for human-graded questions, and it only works for that kind of question.

What should I do if I want to create automatically graded pages, for example a TextQuestion, with randomly selected questions? I can manage to generate all types of correct answers for all randomly selected questions in that repo file, just like how I generate the question body. But it seems the current flow page creation and grading processes use page.page_desc, which passed validation and which is hard-coded in the YAML files. One more problem: all my questions and answers are generated depending on page_context and page_data.data, and it seems the validation processes are all done without/before those contexts are created.

Is it possible to extend the current process without affecting existing pages? Or what can I do to change my pages to fit the current process? I need your suggestions.

inducer commented 8 years ago

Hmm, I'm not sure I follow what you're saying. If you'd like to randomly pick questions from a pool, then I'd recommend you use question groups along with the shuffle and max_page_count attributes:
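
A minimal sketch of that in a flow's YAML (the group id and the sample question are made up for illustration):

groups:
-   id: quiz_pool
    shuffle: True
    max_page_count: 2
    pages:
    -   type: TextQuestion
        id: q1
        value: 1
        prompt: |
            # Question 1
            What is $1+1$?
        answers:
        - <plain> 2
    # ... more pages; each student gets 2 of them, shuffled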

Or am I misunderstanding what you're trying to do?

dzhuang commented 8 years ago

I wish to do it in a different way.

The pool is in a data file in the repo. Rather than coding the question and answer in the YAML, I wish to use some Python snippets and Jinja templates, which are also in the repo, to generate the question body and the correct answer of the question. Each time a new page is created, the initialize_page_data method will pick an entry randomly from the data file, generate the question body along with the correct answer of the question, and store the selected data in the page_data. When that page is opened, the body and correct answer will be rendered by the Python snippets. The benefit is that I can add new questions to the pool simply by adding new data to the question data file, and if I need to change the rendering of the question body/correct answers, I can just edit the Jinja template.
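
A standalone sketch of that idea; initialize_page_data matches the hook named above, but the pool structure and helper names are hypothetical:

import random
from jinja2 import Template

def initialize_page_data(pool):
    # pool: a list of dicts loaded from a repo data file
    return {"question_data": random.choice(pool)}

def render_body(template_text, page_data):
    # Re-render the body from the stored pick on every view.
    return Template(template_text).render(**page_data["question_data"])

pool = [{"x": 3, "answer": "4"}, {"x": 7, "answer": "8"}]
data = initialize_page_data(pool)
print(render_body("Consider data {{ x }}...", data))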

inducer commented 8 years ago

I see. I think I understand better now what you're trying to do. I would, however, prefer to build the automatic generation process into initialize_page_data, to avoid the pre-generation step that you currently have to go through. Also, when thinking about the page API, I've always thought of page_desc as just the YAML chunk corresponding to the page, and I didn't think this would (or should) change. Instead, the make_* page API functions should be extended to insert data into page_desc on the fly. Markup expansion for the question pieces already happens under the control of that code, so in principle it should be simple to just make some more variables available. I'm thinking something like:

type: TextQuestion # or maybe GeneratedTextQuestion?
generation: |
    answers = [{"type": float, ...}]
    markup_variables = {"x": ...}
    # (These would get stuffed into page_data)
prompt: |

    # Compute when Monsoon Season Happens

    Consider data {{x}}...

What do you think?

dzhuang commented 8 years ago

I need a while to fully understand your suggestions. Many thanks.

dzhuang commented 7 years ago

A question: if the page_desc is changed on the fly, it might also face the problem that it won't validate any more. For example, the inline question needs to validate the Struct of the answers when initialized. If the page needs to generate the questions and answers on the fly, how do we make sure the struct generated on the fly will still pass validation?

inducer commented 7 years ago

A question: if the page_desc is changed on the fly, it might also face the problem that it won't validate any more.

Yes. I'm imagining we could deal with that by

dzhuang commented 7 years ago

And that means we'll have to provide a fake page_desc which contains all the required attributes so as to pass the initial validation.

Thanks for your explanation!

dzhuang commented 6 years ago

Hi @inducer, I have been working on this for a long time and it is almost finished, with tests. Now I am stuck on the naming of the base class, which can be mixed in with all other built-in page classes. Currently, I named the base class RandomQuestionBase, and a subclass is created like this:

class RandomInlineMultiQuestion(RandomQuestionBase, InlineMultiQuestion):
    ...

Is that acceptable?

dzhuang commented 6 years ago

P.S., I used your docker runpy setup to generate the questions in Python via Jinja2.

inducer commented 6 years ago

How about GeneratedQuestionBase?

dzhuang commented 6 years ago

That LGTM, thanks.

inducer commented 6 years ago

Or VariantGeneratingQuestionBase?

dzhuang commented 6 years ago

That's also OK for me. Taking subclassing into account, which one do you prefer?

inducer commented 6 years ago

I think naming-wise it's smart to have "variant" in there to emphasize that the point is not to generate questions, but to generate many versions of the same (type of) question.

inducer commented 6 years ago

I'm generally biased in favor of long, clear names. I know that's a matter of taste though, and I am somewhat capable of compromise. :)

dzhuang commented 6 years ago

That sounds reasonable. I'll submit a PR after #378 is merged.

dzhuang commented 6 years ago

How about all subclasses of VariantGeneratingQuestionBase taking a Variant prefix?

class VariantInlineMultiQuestion(VariantGeneratingQuestionBase, InlineMultiQuestion):
        pass

inducer commented 6 years ago

I'd be OK with a Variant prefix.

dzhuang commented 6 years ago

Then what about this (I know I should write some docs, but maybe you can give some advice at this stage):

        type: VariantInlineMultiQuestion
        id: rand
        value: 3
        access_rules:
            add_permissions:
                - see_correctness
                - change_answer

        prompt: |
            # variant question test

        data_files:
            - question-data/variant-page-test/jinja_env.py
            - question-data/variant-page-test/pyprog.py
            - question-data/variant-page-test/test_data_01.bin
            - question-data/variant-page-test/test_variant_runpy.py
            - question-data/variant-page-test/test_template.tpl
        excluded_cache_key_files:
            - question-data/variant-page-test/pyprog.py
        random_question_data_file: question-data/variant-page-test/test_data_01.bin

        runpy_context:
            display_markdown: True

        common_code: |

            from io import BytesIO
            import pickle
            import json

            prog = data_files["question-data/variant-page-test/pyprog.py"].decode("utf8")
            exec(prog)
            jinja_env_str = data_files["question-data/variant-page-test/jinja_env.py"].decode("utf8")
            exec(jinja_env_str)
            template_file = data_files[
                "question-data/variant-page-test/test_template.tpl"].decode("utf8")
            template = jinja_env.from_string(template_file)
            bio = BytesIO(data_files["question_data"])

            d = pickle.load(bio)

            o = PlusFirstElement(**d)
            q = "$%s=$" % " + ".join([str(i) for i in [o.x, o.y, o.z]])
            answer = o.solve()

            try:
                runpy_context
            except NameError:
                runpy_context = {}

            display_markdown = runpy_context.get("display_markdown", False)

            blank_description = (
                u"the result:"
            )

            pre_description = u"What is the result?"

            prompt_tex = template.render(
                show_question=True,
                show_answer=False,
                pre_description=pre_description,
                q=q,
            )
            if display_markdown:
                prompt_tex += "[prompt](http://prompt.example.com)"

            question_tex = template.render(
                show_question=False,
                show_answer=False,
                blank_description=blank_description,
                show_blank=True,
                q=q,
            )
            if display_markdown:
                question_tex += "[question](http://question.example.com)"

            answers_tex = template.render(
                show_question=False,
                show_answer=False,
                show_blank_answer=True,
                blank_description=blank_description,
                q=q,
                answer=answer,
            )

            answer_explanation_tex = template.render(
                show_answer_explanation=True,
                answer="$%s$" % answer,
            )
            if display_markdown:
                answer_explanation_tex += "[explanation](http://explanation.example.com)"

        prompt_generating_code: |
            runpy_result.add_result("prompt", prompt_tex)

        question_generating_code: |
            runpy_result.add_result("question", question_tex)

        answers_generating_code: |
            runpy_result.add_result("answers", answers_tex)

        answer_explanation_generating_code: |
            runpy_result.add_result("answer_explanation", answer_explanation_tex)

All the *_generating_code attributes will be concatenated with common_code and then sent in requests to the runpy container, which returns runpy_result (a dict). With that dict, we update the page_desc.

I put all related files in the data_files attribute, including some scripts to be executed. All files (except random_question_data_file, which will be used to generate numerous variant pages, and except those explicitly named in excluded_cache_key_files) and some attributes (for example, runpy_context) are taken to build a hash, named context_hash. Each page has a question_data which is picked randomly from random_question_data_file. question_data and context_hash are then used to generate an MD5 string, which is then used to build a cache key.
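
A hypothetical sketch of that key construction (the function names are invented; only the hashing scheme follows the description above):

import hashlib

def build_context_hash(data_files, excluded_cache_key_files, runpy_context):
    # data_files: mapping of repo path -> raw file bytes
    h = hashlib.md5()
    for name in sorted(data_files):
        if name in excluded_cache_key_files:
            continue
        h.update(name.encode())
        h.update(data_files[name])
    # Fold in page attributes that affect generation.
    h.update(repr(sorted(runpy_context.items())).encode())
    return h.hexdigest()

def build_cache_key(context_hash, question_data_bytes):
    # question_data_bytes: the randomly picked entry for this page
    return hashlib.md5(
        context_hash.encode() + question_data_bytes).hexdigest()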

dzhuang commented 6 years ago

The following problems are considered:

  1. The content of the on-the-fly page_desc is cached so as to speed up page loading. Unlike the page cache, this cache is semi-disjoint from commit_sha when commit_sha changes.

    • When commit_sha changes while context_hash (representing the actual page_desc before generating the on-the-fly page_desc) doesn't change, the result is read from the cache. If it is not cached, it is read from a MongoDB collection (the result was saved to the collection on a previous visit or by some warm-up mechanism).
    • If context_hash changes when commit_sha changes, context_hash will be recalculated (only once) and shared with all pages using that same hash (through an entry in another MongoDB collection). To save that information, i.e., context_hash and question_data, a context_hash_id is saved in page_context_page_data. Thus the API needs an update_page_data method to support this.
  2. Warm-up mechanism. Because the on-the-fly page_desc is generated by pure Python (with Jinja2, which requires the runpy image to install more packages), we sometimes need to perform some expensive calculation. Although there are the cache and MongoDB, warming up before actual page visits is good for the students' experience. Currently, my idea is to use the sandbox to do the warm-up. An attribute in page_desc called warmup_by_sandbox, defaulting to True, is designed for instructors to copy the YAML content into the sandbox and warm up all questions by saving all generated page_descs into the MongoDB collection.

inducer commented 6 years ago

Not sure I like this design a lot. I'd prefer it if some code could run that could inject some data into the page Jinja template (which then gets parsed as YAML). Ideally, that same data would also be injected into the markup templates. I suspect that would let us do everything we want; it's generic in terms of page types. I imagine that the code would run once and the injected data stored in the database, perhaps as part of page_info. That means that we're likely re-rendering that template once per student, but such is life.

Unfortunately, there's one chicken-and-egg problem. That code can't really live in a YAML attribute, because it needs to be findable before YAML parsing is complete. But I'm sure we can come up with some syntax that makes this half palatable.

The other question is how trusted the page generation code really is. Technically, code from the course repo right now can run with the privileges of the app. (You can have page subclasses in the repo.) Long-term, that's probably a bad idea, and so we'd need a sandboxed design like what you're showing. The problem with that, as you mention, is performance.

dzhuang commented 6 years ago

I did consider a solution like yours, but its limitation is obvious. The data needs to be serializable when saved in the database, or the data needs to be serialized before pickling so that it can be loaded and then injected into the template. In other words, that makes it impossible to render the Jinja2 template using custom kinds of variables (objects) that are not known to RELATE. Am I right?

dzhuang commented 6 years ago

Here is another version, which puts all the *_generating_code into a runpy_file:

        type: VariantInlineMultiQuestion
        id: rand1
        value: 3
        access_rules:
            add_permissions:
                - see_correctness
                - change_answer

        prompt: |
            # variant question test1

        data_files:
            - question-data/variant-page-test/jinja_env.py
            - question-data/variant-page-test/pyprog.py
            - question-data/variant-page-test/test_data_01.bin
            - question-data/variant-page-test/test_variant_runpy.py
            - question-data/variant-page-test/test_template.tpl

        excluded_cache_key_files:
            - question-data/variant-page-test/pyprog.py

        random_question_data_file: question-data/variant-page-test/test_data_01.bin

        runpy_file: question-data/variant-page-test/test_variant_runpy.py

        runpy_context:
            display_markdown: True
            pre_description: blablabla

with the content of the runpy file:

from io import BytesIO
import pickle

try:
    runpy_context
except NameError:
    runpy_context = {}

prog = data_files["question-data/variant-page-test/pyprog.py"].decode("utf8")
exec(prog)
jinja_env_str = data_files["question-data/variant-page-test/jinja_env.py"].decode("utf8")
exec(jinja_env_str)
template_file = data_files[
    "question-data/variant-page-test/test_template.tpl"].decode("utf8")
template = jinja_env.from_string(template_file)
bio = BytesIO(data_files["question_data"])

d = pickle.load(bio)

o = PlusFirstElement(**d)
q = "$%s=$" % " + ".join([str(i) for i in [o.x, o.y, o.z]])
answer = o.solve()

display_markdown = runpy_context.get("display_markdown", False)

blank_description = (
    u"the result:"
)

pre_description = runpy_context.get("pre_description", u"What is the result?")

prompt_tex = template.render(
    show_question=True,
    show_answer=False,
    pre_description=pre_description,
    q=q,
)
if display_markdown:
    prompt_tex += "[prompt](http://prompt.example.com)"
runpy_result.add_result("prompt", prompt_tex)

question_tex = template.render(
    show_question=False,
    show_answer=False,
    blank_description=blank_description,
    show_blank=True,
    q=q,
)
if display_markdown:
    question_tex += "[question](http://question.example.com)"
runpy_result.add_result("question", question_tex)

answers_tex = template.render(
    show_question=False,
    show_answer=False,
    show_blank_answer=True,
    blank_description=blank_description,
    q=q,
    answer=answer,
)
runpy_result.add_result("answers", answers_tex)

answer_explanation_tex = template.render(
    show_answer_explanation=True,
    answer="$%s$" % answer,
)
if display_markdown:
    answer_explanation_tex += "[explanation](http://explanation.example.com)"
runpy_result.add_result("answer_explanation", answer_explanation_tex)

dzhuang commented 6 years ago

As for

The other question is how trusted the page generation code really is

Is that the same problem we've encountered for custom Jinja macros?

dzhuang commented 6 years ago

The major defect of my idea is that each question requires a runpy run before being cached. Thus, if not all the questions are cached, the server might temporarily go down when there is a huge number of visits to that page in a short time.

However, it brings convenience when modifying the question, because we only store raw question data in the database, rather than the result after calculation, as compared to the way you suggested above. If we only store the serializable calculation result, which is supposed to be used to render the Jinja template, it can be a nightmare when we want to modify an ongoing page (for example, do some more calculation and add more variables to the template).

inducer commented 6 years ago

I did consider a solution like yours, but its limitation is obvious. The data needs to be serializable when saved in the database, or the data needs to be serialized before pickling so that it can be loaded and then injected into the template. In other words, that makes it impossible to render the Jinja2 template using custom kinds of variables (objects) that are not known to RELATE. Am I right?

Yes, but I don't consider serializability to JSON/strings a major constraint. What type of data are you looking to hand to a template? Plus, if you're concerned about not trusting course code, then you can't use pickle (because that's as good as local code execution).
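
A quick illustration of why unpickling untrusted bytes amounts to code execution (the payload here is deliberately harmless):

import os
import pickle

class Evil:
    def __reduce__(self):
        # pickle will call os.system("echo pwned") on load
        return (os.system, ("echo pwned",))

payload = pickle.dumps(Evil())
pickle.loads(payload)  # prints "pwned": attacker-chosen code ran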

each question requires a runpy run before being cached.

My notion was that code would run at flow start, to create page_data. But I agree, there is a scalability concern if we do go the container route.

Here is another version, which put all *_generating_code in to a runpy_file:

I'm not likely to be a fan of a solution that spans multiple files.

dzhuang commented 6 years ago

For example, in my course Operations Research, one part is linear programming, which spans 3 chapters and 6 weeks. I used the following code (just a glimpse) to initialize and solve a linear programming problem.

class LP(object):
    def __init__(self, qtype="max", goal=None, constraints=None, x="x", x_list=None, sign=None, z="Z",
                 sign_str=None, dual=False,
                 sensitive={}, required_solve_status=[0, 1, 2, 3],
                 start_tableau=None, start_basis=None
                 ):

        self.goal = copy.deepcopy(goal)
        self.constraints = copy.deepcopy(constraints)
        self.x = x
        self.x_list = x_list
        self.z = z

        self.base_list = []
        self.tableau_list = []
        self.res = []
        self.fun = []
        self.tableau_origin = None

        self.solutionCommon = None
        self.solutionPhase1 = None
        self.has_phase_2 = True
        self.solutionPhase2 = None
        self.solve_status = None
        self.solve_status_reason = ""
        self.solve_status_message = ""
        self.need_artificial_variable = False

        # for dual simplex
        self.dual_opt_solution_str_list = []
        self.dual_opt_solution_list = []

        # for sensitivity analysis
        self.sa_result = []

        self.start_tableau = start_tableau
        self.start_basis = start_basis
        self.opt_x = []
        self.opt_value = []

    def solve(self, disp=False, method="simplex"):
        if method == "two_stage":
            klass = LP2StageSolver
        elif method == "big_m":
            klass = LPBigMSolver
        elif method == "modified_simplex":
            klass = LPModifiedSimplexSolver
        (self.solutionPhase1, self.has_phase_2, self.solutionPhase2,
         self.solve_status, self.solve_status_reason, self.solve_status_message,
         self.need_artificial_variable, self.opt_x, self.opt_value) = (
            klass.solve(self.start_tableau, self.start_basis, disp=disp))

class LPSolverBase(object):
    def __init__(self, ....):
        ....

    def solve(self):
        raise NotImplementedError()

class LP2StageSolver(LPSolverBase):
    def solve(self):
        from scipy.optimize import linprog
        ....
        return ........

class LPBigMSolver(LPSolverBase):
    def solve(self):
        .....
        return .......

class LPModifiedSimplexSolver(LPSolverBase):
    def solve(self):
        ....
        return .......

An LP instance can be initialized by:

lp_dict = {
    'qtype': 'min', 'goal': [3, 5, 1], 
    'constraints': [[-3, 4, 1, '>', 9], [-1, 8, 4, '>', 16]]}
lp = LP(**lp_dict)

The lp_dict is serializable.

As you can see, when I render the template, I can just pass an LP object (after it is solved with a specific method) to the template, both for the question body and the answer, and by adding extra context (show_question, show_answer, show_answer_explanation), I can use one template for all types of questions.

For your concern:

What type of data are you looking to hand to a template?

The data in the solved result includes np.ndarray, iterators, np.matrix, bools, strings, and some data structures from scipy. As you can see, np.ndarray and np.matrix are not serializable, let alone other unknown data structures.

dzhuang commented 6 years ago

I'm not likely to be a fan of a solution that spans multiple files.

I can understand your preference. However, as you can see from my use case, this style helps keep the YAML cleaner when duplicate code needs to be reused across pages (in my case, the code is about 2000 lines long), and runpy_context can be used to generate different question contexts, possibly with different random data sets, without modifying the Python file and template. Actually, my design allows both the pure-YAML and the external-file style.

inducer commented 6 years ago

If we allow passing that type of data to the template, then rendering the page template becomes something that has to happen in a container. I'd really like to avoid that. Instead, I would much prefer doing randomization/variant creation once during page data creation. We're already Jinja-expanding our YAML pages; I feel like that should suffice for making page variants. (Plus, all of that is fully determined by the JSON going in and thus cacheable.)

Note that that's not really a functional restriction. You can (theoretically) still do all the same things you could do before: You could render the whole page YAML into a string and then make the page template consist of:

{{ my_glorious_page_yaml }}

That's fine, but the price you pay is that the entire YAML ends up in the database for every page data instance (which is probably totally fine cost-wise).

As far as non-JSON types as input to templates go, I'm not buying the real need for them. For your example, numpy.ndarray.tolist() gives you an equivalent, JSON-compatible object.
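
For instance, a round trip through JSON:

import json
import numpy as np

a = np.array([[1.5, 2.0], [3.0, 4.5]])
payload = json.dumps(a.tolist())        # nested lists are JSON-compatible
restored = np.asarray(json.loads(payload))
assert (a == restored).all()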

I'm coming around to the two-file solution. How about something like the following:

generate_page_data:
    file: "gen.py"
    entrypoint: my_gen_function
    imports:
    - abc.py
    - xyz.py
---
type: ChoiceQuestion
id: yoink
value: 5
prompt: |
    # Choose a {{ type_of_thing }}
choices: {{ yaml_choices }}

or

generate_page_data:
    code: |
        type_of_thing = "color"
        yaml_choices = ["green", "~CORRECT~ red"]

---
type: ChoiceQuestion
id: yoink
value: 5
prompt: |
    # Choose a {{ type_of_thing }}
choices: {{ yaml_choices }}

To be recognized, the first line of the document would have to be exactly generate_page_data:. The --- is actually also a YAML thing; they call it a document separator. So, neglecting Jinja expansion of the second half (and potentially the first half?), it's still a fully valid YAML document.
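
A sketch of how that recognition could work, assuming the first document is plain YAML and the second half still needs Jinja expansion before parsing (the function name is hypothetical):

import yaml

def split_page_source(source):
    if not source.startswith("generate_page_data:"):
        return None, source            # an ordinary page
    head, _, body = source.partition("\n---\n")
    gen_spec = yaml.safe_load(head)["generate_page_data"]
    # body is Jinja-expanded with the generated variables,
    # then parsed as YAML.
    return gen_spec, body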

dzhuang commented 6 years ago

For your example, numpy.ndarray.tolist() gives you an equivalent, JSON-compatible object.

Yes, I knew that, and np.matrix also has that method. In that way I'll need to create a dict with a huge number of items :-(.

The two snippets you gave above seem to go down the container route? (Then that idea is no different from mine.) I imagine your suggestion is:

generate_page_data:
    datafile: "abc.bin"
---
type: ChoiceQuestion
id: yoink
value: 5
prompt: |
    # Choose a {{ type_of_thing }}
choices: {{ yaml_choices }}

or

generate_page_data:
    type_of_thing_list: ["color", "food", "book"]
    yaml_choices_list:  [["green", "~CORRECT~ red"], ["egg", "~CORRECT~ bread"], ["book1", "~CORRECT~ book2"]]

---
type: ChoiceQuestion
id: yoink
value: 5
prompt: |
    # Choose a {{ type_of_thing }}
choices: {{ yaml_choices }}

If I'm correct, your idea is to save both the question and the result in the random data set? It can be horrible in this way if I want to correct (or alter the structure of) the question when all students have started the session with that page, because it's almost impossible to update all page_data instances in one go. For example:

generate_page_data:
    first_number_list: [1, 1, 1, 2]
    second_number_list: [2, 3, 4, 5]
    method_list: ["add", "add", "multiply", "divide"]
    yaml_answers_list: [["<plain>2"], ["<plain>4"], ["<plain>4"], ["<plain>0.4"]]

---
type: TextQuestion
id: yoink
value: 5
prompt: |
    # Do some calculation
    ${{ first_number }} {% if method=="add" %}+{% elif method=="multiply" %}\times{% elif method=="divide" %}/{% endif %}{{ second_number }}$
answers: {{ yaml_answers }}

Then some page_data instances would save something like:

{'first_number': 1, 'second_number': 2, 'method': 'add', 'yaml_answers': ["<plain>2"]}

However, the above "correct answer" happened to be incorrect. For another case, for the 4th question, a student input '2/5' as answer. Those pages were graded as incorrect after submission. In these cases, there seems to be no other choices for us other than to manually correct all wrong page data before a bulk regrade can be executed.

inducer commented 6 years ago

I can see that incorrectly generated page data would pose an issue that's potentially difficult and laborious to fix. The best solution I can think of would be to make the generation of this data "pure", i.e. not dependent on any saved state. Just feed some (per-page-constant) random seed data into the page data creation, and the same page data comes out every time. Then we could simply cache the generated page data based on that seed and the course revision. I'm liking this better than the current saved page data construction, which I now think of as misguided.
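
A sketch of that contract with hypothetical names; the point is only that generation is a pure function of the seed, so the result can be cached by seed and revision:

import hashlib
import random

def generate_page_data(seed_bits):
    # "Pure": the same seed always yields the same page data.
    rng = random.Random(seed_bits)
    return {"x": rng.randrange(1, 10), "y": rng.randrange(1, 10)}

def page_data_cache_key(seed_bits, commit_sha):
    # The generation code itself may change between revisions,
    # so the cache key covers both the seed and the revision.
    return hashlib.sha256(
        ("%d:%s" % (seed_bits, commit_sha)).encode()).hexdigest()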

Note that I'm conflating page data and the var context that goes into Jinja expansion. Not sure if that's how it should be ultimately. FWIW, there is a chicken and egg problem in this design: some data generation may depend on being able to parse the YAML, but that's only possible once the page data is generated.

All this would of course only work well if the overhead of the data generation is low. Spawning a container for it is definitely too heavy, but maybe we can have one container per app server process (reused across requests) that acts as a "server" for running that code, where we would have to trust the generation code to not be so destructive that it would make the server unusable for subsequent requests (e.g. del sys.path[:]).

dzhuang commented 6 years ago

The best solution I can think of would be to make the generation of this data "pure", i.e. not dependent on any saved state.

Yes, and that's what I wanted to express above: "However, it brings convenience when modifying the question, because we only store raw question data in the database, rather than the result after calculation, as compared to the way you suggested above." (Sorry for my bad expression.)

Just feed some (per-page-constant) random seed data into the page data creation, and the same page data comes out every time. Then we could simply cache the generated page data based on that seed and the course revision.

What I've done is use those data generation attributes (i.e., *_generating_code, or the external runpy_file, which you proposed to name via import and file attributes, and some variables, which I named runpy_context) to build a per-page unique string (maybe this is what you called seed data), along with the randomly picked data, to build an MD5 hash as a cache key for caching the page_desc per student. Those keys are saved in page_data so that I don't need to calculate the key each time that page is visited.

However, saving the key in page_data brings the need to add an API call update_page_data in _adjust_flow_session_page_data_inner when there are changes to those data generation attributes. With that API, I can determine whether the data generation process needs to be re-run for existing pages if the commit_sha changed.

Saving that key to page_data is also important because the cache will expire. So I use MongoDB to save the generated page_desc, and use the key to search for the result when the cache expires.

dzhuang commented 6 years ago

I admit the overhead is a huge problem, and currently I'm using the spawn-a-container way, which is not acceptable to you.

dzhuang commented 6 years ago

(because I'm the only one who uses that kind of question on my instance)

inducer commented 6 years ago

Yes, and that's what I wanted to express above: "However, it brings convenience when modifying the question, because we only store raw question data in the database, rather than the result after calculation, as compared to the way you suggested above." (Sorry for my bad expression.)

Well, sorry for not understanding what you were saying the first time around. I am thinking the good news is that it seems we're now basically in agreement about the large-scale design.

What remains to be figured out:

However, saving the key in page_data brings the need to add an API call update_page_data in _adjust_flow_session_page_data_inner when there are changes to those data generation attributes.

I don't understand that. How would this need to change?

What I am envisioning is (irrespective of whether the question needs it or not) supplying 256 "random" bits to the question that depend on:

  • the course instance
  • the course instance and the student
  • the course instance and the student and the flow session
  • the course instance and the student and the flow session and the question number

(one for each) that it can use to determine a variant. As long as we make sure that we can always reconstruct the random/hashed bits, there is nothing to save, since I imagine this derivation would be quite quick. (AES or one of the SHAs comes to mind as a way of doing this.) I think this is sufficient to move away from page_data altogether. (Though we need to keep supporting it for old data.)

The idea is that these 'random' bits could then be used to determine variants by pseudorandom number generation with this seed.

I suspect many types of customization (shuffle/choose from alternatives/...) can be done just from within the Jinja template, with the availability of an RNG.

The Jinja sandbox also exists, although I suspect that's too restrictive of an environment for the generation code.
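
A sketch of such a derivation with hypothetical identifiers; SHA-256 makes the bits reconstructible from the identifiers alone, so nothing needs to be stored:

import hashlib
import random

def variant_bits(course, student, flow_session, page_number):
    # Derive 256 deterministic "random" bits for one page variant.
    key = "%s/%s/%s/%s" % (course, student, flow_session, page_number)
    return int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")

# Seed a PRNG with the derived bits to pick a variant deterministically.
rng = random.Random(variant_bits("cs450", "alice", 17, 3))
variant = rng.randrange(100)  # e.g. one of 100 possible variants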

dzhuang commented 6 years ago

More on the design:

  1. I think the pure data should be protected (from being modified), so there should be a field (other than page_data.data) in FlowPageData (page_info, as you suggested) to save the information of the context, for example the seed (which is independent of the pure data). That seed and the pure data are used to build a key for the cache and for MongoDB.

  2. We might need to make sure the course revision is not activated unless all variant pages are warmed up, to avoid performance issues. There should also be a mechanism to ensure subclasses from the repo will pass that step before being validated. Is there any way to do that?

dzhuang commented 6 years ago

(When I wrote the above comment, I didn't know you were replying.)

dzhuang commented 6 years ago
  • the course instance
  • the course instance and the student
  • the course instance and the student and the flow session
  • the course instance and the student and the flow session and the question number

The idea is that these 'random' bits could then be used to determine variants by pseudorandom number generation with this seed.

If I understand correctly, does that mean that even if there's only one question in the random question set, in the extreme, each page will be cached even though the cached contents are identical? And for 2 courses, each expecting 100 students with 1 session per student, we will need to run the generation code 200 times in a container instead of once (as in my design)?

What's more, the same random data set was used in one of my courses in different contexts, to generate different kinds of questions (Text, Choice, Upload), and the above style cannot cover that.

My idea is to use all the data generation attributes to construct (or reconstruct after a course revision) a common hash (which I named context_hash), independent of course, student, session, and the randomly picked data. Then the randomly picked question data is combined with that common hash to generate a question-specific hash (independent of course, student, and session), to cache the generated page_desc.

The common hash is mapped to commit_sha to tell whether this page_desc needs to be updated after a course revision.

dzhuang commented 6 years ago

However, saving the key in page_data brings the need to add an API call update_page_data in _adjust_flow_session_page_data_inner when there are changes to those data generation attributes.

I don't understand that. How would this need to change?

If I've expressed myself clearly (maybe not yet :-( ), you can see I want to save the common hash and the question-specific hash in the FlowPageData instance of each variant page, so as to get the cached generated data (from the configured cache, or from MongoDB when the cache has expired). That is to prevent unnecessary re-generation of the question (or page_desc) when the course is updated.

However, if the question generation attributes actually changed, then we need to update the common hash, along with the question-specific hash, to let the page know that the question (or page_desc in my case) needs to be updated and a container run is needed. In this case, we need to update that information in the FlowPageData instance of the page visited.

But currently, only the data attribute is writable and accessible in the API (page_data). I think using the FlowPageData instance (at least a JSON field other than data, perhaps page_info) to save the independent context and the page-specific seed will save a lot of time after a course revision. However, that might really raise the problem of what you called chicken-and-egg, because in my design, when rendering the page, we have to read from the current FlowPageData to determine whether another container run is required or the cached result can be used as-is.

dzhuang commented 6 years ago

In fact, I've run that kind of question for 2 semesters, each across 2 classes with tiny differences (without much problem, because I was testing the design with more than 100 tests). Most of the assignments use what we call Variant pages here. However, I'm OK and glad to migrate to the expected new design.

I think the idea you proposed is valuable in:

  1. No need to subclass the currently built-in types of questions. With my idea, I need to fake a valid page_desc which passes the validation of the parent question type in the variant page subclass, and after the generation process, the updated page_desc needs to be validated again.

  2. Clearer syntax

inducer commented 6 years ago

If I understand correctly, does that mean that even if there's only one question in the random question set, in the extreme, each page will be cached even though the cached contents are identical?

Yes, that's sort of what I was saying. Now that you spell out the consequences, that does seem silly. At the same time, I don't know that relying on a fan-in (i.e. reduction in number) of cached versions for scalability is wise. If, say, some unlucky instructor uses a random floating point number in a page (i.e. something that's likely to be different for every student), they would get horrible performance without necessarily understanding the difference between this and a page that only generates five different variants. Said differently, I would like it if the design could support the use case of "every student gets a different page" without performance degradation.

I also just realized that using per-question Jinja expansion is not completely natural, since we currently Jinja-expand whole flows, not single questions. This could be worked around, at some cost.

dzhuang commented 6 years ago

I would like it if the design could support the use case of "every student gets a different page" without performance degradation.

Do you mean the "unlucky instructor" use random or np.random.rand to generate a random number in the generation code? Workaround1: In documentation warn that in-place randomization is not allowed in this type of question, and make random_data_set a required attribute. To make sure "every student gets a different page", you can generating 10,000 random numbers and put them in the random data set.

Workaround 2 (not favorable for me): add an alternative attribute named random_data_generation_code (a required attribute if random_data_set isn't present) which returns data that is serializable or can be unpickled. The returned data is then saved in page_data, so that the generation code can determine what the random data actually is when the page needs to be re-rendered (on course revision, or when the cache has expired). And of course it should be documented that in-place randomization is also not allowed in this type of question.

And from here, we can see the actual need for saving the random question data to page_data. If the question is modified after a course revision, that's the only place we can learn what random data the page was using when we re-run the calculation code and then re-render the page. We should always be concerned with what happens after a page revision, which occurs from time to time; thus the random question data should always be saved per page, otherwise students might (will almost definitely) complain that they got a different question after the cache expired or after a course revision.

dzhuang commented 6 years ago

Do you agree that we need to use more persistent storage (like MongoDB) to save the runpy result, instead of the cache alone? That could greatly reduce the computational load for pages whose cache has expired.

dzhuang commented 6 years ago

If you agree that we should use MongoDB to store the calculated result, there's another idea for better performance: add a semi-validated stage before a course revision is validated. Only course revisions with variant pages need to pass this stage before becoming active. Instructors need to perform a warm-up (using some non-blocking technique to do the runpy and save the result in MongoDB) for each variant page before passing this stage. The runpy port for generating pages is only open in the semi-validated stage and closed after that. This also allows us to prevent repo subclasses from potentially degrading performance by running pages that bypass the semi-validated stage.

Of course, this only works for questions which use random_data_set to pick the random data to generate the question. That's why I am not in favor of workaround 2 mentioned above.

That will also work around the spawning-a-container-per-page problem, because we can do all the calculation in one run (i.e., only spawn one container for all the calculation).

inducer commented 6 years ago

Do you mean the "unlucky instructor" use random or np.random.rand to generate a random number in the generation code?

No, I meant doing everything as intended.

cache timed out

Can Django's caches even time out? I didn't think they could.

Do you agree that we need to use more persistent storage (like MongoDB) to save the runpy result, instead of the cache alone?

No, I'm not sure I follow.

dzhuang commented 6 years ago

Can Django's caches even time out? I didn't think they could.

Sorry, I should have said expire.

dzhuang commented 6 years ago

I've updated my comments above.

dzhuang commented 6 years ago

Do you agree that we need to use more persistent storage (like MongoDB) to save the runpy result, instead of the cache alone?

By this I mean that when the cache has expired, we need to get the result from elsewhere rather than re-run the generation code.

inducer commented 6 years ago

By this I mean that when the cache has expired, we need to get the result from elsewhere rather than re-run the generation code.

This whole engineering nightmare goes away (and the design gets much simpler) if we can make generation cheap (enough). So I'm much more interested in solving that problem. What did you think of the persistent generation server idea?

dzhuang commented 6 years ago

if we can make generation cheap (enough)

There's also the problem of what code we should trust. If the process needs to run 4 seconds (below the default timeout), it's also a nightmare to render the page in place.

What did you think of the persistent generation server idea?

That's good, as long as the destructive-code problem you mentioned above is worked around.