learningequality / ricecooker

Python library for creating Kolibri channels and uploading to Studio
https://ricecooker.readthedocs.io/
MIT License

Non-deterministic creation of perseus exercises #253

Open ivanistheone opened 4 years ago

ivanistheone commented 4 years ago

Description

Running Khan Academy zh-CN and it with nearly identical source trees resulted in every exercise being marked as different, so all of them had to be re-imported.

What I Did

Expected

Only one or two nodes should have changed (the modified topic nodes).

Actual

Every single exercise had changed and required re-importing.

Possible causes

I suspect two possible causes:

  1. This line could be the cause: https://github.com/learningequality/ricecooker/blob/master/ricecooker/classes/questions.py#L275 Since we are running on Python 3.5, dict ordering is not guaranteed to be consistent between different runs.

  2. The other possible cause is that the .perseus file generation on Studio could be non-deterministic, see https://github.com/learningequality/studio/blob/develop/contentcuration/contentcuration/utils/publish.py#L342-L359 which uses predictable-zip code that is slightly different from the predictable zip code used in ricecooker.
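
Both suspected causes come down to producing byte-identical output for identical input. The following is a minimal sketch of that idea, not the actual ricecooker or Studio code: it serializes the exercise data with sorted keys so dict ordering cannot matter, then writes a zip with sorted entries and fixed metadata. The function names, the `exercise.json` entry name, and the fixed timestamp are all assumptions made for illustration.

```python
import hashlib
import io
import json
import zipfile

# Fixed timestamp so archive metadata never varies between runs (arbitrary, assumed value).
FIXED_DATE = (2015, 10, 21, 7, 28, 0)


def canonical_json(data):
    """Serialize with sorted keys so dict ordering cannot affect the bytes."""
    return json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")


def predictable_zip_bytes(files):
    """Build a zip from {archive_path: bytes} with sorted entries and fixed metadata."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path in sorted(files):
            info = zipfile.ZipInfo(path, date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 3  # report "Unix" regardless of the host platform
            info.external_attr = 0o644 << 16  # same permissions on every platform
            archive.writestr(info, files[path])
    return buffer.getvalue()


def exercise_digest(exercise_data, assets):
    """Hash the archive built from the exercise data plus its asset files."""
    files = dict(assets)
    files["exercise.json"] = canonical_json(exercise_data)  # hypothetical entry name
    return hashlib.md5(predictable_zip_bytes(files)).hexdigest()
```

If both the chef side and the Studio side produced archives this way, two runs over the same source tree would be expected to yield the same digest, so unchanged exercises would not look modified.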

Real life consequences

Khan Academy channel users will need to redownload many files (small but still many) every time a new channel is published, even though exercises are substantially the same.

ivanistheone commented 4 years ago

An alternative way to store the "cache" of generated perseus files is to add a new table:

from django.db import models
from django.utils.translation import gettext_lazy as _


class GeneratedExerciseFileCache(models.Model):
    """
    A lookup table to avoid re-generating exercise export formats (perseus specifically).
    If the current `md5(exercise_data)` of an exercise node matches the
    `exercise_data_hash` of some row in this table, reuse the perseus .zip file `file`.
    """
    # id = autoincrementing int
    exercise_data_hash = models.CharField(max_length=400, blank=True, db_index=True)
    # on_delete is required from Django 2.0 onwards; SET_NULL matches null=True
    file = models.ForeignKey('File', null=True, blank=True, related_name='_',
                             on_delete=models.SET_NULL)
    created = models.DateTimeField(auto_now_add=True, verbose_name=_("created"))
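
The lookup described in the docstring could work roughly as follows. This is a sketch only: a plain dict stands in for the proposed table, and `get_or_create_perseus_file` and the `generate` callable are hypothetical names, not existing Studio functions.

```python
import hashlib
import json

# Stand-in for the GeneratedExerciseFileCache table: exercise_data_hash -> file id.
_cache = {}


def exercise_data_hash(exercise_data):
    """md5 of the exercise data, serialized with sorted keys so the hash is stable."""
    payload = json.dumps(exercise_data, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def get_or_create_perseus_file(exercise_data, generate):
    """Return (file_id, created).

    `generate` is a hypothetical callable that builds the .perseus zip
    and returns an identifier for the stored file.
    """
    key = exercise_data_hash(exercise_data)
    if key in _cache:
        return _cache[key], False  # cache hit: reuse the existing file
    file_id = generate(exercise_data)
    _cache[key] = file_id
    return file_id, True
```

With a lookup like this, publishing an unchanged exercise would reuse the stored file instead of producing a new one.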

ivanistheone commented 4 years ago

Update post cpUs luncheon (June 17)

A much simpler solution: add an exercise_data_changed (bool) field on ContentNode (only relevant for kind=exercise). Credit: @kollivier
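
A minimal sketch of how such a flag might gate regeneration at publish time. Everything here is an assumption for illustration: `ExerciseNode` is a plain dataclass standing in for ContentNode, and `publish` and `generate` are hypothetical names rather than Studio's actual publish code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExerciseNode:
    """Stand-in for a ContentNode with kind=exercise (hypothetical)."""
    title: str
    exercise_data: dict
    exercise_data_changed: bool = True  # assumed default: new nodes need an export
    perseus_file: Optional[str] = None

    def edit(self, **changes):
        """Any edit to the exercise data sets the flag."""
        self.exercise_data.update(changes)
        self.exercise_data_changed = True


def publish(nodes, generate):
    """Regenerate the perseus file only for flagged nodes; return how many were regenerated."""
    regenerated = 0
    for node in nodes:
        if node.exercise_data_changed or node.perseus_file is None:
            node.perseus_file = generate(node)
            node.exercise_data_changed = False  # clear the flag once exported
            regenerated += 1
    return regenerated
```

Under these assumptions, a first publish regenerates every exercise, a publish with no edits regenerates none, and a publish after edits regenerates only the edited exercises.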

on create_perseus_exercise

First PUBLISH:

Subsequent PUBLISH after noop:

Subsequent PUBLISH after exercises edited:

Frontend requirements

ivanistheone commented 4 years ago

See more recent discussion on Studio here: https://github.com/learningequality/studio/issues/1982