Combine RawContent and TextContent into Content

ormsbee commented 8 months ago

This is mostly pulling together RawContent and TextContent into a unified Content model in order to reduce confusion and not force text to always go to a file-based storage backend.

Other things that may also be a part of this PR as issues I've noticed along the way:

reduced the primary key for LearningPackage to 4-bytes to reduce the size of indexes.
maybe eliminate the FileField to save space (and use the low level storages API instead)
fix a potential cache corruption issue with lru in rollback situations.

ormsbee commented 8 months ago

Okay, thinking on this a little more–the FileField is pretty much entirely redundant, as well as being one of the biggest parts of the table. The storage location is based entirely on the LearningPackage UUID + the Content hash digest. I'm going to see if it's cumbersome to make it a boolean flag + accessor method.

ormsbee commented 8 months ago

Other self notes before I forget:

The text should be null, not blank, when the data is only in the file.
In the future, we might want to allow a blob field in Content for compressed text (e.g. using zlib compression). This could save us a lot of space, at the expense of making it more difficult to query. MySQL has a compressed row type sounds perfect for the usage pattern Content has, but we can't use it in RDS Aurora.

ormsbee commented 8 months ago

The compression thing came to mind because I was looking through some example course data and there are a handful of Capa problems that weigh in at ~13-14 KB. But when compressed with zlib, that goes down to about 2K–the larger problems tend to be that way because they have a lot of Python code and HTML table markup, both of which compress really well.

There are HTML blocks that are dramatically larger than this, but that's because they're encoding images into the raw HTML using base64-encoded data URLs (<img src="image/png;base64,....>"). We are definitely not supporting that.

ormsbee commented 8 months ago

Okay, I poked into the compression thing just a little bit further. I'm going to stop now because it's not critical to get in for the short term, but I have a general plan for it:

Rename the text field to uncompressed_text.
Create a new BinaryField for compressed_text.
Create a cached property text that knows how to switch between the two.

At the time of write, we run zlib compression on the text and decide whether to use the compressed or uncompressed field for this row. The other field is left null. When we first introduce this feature, we can run it as a data migration, though that wouldn't be a requirement.

Pruning is still the more important feature for controlling the content size growth.

ormsbee commented 8 months ago

@bradenmacdonald, @kdmccormick: A little later than I had hoped, but it's ready for real review.

ormsbee commented 8 months ago

@bradenmacdonald, @kdmccormick: Do you folks know what this mypy error is about by any chance?

openedx_learning/core/contents/api.py:152: error: Incompatible type for "text" of "Content" (got "None", expected "Union[str, Combinable]") [misc]

kdmccormick commented 8 months ago

@ormsbee It's telling you that when constructing a Content model instance, you can't set text=None; it wants you to set it to an actual str instance. Would it make sense to do text="" here, or is that fundamentally different?

You can ignore the Combinable thing--that's like F and Q objects.

ormsbee commented 8 months ago

@ormsbee It's telling you that when constructing a Content model instance, you can't set text=None; it wants you to set it to an actual str instance. Would it make sense to do text="" here, or is that fundamentally different?

It's fundamentally different in this case. text=None means "there is no text representation in the database for this Content–it only exists in the file store". text="" means "there is a text representation in the database for this Content, and that happens to be an empty string".

But it's a nullable field, so it should be permitted. Does the type-checker just not accept nullable text fields?

kdmccormick commented 8 months ago

But it's a nullable field, so it should be permitted. Does the type-checker just not accept nullable text fields?

That would surprise me, so I tested it out by changing the definition of text to a regular TextField:

#    text = MultiCollationTextField(
    text = models.TextField(
        blank=True,
        null=True,
        max_length=MAX_TEXT_LENGTH,
        # We don't really expect to ever sort by the text column, but we may
        # want to do case-insensitive searches, so it's useful to have a case
        # and accent insensitive collation.
#        db_collations={
#            "sqlite": "NOCASE",
#            "mysql": "utf8mb4_unicode_ci",
#        }
    )

and that type-checked fine (except for a new error that pops up on openedx_learning/core/components/admin.py:167, where I think you need to change content_obj.text to context_obj.text or "").

So, I think the issue that the field-value type argument (specifically, str|None) isn't getting passed through MultiCollationTextField into TextField for some reason. It's possible that we need to define MultiCollationTextField as a generic class, although I'll need to play with that a bit to figure out the right syntax.

If that ends up blocking this PR, you could hack around the error for now by adding a type annotation directly to text:


    # TextField type args are [TypeForGetting,TypeForSetting]
    text: models.TextField[str|None, str|None] = MultiCollationTextField(
        blank=True,
        null=True,
        max_length=MAX_TEXT_LENGTH,
        # We don't really expect to ever sort by the text column, but we may
        # want to do case-insensitive searches, so it's useful to have a case
        # and accent insensitive collation.
        db_collations={
            "sqlite": "NOCASE",
            "mysql": "utf8mb4_unicode_ci",
        }
    )

kdmccormick commented 8 months ago

@ormsbee if you want to just work around this for now using text: models.TextField[str|None, str|None], I opened https://github.com/openedx/openedx-learning/issues/152, which the type-checking nerd in me would be happy to work through at some later date.

The only reason I balk at just using # type: ignore is that, if left unresolved, we'd lose some type safety in edx-platform wherever .text is referenced, since the field value's type is in fact str|None, not str, even if mypy thinks it's the latter.

ormsbee commented 8 months ago

@kdmccormick: I respect and appreciate your inner type-checking nerd. I'll use the workaround you suggested. Thank you.

ormsbee commented 8 months ago

@kdmccormick: Incorporated all your suggestions except this one on get_component_by_key/component_exists_by_key.

openedx / openedx-learning

Combine RawContent and TextContent into Content #149