Closed ormsbee closed 8 months ago
Okay, thinking on this a little more–the FileField is pretty much entirely redundant, as well as being one of the biggest parts of the table. The storage location is based entirely on the LearningPackage UUID + the Content
hash digest. I'm going to see if it's cumbersome to make it a boolean flag + accessor method.
Other self notes before I forget:
Content
for compressed text (e.g. using zlib
compression). This could save us a lot of space, at the expense of making it more difficult to query. MySQL has a compressed row type sounds perfect for the usage pattern Content
has, but we can't use it in RDS Aurora.The compression thing came to mind because I was looking through some example course data and there are a handful of Capa problems that weigh in at ~13-14 KB. But when compressed with zlib
, that goes down to about 2K–the larger problems tend to be that way because they have a lot of Python code and HTML table markup, both of which compress really well.
There are HTML blocks that are dramatically larger than this, but that's because they're encoding images into the raw HTML using base64-encoded data URLs (<img src="image/png;base64,....>"
). We are definitely not supporting that.
Okay, I poked into the compression thing just a little bit further. I'm going to stop now because it's not critical to get in for the short term, but I have a general plan for it:
text
field to uncompressed_text
.BinaryField
for compressed_text
.text
that knows how to switch between the two.At the time of write, we run zlib compression on the text and decide whether to use the compressed or uncompressed field for this row. The other field is left null. When we first introduce this feature, we can run it as a data migration, though that wouldn't be a requirement.
Pruning is still the more important feature for controlling the content size growth.
@bradenmacdonald, @kdmccormick: A little later than I had hoped, but it's ready for real review.
@bradenmacdonald, @kdmccormick: Do you folks know what this mypy error is about by any chance?
openedx_learning/core/contents/api.py:152: error: Incompatible type for "text" of "Content" (got "None", expected "Union[str, Combinable]") [misc]
@ormsbee It's telling you that when constructing a Content
model instance, you can't set text=None
; it wants you to set it to an actual str instance. Would it make sense to do text=""
here, or is that fundamentally different?
You can ignore the Combinable
thing--that's like F
and Q
objects.
@ormsbee It's telling you that when constructing a
Content
model instance, you can't settext=None
; it wants you to set it to an actual str instance. Would it make sense to dotext=""
here, or is that fundamentally different?
It's fundamentally different in this case. text=None
means "there is no text representation in the database for this Content–it only exists in the file store". text=""
means "there is a text representation in the database for this Content, and that happens to be an empty string".
But it's a nullable field, so it should be permitted. Does the type-checker just not accept nullable text fields?
But it's a nullable field, so it should be permitted. Does the type-checker just not accept nullable text fields?
That would surprise me, so I tested it out by changing the definition of text
to a regular TextField:
# text = MultiCollationTextField(
text = models.TextField(
blank=True,
null=True,
max_length=MAX_TEXT_LENGTH,
# We don't really expect to ever sort by the text column, but we may
# want to do case-insensitive searches, so it's useful to have a case
# and accent insensitive collation.
# db_collations={
# "sqlite": "NOCASE",
# "mysql": "utf8mb4_unicode_ci",
# }
)
and that type-checked fine (except for a new error that pops up on openedx_learning/core/components/admin.py:167, where I think you need to change content_obj.text
to context_obj.text or ""
).
So, I think the issue that the field-value type argument (specifically, str|None
) isn't getting passed through MultiCollationTextField into TextField for some reason. It's possible that we need to define MultiCollationTextField as a generic class, although I'll need to play with that a bit to figure out the right syntax.
If that ends up blocking this PR, you could hack around the error for now by adding a type annotation directly to text
:
# TextField type args are [TypeForGetting,TypeForSetting]
text: models.TextField[str|None, str|None] = MultiCollationTextField(
blank=True,
null=True,
max_length=MAX_TEXT_LENGTH,
# We don't really expect to ever sort by the text column, but we may
# want to do case-insensitive searches, so it's useful to have a case
# and accent insensitive collation.
db_collations={
"sqlite": "NOCASE",
"mysql": "utf8mb4_unicode_ci",
}
)
@ormsbee if you want to just work around this for now using text: models.TextField[str|None, str|None]
, I opened https://github.com/openedx/openedx-learning/issues/152, which the type-checking nerd in me would be happy to work through at some later date.
The only reason I balk at just using # type: ignore
is that, if left unresolved, we'd lose some type safety in edx-platform wherever .text
is referenced, since the field value's type is in fact str|None
, not str
, even if mypy thinks it's the latter.
@kdmccormick: I respect and appreciate your inner type-checking nerd. I'll use the workaround you suggested. Thank you.
This is mostly pulling together
RawContent
andTextContent
into a unifiedContent
model in order to reduce confusion and not force text to always go to a file-based storage backend.Other things that may also be a part of this PR as issues I've noticed along the way:
FileField
to save space (and use the low level storages API instead)lru
in rollback situations.