Download remote files and include in QTI folder?

ururk commented 3 years ago

We are using this tool to generate quizzes for Canvas, and one of our workflows would benefit from being able to have the script download image assets locally (via parameter), rather than link to the respective website. I have a general idea where to make this change in the codebase, and could make a pull request if there's interest in it, and if this is something that would get merged.

gpoore commented 3 years ago

Can you say a little more about how a feature like this would be used? For example, are you thinking about a command-line option that specifies a location for downloading images, or somewhere in the text file that specifies this? Are there advantages to having this built into text2qti, versus having a script that downloads the images if they don't exist, and then always running the script before text2qti?

ururk commented 3 years ago

Sure.

The way I was thinking it would work would be via command line options:

text2qti --download-external-resources --path-to-store-downloads (defaults to dl folder relative to md document)

Reasoning:

We are developing an internal website that generates QTI files. It first makes a markdown representation, then runs it through text2qti*, but we are also giving the instructor the ability to download the markdown. Markdown doesn't store images, so either we bundle those images in a zip or link to our website. However, those URLs might not be reachable by Canvas, so I would prefer to upload the images via QTI rather than link to an external resource. As part of the site, we'll also be allowing trusted users to upload a markdown document and have it run through text2qti. For instance, they might want to edit the quiz before importing the QTI into Canvas. Not all of them are familiar enough with Python or able to install the Windows GUI tool. So my thought was: when generating the markdown on our website, use URLs instead of file paths, and when we run the tool, have text2qti download the resources.

That being said - I could easily pre-process the uploaded markdown to download the resources locally. I could imagine a scenario where for authentication reasons having your tool do the download could get tricky/impossible... so perhaps it doesn't fit directly in the package.
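A rough sketch of what that pre-processing step could look like, assuming the markdown uses plain inline image syntax, ![alt](url). The names IMAGE_LINK, rewrite_image_links, and local_path_for are made up for illustration and are not part of text2qti. The regex only handles simple http/https image links with no title text.

```python
import re

# Hypothetical pattern: matches ![alt](http://...) or ![alt](https://...)
# Images with titles or nested brackets are not handled.
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\((https?://[^)\s]+)\)")


def rewrite_image_links(markdown_text, local_path_for):
    """Replace each remote image URL with the path returned by local_path_for(url)."""
    def replace(match):
        alt_text, url = match.group(1), match.group(2)
        return f"![{alt_text}]({local_path_for(url)})"

    return IMAGE_LINK.sub(replace, markdown_text)
```

The callback would be responsible for actually fetching the file and deciding where it lives. The rewritten markdown could then be passed to text2qti unchanged.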

*Long-term plans are to work with the Canvas APIs, but since other quizzing platforms support QTI, this is a great first option for cross-platform quiz question generation. The general idea is to generate hundreds of questions to help create random quizzes in Canvas.

gpoore commented 3 years ago

I've thought about this a little more, and have an alternative suggestion. How about new command-line options like --download-images and --image-directory? These would download all linked images to the specified directory and treat them as local files, rather than just including links in the QTI. Maybe just limit it to http and https links. I could see a feature like this being useful for many people (it avoids downloading images manually), and I think it would be able to do what you want.

Of course, there might be some complexity about keeping the cached images updated and possibly also some questions about what to name the files, but I expect that implementing basic functionality would be relatively straightforward. text2qti is already using a custom ImageInlineProcessor subclass for the markdown package to process images, so adding a basic version of this shouldn't be difficult.

ururk commented 3 years ago

That sounds excellent - I think it would be up to the user to empty out the image directory, and the default would be to skip files that have already been downloaded. Obviously this needs a lot of thought:

What about identically named files at different URL paths (i.e., https://example.com/session1/image.jpg and https://example.com/session2/image.jpg)? Make files named after a hash of the contents?

What if someone links to a file without an extension?

Just some initial thoughts.
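For the "skip already downloaded" default, a minimal sketch using only the standard library might look like the following. download_if_missing is a hypothetical helper, not an existing text2qti function, and it ignores authentication, timeouts, and stale cache entries.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download_if_missing(url: str, dest: Path) -> bool:
    """Fetch url into dest unless dest already exists.

    Returns True if a download happened, False if the cached file was reused.
    """
    if dest.exists():
        return False  # default behavior: keep whatever is already cached
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return True
```

With this approach, emptying the image directory is the only way to force a refresh, which matches leaving cache cleanup to the user.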

gpoore commented 3 years ago

For identical files: I'd suggest creating subdirectories within the cache that are based on hashes of the complete URL path without the file name, and then putting files within those while preserving the original name. So https://example.com/session1/image.jpg goes in <cache>/<hash>/image.jpg, where <hash> is based on example.com/session1 or even just https://example.com/session1 for simplicity. Probably something like a BLAKE2B digest converted to hex and then truncated to 16 characters would be sufficient, possibly with a cache index retaining full information to eliminate collisions. If users are ever expected to look at the cache, then maybe something more like <cache>/<domain>/<hash>/image.jpg would be better.
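As a sketch of the simpler <cache>/<hash>/image.jpg layout, assuming the hash covers the scheme, host, and directory part of the URL (the function name cache_path_for is just for illustration, and no cache index is included):

```python
import hashlib
from pathlib import Path
from urllib.parse import urlsplit


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a URL to <cache_dir>/<hash>/<original file name>."""
    parts = urlsplit(url)
    # Split "/session1/image.jpg" into "/session1" and "image.jpg".
    parent, _, filename = parts.path.rpartition("/")
    # Hash everything except the file name, e.g. "https://example.com/session1".
    base = f"{parts.scheme}://{parts.netloc}{parent}"
    digest = hashlib.blake2b(base.encode("utf-8")).hexdigest()[:16]
    return cache_dir / digest / filename
```

With this, https://example.com/session1/image.jpg and https://example.com/session2/image.jpg end up in different subdirectories while both keep the name image.jpg.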

For links without extensions: Maybe just restrict to common image extensions, at least at first.
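A possible check for this, combined with the http/https restriction, is sketched below. The extension list and the name is_downloadable_image are assumptions for illustration, not something text2qti defines.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed starting set of "common image extensions"; adjust as needed.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def is_downloadable_image(url: str) -> bool:
    """True for http/https URLs whose path ends in a known image extension."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # Only the path is inspected, so query strings do not affect the result.
    return PurePosixPath(parts.path).suffix.lower() in IMAGE_EXTENSIONS
```

Links that fail the check would simply stay as links in the QTI, as they do now.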

gpoore / text2qti

Download remote files and include in QTI folder? #45