mitodl / ocw-studio

Open Source Courseware authoring tool
BSD 3-Clause "New" or "Revised" License

Improve Google Drive Backfill to Handle Non-Empty Folders #2170

pt2302 closed 2 months ago

pt2302 commented 2 months ago

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/4054.

Description (What does it do?)

This PR updates the Google Drive backfill command so that it can handle courses with non-empty Google Drive folders. For each resource, it checks whether a DriveFile exists and, if so, whether its download link is valid. If the DriveFile exists but has no valid download link, the command deletes the old DriveFile, creates a new one, and uploads the file to Google Drive. If no DriveFile exists, it creates a new DriveFile and uploads the file to Google Drive, as before.
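
The three cases described above can be sketched roughly as follows. This is only an illustration of the decision logic, not the code in this PR: `DriveFileRecord`, `backfill_resource`, `link_is_valid`, and `upload` are hypothetical stand-ins, and the real command works with Django models, S3, and the Google Drive API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class DriveFileRecord:
    """Hypothetical stand-in for a DriveFile row; not the real model."""

    resource_key: str
    download_link: Optional[str]


def backfill_resource(
    resource_key: str,
    existing: Optional[DriveFileRecord],
    link_is_valid: Callable[[str], bool],
    upload: Callable[[str], str],
) -> Tuple[DriveFileRecord, str]:
    """Return the record to keep and which of the three cases applied."""
    # Case 1: a DriveFile exists and its download link still resolves.
    if (
        existing is not None
        and existing.download_link
        and link_is_valid(existing.download_link)
    ):
        return existing, "kept"

    # Case 2: a DriveFile exists but its link is invalid, so it is replaced.
    # Case 3: no DriveFile exists, so one is created (the previous behavior).
    action = "replaced" if existing is not None else "created"
    new_link = upload(resource_key)  # assumed to return the new download link
    return DriveFileRecord(resource_key, new_link), action
```

Passing the link check and the upload in as callables keeps the sketch testable without network access; the outcome of the whole flow depends on `link_is_valid` being reliable.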

How can this be tested?

The following pre-requisites should be set up, including the relevant .env variables:

  1. Pick a legacy course that may have Google Drive content and also contains non-video resources for testing. Alternatively, create a course, add some files to its Google Drive folder, sync with Google Drive, and then delete the files from Google Drive without syncing again. Ideally, try this for courses that do and do not have existing DriveFiles for their resources.
  2. Navigate to https://ocw.mit.edu/courses/<course name>/download/ and download the course ZIP.
  3. Spin up OCW Studio with docker compose up.
  4. Extract the ZIP file, and copy the contents of the static_resources subfolder to ol-ocw-studio-app/courses/<course name> on Minio (navigate to http://localhost:9001 and then use the Minio UI to get there).
  5. Run docker compose exec web ./manage.py backfill_gdrive_folder --filter <course name or short-id>.
  6. Check the Google Drive folder for the website to ensure that the resources have been uploaded correctly. Also, check the Django admin to ensure that (updated) DriveFile objects have been created for each resource.
  7. Try syncing the course with Google Drive, and verify that nothing in the course has been changed.

ibrahimjaved12 commented 2 months ago

@pt2302 I think the flow may not be working as expected. Can you please share some cases to try?

Here's the simple case:

This is our normal OCW-Studio flow.

  1. Create a course
  2. Upload 2 pictures to its Gdrive folder.
  3. Sync
  4. At this point, the respective DriveFiles have been created and the files are in Minio as well. The files are still in Gdrive too (because we haven't deleted them yet).
  5. Run backfill management command.

Output:

Processing website: ibrahims-cat-course
No file found at https://drive.google.com/uc?id=xyz&export=download for resource courses/ibrahims-cat-course/cat9.jpeg. Deleting DriveFile and continuing.
Downloading file courses/ibrahims-cat-course/cat9.jpeg from S3 bucket ol-ocw-studio-app.
courses/ibrahims-cat-course/cat9.jpeg uploaded to Google Drive folder.
No file found at https://drive.google.com/uc?id=xyz_&export=download for resource courses/ibrahims-cat-course/meow6.jpeg. Deleting DriveFile and continuing.
Downloading file courses/ibrahims-cat-course/meow6.jpeg from S3 bucket ol-ocw-studio-app.
courses/ibrahims-cat-course/meow6.jpeg uploaded to Google Drive folder.

It seems to always fail to find the files in Gdrive, even though they exist and their URLs are correct. When it fails to find a resource file in Gdrive, it deletes the respective DriveFile (and creates another), which triggers this signal:

@receiver(pre_delete, sender=DriveFile)
def delete_from_s3(sender, **kwargs):  # pylint:disable=unused-argument  # noqa: ARG001
    """
    Delete the drive file from S3
    """
    drive_file = kwargs["instance"]
    delete_s3_objects.delay(drive_file.s3_key)

Our files, which were not problematic in the first place, are then deleted from S3. At this point, Gdrive has duplicates of those files, so there are 4 files in total there, while our Minio is empty.
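
To make the sequence concrete, here is a rough in-memory replay of what I think is happening. Everything in it is a stand-in: `simulate_backfill` and its containers are hypothetical, and it assumes the queued `delete_s3_objects` task runs only after the command has already read the file from the bucket, which is what the log order suggests but which I have not confirmed.

```python
from collections import deque


def simulate_backfill(drive_folder, bucket, drive_files, link_is_valid):
    """Replay the reported sequence on plain Python containers.

    drive_folder: list of file names in Google Drive (duplicates allowed)
    bucket: set of S3/Minio keys
    drive_files: dict of S3 key -> download link (stand-in for DriveFile rows)
    link_is_valid: callable deciding whether a download link resolves
    """
    pending_s3_deletes = deque()  # stand-in for queued delete_s3_objects tasks

    for s3_key, link in list(drive_files.items()):
        if link_is_valid(link):
            continue  # DriveFile is fine, leave everything alone

        # Deleting the DriveFile fires pre_delete, which queues the S3 delete.
        del drive_files[s3_key]
        pending_s3_deletes.append(s3_key)

        # Assumption: the queued task has not run yet, so the file can still
        # be read from the bucket and uploaded to Drive as a second copy.
        if s3_key in bucket:
            drive_folder.append(s3_key.rsplit("/", 1)[-1])
            drive_files[s3_key] = "new-link-for-" + s3_key

    # The worker eventually drains the queue and removes the S3 objects.
    while pending_s3_deletes:
        bucket.discard(pending_s3_deletes.popleft())

    return drive_folder, bucket, drive_files
```

With a link check that always fails, two files in Drive and two keys in the bucket end up as four Drive entries and an empty bucket, which matches what I see. With a check that passes, nothing changes.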

https://github.com/mitodl/ocw-studio/pull/2170/files#diff-b9ad326e87c6da4a4083953e2997fb2f4e5eed97ad6f194996cc36526a2ddbaeR142-R151

These are the lines of code that are doing this.

I tried this using my Gdrive credentials and also RC's Gdrive credentials, and got the same result.