mitodl / ocw-studio

Open Source Courseware authoring tool
BSD 3-Clause "New" or "Revised" License

Check external resources for broken links #2171

Closed HussainTaj-arbisoft closed 1 month ago

HussainTaj-arbisoft commented 2 months ago

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/3922

Description (What does it do?)

This PR implements the plan described in https://github.com/mitodl/hq/issues/3922.

What still needs to be done (this was my plan, you may alter it as needed):

How can this be tested?

  1. Switch to the branch 'hussaintaj/3922-check-external-resources'.
  2. Restart the containers.
  3. Apply DB migrations: docker-compose run web ./manage.py migrate
  4. From the Celery logs, verify that a periodic task named check-broken-external-urls has been registered. The task is set to repeat every week; the frequency can be changed via CHECK_EXTERNAL_RESOURCE_STATUS_FREQUENCY in main/settings.py (see the sketch after this list).
  5. Go to http://localhost:8043/admin/external_resources/externalresourcestate/. This admin page should exist once the migrations have been applied.
  6. After the first execution of the task, the External Resource States should be populated with the URL check results for the existing external resources.
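
For reference, a periodic task like this is usually registered through the Celery beat schedule. A minimal sketch of what that configuration could look like, using the setting name from step 4; the dotted task path and the exact values are assumptions for illustration, not a copy of the PR's code:

    # main/settings.py (sketch; names and values are assumptions)
    CHECK_EXTERNAL_RESOURCE_STATUS_FREQUENCY = 60 * 60 * 24 * 7  # one week, in seconds

    CELERY_BEAT_SCHEDULE = {
        "check-broken-external-urls": {
            # assumed dotted path to the task added in this PR
            "task": "external_resources.tasks.check_external_resources_for_breakages",
            "schedule": CHECK_EXTERNAL_RESOURCE_STATUS_FREQUENCY,
        },
    }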

Additional Testing

To manually test the functionality locally with a batch of URLs

  1. Spin up the containers.
  2. Get into the Django shell: docker-compose exec web ./manage.py shell
  3. Run the following commands to import the necessary code:
    • from external_resources.tasks import check_external_resources
    • from websites.models import WebsiteContent
    • from websites.constants import CONTENT_TYPE_EXTERNAL_RESOURCE
  4. Get the website content data: external_resources = list(WebsiteContent.objects.filter(type=CONTENT_TYPE_EXTERNAL_RESOURCE).values_list("id", flat=True))
  5. (Optional) Check the length of the data and slice it if necessary: len(external_resources) and external_resources = external_resources[:1]
  6. Run the checking task: check_external_resources(external_resources)
  7. To validate that the resource state was updated or created, retrieve the resource from the DB:
    • resources = WebsiteContent.objects.filter(id__in=external_resources).select_related("external_resource_state")
    • resource = next(resources.iterator())
    • str(resource.external_resource_state.last_checked)
  8. Repeat steps 6 and 7 to update the timestamps and validate them again.
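
For convenience, the steps above can be pasted into the Django shell as one snippet. This is only a consolidation of the commands already listed in this section:

    from external_resources.tasks import check_external_resources
    from websites.constants import CONTENT_TYPE_EXTERNAL_RESOURCE
    from websites.models import WebsiteContent

    # Collect the ids of all external resource content objects.
    external_resources = list(
        WebsiteContent.objects.filter(
            type=CONTENT_TYPE_EXTERNAL_RESOURCE
        ).values_list("id", flat=True)
    )

    # Optionally limit the batch size while testing.
    external_resources = external_resources[:1]

    # Run the checking task synchronously on the batch.
    check_external_resources(external_resources)

    # Confirm that the related state objects were created or updated.
    resources = WebsiteContent.objects.filter(
        id__in=external_resources
    ).select_related("external_resource_state")
    resource = next(resources.iterator())
    print(resource.external_resource_state.last_checked)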

To manually execute the task in Celery

  1. Spin up the containers.
  2. Get into the Django shell: docker-compose exec web ./manage.py shell
  3. Run the following command to import the necessary code:
    • from external_resources.tasks import check_external_resources_for_breakages
  4. Run the task: results = check_external_resources_for_breakages.apply()
     Note: this may produce a large number of URLs to check and can take some time to complete.
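
Note that .apply() runs the task synchronously in the shell process and returns an EagerResult, so the call blocks until every URL has been checked. To hand the task to a Celery worker instead, the standard call is .delay() (a general Celery usage sketch, not code specific to this PR):

    from external_resources.tasks import check_external_resources_for_breakages

    # Run in-process; blocks the shell until the check finishes.
    results = check_external_resources_for_breakages.apply()

    # Or enqueue on the worker and poll the result asynchronously.
    async_result = check_external_resources_for_breakages.delay()
    print(async_result.status)
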
umar8hassan commented 2 months ago

@gumaerc can you confirm whether we really need to add a separate queue? Assigning rate_limits and low priority to the tasks should be enough.

gumaerc commented 2 months ago

@umar8hassan I think that's a sensible approach, and at the end of the day, if it works, it works. My thought process behind potentially having a separate queue for this task is that, since there is a large number of links to check, the queue would get clogged with tasks whenever this is running. Even if the tasks aren't CPU-heavy, the sheer number of them could hold up more important tasks. If we can adjust priority to make sure that tasks such as course publishing have absolute priority and kick everything else out of the way, then I think that would be fine. The real test will be running this in RC and making sure we can still publish courses unimpeded while link processing is happening. If you're confident that rate limits and priority can guarantee that, then go for it.
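
For context, the two approaches being weighed here correspond to standard Celery configuration. A rough sketch of both, with the rate, priority value, and queue name chosen purely for illustration:

    from celery import shared_task

    # Option 1: rate-limit the task and enqueue it with a low priority.
    @shared_task(rate_limit="10/m")  # at most 10 runs per minute per worker
    def check_external_resources(resource_ids):
        """Placeholder body; the real task lives in external_resources/tasks.py."""

    # Priority requires broker support, and its ordering depends on the broker.
    check_external_resources.apply_async(args=[[1, 2, 3]], priority=1)

    # Option 2: route the task to a dedicated queue so a separate worker can
    # drain it without competing with publishing tasks.
    CELERY_TASK_ROUTES = {
        "external_resources.tasks.check_external_resources": {
            "queue": "broken-link-checks",  # assumed queue name
        },
    }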

pt2302 commented 1 month ago

Tests are failing; it looks like it's because the constants are now in a different file.

umar8hassan commented 1 month ago

> Tests are failing; it looks like it's because the constants are now in a different file.

Sorry about that. I had fixed that for api_test.py, but my change might not have been staged properly. I'll check and push again.

pt2302 commented 1 month ago

As suggested by @gumaerc, I think it would be helpful to add a feature flag to disable the celery task if necessary, since this is a lower-priority task than publishing, syncing, the video workflow, etc.
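
A minimal sketch of what such a flag could look like, assuming a plain boolean Django setting (the setting name is an assumption, not necessarily what the PR implements):

    # external_resources/tasks.py (sketch)
    from celery import shared_task
    from django.conf import settings

    @shared_task
    def check_external_resources_for_breakages():
        # Bail out early when the flag is off so the scheduled run becomes a
        # cheap no-op instead of checking every external URL.
        if not getattr(settings, "ENABLE_EXTERNAL_RESOURCE_CHECKS", False):
            return
        ...  # existing link-checking logic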

pt2302 commented 1 month ago

The actual Django objects for the external resources don't appear to be getting updated for me. To reproduce, take a particular external resource and change the external_url field in Django admin to a URL that doesn't exist. Then, run the check as in this PR's testing instructions. The is_broken field does not appear to have been updated.

umar8hassan commented 1 month ago

> The actual Django objects for the external resources don't appear to be getting updated for me. To reproduce, take a particular external resource and change the external_url field in Django admin to a URL that doesn't exist. Then, run the check as in this PR's testing instructions. The is_broken field does not appear to have been updated.

I have tested it and it seems to be working fine for me. Here is a video of my test.

https://github.com/mitodl/ocw-studio/assets/71461724/eda992a2-93fd-4434-b9f3-131d5e0b8a0c

However, there are cases where it won't update the field:

  1. Setting the external_url field to google.com without a scheme/protocol, or to any invalid URL. This results in an error raised by the requests module, so the check is recorded as failed and the previous value of the is_broken field is retained (see the snippet after this list).
  2. After the is_broken field has been set to true for a broken resource, manually updating the external_url field to a valid URL will not automatically reset is_broken to false or "".
  3. Adding a valid backup_url will also not update the is_broken field.
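
Case 1 can be reproduced directly in a shell: requests rejects scheme-less URLs before any network request is made, so the check raises instead of returning a status code (illustrative only):

    import requests

    try:
        requests.get("google.com", timeout=10)
    except requests.exceptions.MissingSchema as exc:
        # No request is made; the task records a failed check and leaves the
        # previous value of is_broken untouched.
        print(exc)  # Invalid URL 'google.com': No scheme supplied. ...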