recalapp / recal

First a COS 333 project, now a very popular tool at Princeton for course selection
http://recal.io
MIT License
12 stars 3 forks source link

Scraper does not update existing meeting times #323

Closed maximz closed 7 years ago

maximz commented 7 years ago

The registrar rescheduled MAT325 from 1:30pm to 11am.

The webfeed is up-to-date: http://etcweb.princeton.edu/webfeeds/courseofferings/?term=1174&subject=MAT&catnum=325

We rescraped, but the Meeting object (id 84534012) still has a start time of 1:30pm.

maximz commented 7 years ago

PR here: https://github.com/maximz/recal/pull/324 (#324)

Interestingly, dev tier had correct time. How is this possible, given that we scraped on dev tier before we did on prod?

maximz commented 7 years ago

Turns out scraper has this error, seemingly on prod only:

Scraping for semester 1174 get() returned more than one Course -- it returned 2!

You can read scraper logs using: heroku logs --app newice -n 1500 | grep scheduler

This is where the issue lies.

maximz commented 7 years ago

Here's how to find the culprit:

> sudo apt-get install postgresql-client # need local psql
> heroku pg:psql --app newice
> newice::COPPER=> select registrar_id,semester_id, count(*) from course_selection_course GROUP BY registrar_id,semester_id HAVING count(*) > 1;
 registrar_id | semester_id | count
--------------+-------------+-------
 1174000722   |          25 |     2
(1 row)

More details:

> newice::COPPER=> select * from course_selection_course where registrar_id = '1174000722';
id  |           title            |                                                                                                                                                                                                                              description                                                                                                                                                                                                                              | registrar_id | semester_id | rating
------+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+-------------+--------
 3903 | Topics in Modern Astronomy | This course provides a broad overview of modern astronomy and astrophysics for students in the sciences. Emphasis is on the application of basic physics to understanding of astronomical systems. Topics include the Solar System; planetary systems and exoplanets; the birth, life, and death of stars; white dwarfs, neutron stars, and black holes; the Milky Way and distant galaxies; cosmology, dark matter and dark energy, and the history of the Universe. | 1174000722   |          25 |      0
 3902 | Topics in Modern Astronomy | This course provides a broad overview of modern astronomy and astrophysics for students in the sciences. Emphasis is on the application of basic physics to understanding of astronomical systems. Topics include the Solar System; planetary systems and exoplanets; the birth, life, and death of stars; white dwarfs, neutron stars, and black holes; the Milky Way and distant galaxies; cosmology, dark matter and dark energy, and the history of the Universe. | 1174000722   |          25 |      0

AST204 comes up twice in search results! And can sign up for it twice: image

Dyland says:

i think this is a one time bug the race condition must have happened the first time we ran the scraper because we had to create the course twice

Naphat:

i think a transaction would have prevented this https://docs.djangoproject.com/en/1.10/topics/db/transactions/

Resolution: torch one of the two classes

maximz commented 7 years ago

We need to consolidate two classes into one. We may have to do this again in the future, so recording how to do this here.

The hard part of this is that enrollments are stored as json blobs. We have to parse and correct these so that some students aren't left enrolled in a class that no longer exists. While we're at it, we should fix their enrollments to point to the right class rather than deleting them altogether.

Here's what enrollment json looks like:

[
  {
    "course_id": 2434,
    "color": {
      "dark": "#614996",
      "id": 7,
      "light": "#dcd5e2",
      "resource_uri": "/api/v1/color_palette/7/"
    },
    "sections": [
      5854,
      5856
    ]
  },
  {
    "course_id": 2547,
    "color": {
      "dark": "#2f7770",
      "id": 6,
      "light": "#d1e7e4",
      "resource_uri": "/api/v1/color_palette/6/"
    },
    "sections": [

    ]
  },
  {
    "course_id": 2596,
    "color": {
      "dark": "#385C92",
      "id": 2,
      "light": "#d5dcec",
      "resource_uri": "/api/v1/color_palette/2/"
    },
    "sections": [
      6233
    ]
  }
]

Code to rewrite enrollments:

import json
from django.db.models import Q
from course_selection.models import *
course_num = 3903 # old course ID -- the one to be destroyed
correct_course_num = 3902 # new course ID

section_mapping = {9179:9178} # map from old to new section
filter_str = '"course_id":%d,' % course_num 
matching_schedules = Schedule.objects.filter(Q(enrollments__icontains=filter_str)) # search the json

for sched in matching_schedules:
    # first, check if user happens to be enrolled in both the right and the wrong version of this class!
    # don't want to double enroll in correct class
    already_enrolled_in_right_class = (correct_course_num in [course_obj['course_id'] for course_obj in json.loads(sched.enrollments)]) 

    # rewrite course objects
    new_course_objs = []
    for course_obj in json.loads(sched.enrollments):
        if course_obj['course_id'] == course_num:
            if already_enrolled_in_right_class:
                continue # don't fix this wrong record, just delete it entirely, because already enrolled in right class
            new_sections = []
            for section_id in course_obj['sections']:
                new_sections.append(section_mapping.get(section_id, section_id)) # remap section id if it needs to be remapped
            course_obj['sections'] = new_sections
            course_obj['course_id'] = correct_course_num
        new_course_objs.append(course_obj)
    sched.enrollments = json.dumps(new_course_objs)
    sched.save()

In this case, len(matching_schedules) was 22.

Then, destroy bad course and everything associated with it:

wrong_course = Course.objects.get(id=course_num)
wrong_course.delete()

this will automatically kill: Section (--> Meeting), Course_Listing, per https://docs.djangoproject.com/en/1.10/topics/db/queries/:

When Django deletes an object, by default it emulates the behavior of the SQL constraint ON DELETE CASCADE – in other words, any objects which had foreign keys pointing at the object to be deleted will be deleted along with it.

How we found those course and section IDs: course IDs are from above, and section IDs can be found as follows.

select * from course_selection_section where course_id=3903;
  id  | name | isDefault | section_type | course_id | section_capacity | section_enrollment | section_registrar_id 
------+------+-----------+--------------+-----------+------------------+--------------------+----------------------
 9179 | L01  | f         | LEC          |      3903 |               70 |                  0 | 40368
(1 row)

select * from course_selection_section where course_id=3902;
  id  | name | isDefault | section_type | course_id | section_capacity | section_enrollment | section_registrar_id 
------+------+-----------+--------------+-----------+------------------+--------------------+----------------------
 9178 | L01  | f         | LEC          |      3902 |               70 |                  0 | 40368
(1 row)

We chose 3902 to be the canonical course, and we blew away 3903. Thus 9178 is the canonical section, and 9179 is wrong.

testing on dev tier

To test on dev tier, we chose the classes ast204, ast403. On test-recal these have course IDs 3935, 3936 and section IDs are 9229, 9230 respectively. This was found as follows:

test-recal::AQUA=> select * from course_selection_semester;
 id | start_date |  end_date  | term_code 
----+------------+------------+-----------
  1 | 2015-09-16 | 2016-01-31 | 1162
  2 | 2016-02-01 | 2016-05-31 | 1164
  3 | 2016-09-14 | 2017-02-05 | 1172
  4 | 2017-02-06 | 2017-06-06 | 1174
# now we know it's semester 4.
select id from course_selection_course where title = 'Topics in Modern Astronomy' and semester_id=4;
select id from course_selection_section where course_id=3935;
select id from course_selection_section where course_id=3936;

Here are the three cases tested:

I created three schedules matching those cases, and ran with these values:

course_num = 3936 # old
correct_course_num = 3935 # new
section_mapping = {9230:9229} # from wrong to right numbers

All three schedules as a result only had ast204.

And the destruction worked:

wrong_course = Course.objects.get(id=course_num)
wrong_course.delete()

After this, in incognito, we could no longer add the class. (Although on first run in incognito, we got 503s from loading the courses api -- maybe this was cache clearing on backend? Needs investigation.)

The associated sections got removed, as Django promised: select * from course_selection_section where course_id=3936; returns empty.

deploying

When we ran on prod in python manage.py shell, in incognito we could search for the old course. The fix was to run clear_cache to purge the server-side cache. But some users might have local cache of the courses API. We should monitor to make sure no one signs up for the missing class later. In the future, we should cache bust API urls.

Finally, scraping works again:

...
Scraping for semester 1174
0 new courses
1663 total courses
0 new sections
3068 total sections
3093 new meetings
3093 total meetings
0 new professors
2034 total professors
0 new listings
2979 total listings
----------------------------------
course selection: courses scraped successfully
course selection: cache regenerated for term 1162
course selection: cache regenerated for term 1164
course selection: cache regenerated for term 1172
course selection: cache regenerated for term 1174
course selection: courseapi cache cleared.