rosshamish / classtime

university schedule generation and course data API
MIT License

UAlberta database might have some invalid data. What should we do? #30

Closed rosshamish closed 9 years ago

rosshamish commented 9 years ago

It seems that Mec E 265's lab sections have some NULL values

...
day: NULL,
startTime: NULL,
endTime: NULL

This means we can't put them on the timetable because they don't have a logical place on it.

I noticed these "NULL-day/time-attributed"/"ghost" sections a week or two ago. They were usually not-consistently-meeting sections, like "anytime help rooms" in first year Chem. So, I handled that case by ignoring the section when checking for conflicts.

But, this MEC E 265 LAB is not a "do any time you want" section: my Mec E friend is taking it right now. I looked on beartracks and the day/time for each lab was valid and visible. So, it's possible that the database we access is not the same as the database beartracks accesses. This means we can't guarantee our data is complete anymore.

If there is only a small number of these incorrect/invalid entries, we could log each occurrence, then review the logs to identify the problem sections as they are found by users (or by us, since we could automate the process). We could then manually get the correct data from beartracks and keep it in an "exception list" for downloaded sections: when a section comes from the server, the manually-found data is applied to it.
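A minimal sketch of the log-and-override idea, assuming sections are plain dicts keyed by a `class` number. `MANUAL_OVERRIDES` and `patch_section` are hypothetical names, and the override entry is a placeholder, not real data:

```python
import logging

logger = logging.getLogger(__name__)

TIMETABLE_KEYS = ('day', 'startTime', 'endTime')

# Hypothetical exception list: class number -> timetable data copied by hand
# from beartracks. The entry below is a placeholder, not a real section.
MANUAL_OVERRIDES = {
    '00000': {'day': 'M', 'startTime': '08:00 AM', 'endTime': '10:50 AM'},
}


def patch_section(section, overrides=MANUAL_OVERRIDES):
    """Return the section unchanged if it is complete, else a patched copy."""
    if all(section.get(key) is not None for key in TIMETABLE_KEYS):
        return section
    patched = dict(section)
    override = overrides.get(section.get('class'))
    if override is None:
        # No manual data yet: log it so the section can be reviewed later.
        logger.warning('section %s has no day/time and no manual override',
                       section.get('class'))
        return patched
    patched.update(override)
    return patched
```

Sections with complete data pass through untouched, so this could run on every downloaded section without affecting the good ones.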

We could also try to contact someone at the U to see if we could contribute back to the DB, or file a support ticket or something.

Thoughts?

rosshamish commented 9 years ago

Looks like Chem 103 Labs have a similar issue.

class section component day startTime endTime
60881 T01 LAB W 08:00 AM 10:50 AM
60883 T02 LAB W 08:00 AM 10:50 AM
60885 N01 LAB M 08:00 AM 10:50 AM
60887 N02 LAB {null} {null} {null}
60889 Y01 LAB F 08:00 AM 10:50 AM
60891 Y02 LAB F 08:00 AM 10:50 AM
60893 U01 LAB W 02:00 PM 04:50 PM
60895 U02 LAB W 02:00 PM 04:50 PM
60897 Q01 LAB M 02:00 PM 04:50 PM
60899 Q02 LAB {null} {null} {null}
60901 Z01 LAB F 02:00 PM 04:50 PM
60903 Z02 LAB F 02:00 PM 04:50 PM
64603 F01 LAB T 06:00 PM 08:50 PM
64605 F02 LAB {null} {null} {null}
71331 V01 LAB W 06:00 PM 08:50 PM
71333 V02 LAB W 06:00 PM 08:50 PM

This is doubly bad because the {null} entries tend to get scheduled more often, since they never conflict with anything.
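To illustrate why, here is a rough sketch of a conflict check that skips sections without timetable data. It is not the actual classtime code: `conflicts` and `parse_time` are hypothetical helpers, and it assumes `day` is a string of single-letter day codes and times look like `08:00 AM`:

```python
from datetime import datetime


def parse_time(text):
    """Parse a time string such as '08:00 AM' into a datetime.time."""
    return datetime.strptime(text, '%I:%M %p').time()


def conflicts(a, b):
    """Return True if sections a and b overlap on some shared day."""
    for section in (a, b):
        if None in (section.get('day'), section.get('startTime'),
                    section.get('endTime')):
            # No timetable data: treated as never conflicting.
            return False
    if not set(a['day']) & set(b['day']):
        return False
    return (parse_time(a['startTime']) < parse_time(b['endTime'])
            and parse_time(b['startTime']) < parse_time(a['endTime']))
```

Because a {null} section returns `False` against every other section, the scheduler can always fit it in, which is why it shows up so often.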

I was hoping not to find any of these cases, but it seems they must be fairly common. Do you think it's best to fix these entries manually? Or should we just ignore {null} sections?

rosshamish commented 9 years ago

After applying this bandaid-fix, the test case "Taylor Rault's 2nd Year Fall Term 2014" fails to find any schedules. This is bad.

ahoskins commented 9 years ago

I don't know what the best thing to do is. We have to talk about this asap though. (It's also the kind of thing that's best sorted out while we're both living in Edmonton.)

ahoskins commented 9 years ago

So we don't want to ignore sections. It's definitely important to consider every course offered when making a schedule. I wonder if these NULLs only happen on labs when there are repeats. And I don't know if I'm just being bad, but where can I find this section data? It doesn't seem to be in my local db in /tmp/.

If it only happens on repeat labs (like it does in the example you posted above), then that's way less of a problem. Like, in the above, you could infer the day and the start and end times of each one from the entry prior to it. But if this happens to random lecture sections, then we obviously have a much more serious bad bad problem.
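A sketch of that inference, assuming the sections arrive in the same order as the table above and that a NULL row really is a repeat of the row before it (which is only a guess). `infer_from_previous` and the `inferred` flag are hypothetical:

```python
TIMETABLE_KEYS = ('day', 'startTime', 'endTime')


def infer_from_previous(sections):
    """Fill NULL day/time fields from the closest earlier complete section."""
    filled = []
    previous = None
    for section in sections:
        section = dict(section)  # leave the caller's dicts untouched
        if section.get('day') is None and previous is not None:
            for key in TIMETABLE_KEYS:
                section[key] = previous[key]
            # Mark the guess so it can be reviewed or shown differently.
            section['inferred'] = True
        else:
            if section.get('day') is not None:
                previous = section
        filled.append(section)
    return filled


rows = [
    {'section': 'N01', 'day': 'M', 'startTime': '08:00 AM', 'endTime': '10:50 AM'},
    {'section': 'N02', 'day': None, 'startTime': None, 'endTime': None},
    {'section': 'Y01', 'day': 'F', 'startTime': '08:00 AM', 'endTime': '10:50 AM'},
]
result = infer_from_previous(rows)
```

Here `N02` would pick up Monday 08:00 AM to 10:50 AM from `N01`. Whether that guess is right would still need checking against beartracks.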

So to summarize: Have you checked whether there are NULL entries on lectures (or on any random section whose times we couldn't infer)? And also, where can I find section times and days so I can look for myself?

Thx cool cool.

ahoskins commented 9 years ago

Conclusions:

Email someone from the University of Alberta asking about the validity of the LDAP server. Just ask about how valid it is, don't go complaining to them with our hopes and dreams for the LDAP server.

Carry on with front-end and back-end work. Much of the functionality is interoperable anyways.

rosshamish commented 9 years ago

@ahoskins So yesterday while I was refactoring, I was forced to read all the old AcademicCalendar and LDAPDatabaseConnection code.

Because of that, I had an epiphany last night + this morning.

The AcademicCalendar, when it searches for sections in a course, is forced to grab timetable information from LDAP classtime objects. The documentation says "Typically, each class will have zero or one but some have many".

This is the code that fetches the timetable info:

            if len(classtimes) == 1:
                classtime = classtimes[0]
            else:
                classtime = dict()
            section['day'] = classtime.get('day')
            section['location'] = classtime.get('location')
            section['startTime'] = classtime.get('startTime')
            section['endTime'] = classtime.get('endTime')

It does not account for more than one classtime. It considers the case of 1 classtime, and treats everything else as the case of 0 classtimes.

I dropped into the debugger in that "else" branch to see what it was catching. As expected, it caught the ENGG 299 sections, which aren't on the timetable.

It also caught a couple like this:

(Pdb) section.get('asString')
u'MEC E 265 LAB D2'
(Pdb) import pprint;pprint.pprint(classtimes)
[{'day': u'W',
  'endTime': u'04:50 PM',
  'location': u'MEC 3 1',
  'startTime': u'02:00 PM'},
 {'day': u'W',
  'endTime': u'04:50 PM',
  'location': u'MEC 3 28',
  'startTime': u'02:00 PM'},
 {'day': u'W',
  'endTime': u'04:50 PM',
  'location': u'MEC 3 3',
  'startTime': u'02:00 PM'}]

This is the Mec E 265 Lab referenced earlier in this thread. It doesn't have null timetable info, it has extra timetable info.

Similarly, here are the Chem 103 Lab sections referenced earlier in the thread:

(Pdb) section['asString']
u'CHEM 103 LAB F02'
(Pdb) classtimes
[{'endTime': u'08:50 PM', 
  'day': u'T', 
  'startTime': u'06:00 PM'},
 {'endTime': u'01:00 AM', 
  'day': u'W', 
  'startTime': u'01:00 AM'}]

Same thing. It has extra timetable info. In this case, however, it looks like garbage data - a class from 1:00AM to 1:00AM on Wednesday makes no sense.
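One possible heuristic for spotting that kind of garbage entry, sketched under the assumption that a real meeting always ends after it starts. `is_plausible` and `parse_time` are hypothetical helpers, not part of the existing code:

```python
from datetime import datetime


def parse_time(text):
    """Parse a time string such as '06:00 PM' into a datetime.time."""
    return datetime.strptime(text, '%I:%M %p').time()


def is_plausible(classtime):
    """Return True if the classtime has a start strictly before its end."""
    start = classtime.get('startTime')
    end = classtime.get('endTime')
    if start is None or end is None:
        return False
    return parse_time(start) < parse_time(end)


# The two classtimes the debugger showed for CHEM 103 LAB F02.
classtimes = [
    {'endTime': u'08:50 PM', 'day': u'T', 'startTime': u'06:00 PM'},
    {'endTime': u'01:00 AM', 'day': u'W', 'startTime': u'01:00 AM'},
]
plausible = [c for c in classtimes if is_plausible(c)]
```

With this check only the Tuesday 06:00 PM to 08:50 PM entry survives. It would not help with MEC E 265, where all three entries look valid and differ only by location.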

Conclusions

There may still be missing timetable data for some sections. However, at least some of the data we thought was missing is in fact not missing. It just has extraneous information, and the code is ignoring it.

So, the problem is partly on our end. This is a really good thing.

I will push a fix sometime that detects multiple classtime objects, and uses the first classtime object of the list instead of ignoring them all. This might not be perfect (maybe there is special meaning to having multiple classtimes?), but it is certainly better than not scheduling them at all.
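Roughly what that fix could look like, as a sketch rather than the final code. `pick_classtime` and `attach_timetable` are hypothetical names, and `section` and `classtimes` are assumed to have the same shape as in the snippet above:

```python
def pick_classtime(classtimes):
    """Use the first classtime if there is at least one, else an empty dict."""
    if classtimes:
        return classtimes[0]
    return dict()


def attach_timetable(section, classtimes):
    """Copy day, location and times from the chosen classtime onto the section."""
    classtime = pick_classtime(classtimes)
    for key in ('day', 'location', 'startTime', 'endTime'):
        section[key] = classtime.get(key)
    return section


# The three classtimes the debugger showed for MEC E 265 LAB D2.
classtimes = [
    {'day': u'W', 'endTime': u'04:50 PM', 'location': u'MEC 3 1', 'startTime': u'02:00 PM'},
    {'day': u'W', 'endTime': u'04:50 PM', 'location': u'MEC 3 28', 'startTime': u'02:00 PM'},
    {'day': u'W', 'endTime': u'04:50 PM', 'location': u'MEC 3 3', 'startTime': u'02:00 PM'},
]
section = attach_timetable({'asString': u'MEC E 265 LAB D2'}, classtimes)
```

The zero-classtime case still ends up with `None` for every field, so the ENGG 299 style sections behave as before. The extra locations are dropped, which is the "might not be perfect" part.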

Pce out.