pittcsc / PittAPI

An API to easily get data from the University of Pittsburgh
https://pittapi.pittcsc.org
GNU General Public License v2.0
108 stars 33 forks source link

Course Information is incorrect for detail lists of length 6 #8

Closed rchatrath7 closed 7 years ago

rchatrath7 commented 7 years ago

For some subjects,

    details = [course_detail.string.replace(' ', '').strip()
                       for course_detail in course
                       if course_detail.string is not None
                       and len(course_detail.string.strip()) > 2]

grabs the subject at index 0. This causes the returned list of course_details to be incorrect, since with the current conditionals, it's setup to have the catalog_number be set to what would then be the subject. This was found for every subject except 4. For now, an easy fix is to check lists of length 6, and accommodate, but the underlying reason for why certain subjects (CS, JS, SA, FP) return lists of length 5 instead of 6 (ie not including the subject in the parsed result).

screen shot 2017-01-13 at 7 25 10 pm
RitwikGupta commented 7 years ago

Rakesh, my scraping is probably a bit messed up here. I'll try and take a look at it when I have a bit of time in the coming week. In the meantime, do you want to take a stab at it?

rchatrath7 commented 7 years ago

I have a basically duct taped commit that fixes the issue - I'll open a PR for it in the next few days. Only thing I'm concerned with is that due to the weird inconsistencies of Pitt's tables, there are like two random edge cases that I had to check with nasty conditionals (I'll try to figure out a way to improve the scraping so those aren't a problem sometime in the near future).

atheodule commented 7 years ago

I found a solution that has been working for me.

Get rid of the "and len(course_detail.string.strip()) > 2]" line and parse the course information like this:

course_details.append( { 'subject': details[1], 'catalog_number': details[3], 'term': details[5].replace('\r\n\t', ' '), 'class_number': course.find('a').contents[0], 'title': details[8], 'instructor': details[10], 'credits': details[12] }

I think all the courses are parsed like this: ['', 'CS', '', '0449', '', '2171 \xa0AT', '', '', 'Intro To Systems Software', '', 'Misurda,Jonathan R', '', '3 cr.', '']

rchatrath7 commented 7 years ago

@atheodule That looks good to me, I ended up taking a different approach, but I like yours better. I was thinking that for cleanliness, we could replace the empty strings (in cases when there's an instructor or what have you) with "Not Decided" or something. I'll open a PR for a modified version of what I'm talking about after testing it against every course (I wrote some code to grab every subject, so it shouldn't be too hard to check :3)

RitwikGupta commented 7 years ago

Awesome, thanks guys!