Closed rchatrath7 closed 7 years ago
Rakesh, my scraping is probably a bit messed up here. I'll try and take a look at it when I have a bit of time in the coming week. In the meantime, do you want to take a stab at it?
I have a basically duct taped commit that fixes the issue - I'll open a PR for it in the next few days. Only thing I'm concerned with is that due to the weird inconsistencies of Pitt's tables, there are like two random edge cases that I had to check with nasty conditionals (I'll try to figure out a way to improve the scraping so those aren't a problem sometime in the near future).
I found a solution that has been working for me.
Get rid of the "and len(course_detail.string.strip()) > 2]" line and parse the course information like this:
course_details.append( { 'subject': details[1], 'catalog_number': details[3], 'term': details[5].replace('\r\n\t', ' '), 'class_number': course.find('a').contents[0], 'title': details[8], 'instructor': details[10], 'credits': details[12] }
I think all the courses are parsed like this: ['', 'CS', '', '0449', '', '2171 \xa0AT', '', '', 'Intro To Systems Software', '', 'Misurda,Jonathan R', '', '3 cr.', '']
@atheodule That looks good to me, I ended up taking a different approach, but I like yours better. I was thinking that for cleanliness, we could replace the empty strings (in cases when there's an instructor or what have you) with "Not Decided" or something. I'll open a PR for a modified version of what I'm talking about after testing it against every course (I wrote some code to grab every subject, so it shouldn't be too hard to check :3)
Awesome, thanks guys!
For some subjects,
grabs the subject at index 0. This causes the returned list of
course_details
to be incorrect, since with the current conditionals, it's setup to have thecatalog_number
be set to what would then be the subject. This was found for every subject except 4. For now, an easy fix is to check lists of length 6, and accommodate, but the underlying reason for why certain subjects (CS, JS, SA, FP) return lists of length 5 instead of 6 (ie not including the subject in the parsed result).