ryanhugh / searchneu

Search over Classes, Professors and Employees at NEU!
https://searchneu.com
GNU Affero General Public License v3.0

Integrate TRACE prof reviews #3

Open julian-zucker opened 6 years ago

julian-zucker commented 6 years ago

You can find the data here.

ryanhugh commented 6 years ago

Awesome!!!!!! I was totally planning on writing scrapers for the TRACE surveys haha. This will help a lot! :D

Right now, all the scrapers on Search NEU are run daily so the data is never more than a day old. I think a good plan would be to incorporate some of your code into Search NEU so the TRACE survey data is automatically updated too.

julian-zucker commented 6 years ago

Sounds good, a few things to note:

The scraper requires Selenium to be run headfully, which takes lots of CPU + RAM and might be expensive or break things if you're running the scrapers on an EC2 instance or similar. The scraper also takes around 24 hours to run. Also, it requires exploiting a security vulnerability on Shibboleth to get access to TRACE. I won't post the URL I used to log in to TRACE because it also allows people to log into every website behind a Shibboleth login, but if you're interested in scraping the data automatically I can show you how to generate one for your own husky email.

ryanhugh commented 6 years ago

lots of CPU + RAM ... and takes 24 hours to run

The current scrapers for Search NEU run on Travis-CI ( https://travis-ci.org/ryanhugh/searchneu/builds ) as a daily cron job. All of the jobs on that link that are running the prod branch and are marked as cron jobs are the scrapers running. The ones that failed usually did so because Travis killed the job for using too much RAM or because NEU's site was down.

The Search NEU scrapers usually take 15-30 min to run. Travis has a limit of 3GB of RAM and 50 min of running time (if exceeded, it will kill the job/process). The current scrapers send about 100k HTTP requests and download just over 1GB of HTML. They send out about 100 requests per second on average. I've done a lot of optimizations to make this super fast :D
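
As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It only uses the figures quoted above; rounding "just over 1GB" down to exactly 1 GB is an assumption, and the function names are made up for illustration.

```python
# Figures quoted in the comment above.
TOTAL_REQUESTS = 100_000      # "about 100k HTTP requests"
REQUESTS_PER_SECOND = 100     # "about 100 requests per second on average"
TOTAL_DOWNLOAD_BYTES = 1e9    # "just over 1GB of HTML", rounded down (assumption)
TRAVIS_LIMIT_MINUTES = 50     # Travis kills jobs that run longer than this


def estimated_runtime_minutes(total_requests, requests_per_second):
    """Time spent purely on sending requests at the average rate."""
    return total_requests / requests_per_second / 60


def average_response_kb(total_bytes, total_requests):
    """Average size of one downloaded page, in kilobytes."""
    return total_bytes / total_requests / 1000


runtime = estimated_runtime_minutes(TOTAL_REQUESTS, REQUESTS_PER_SECOND)
page_kb = average_response_kb(TOTAL_DOWNLOAD_BYTES, TOTAL_REQUESTS)

print(f"Estimated runtime: {runtime:.1f} min (limit: {TRAVIS_LIMIT_MINUTES} min)")
print(f"Average response size: {page_kb:.0f} KB")
```

That works out to roughly 17 minutes of request time at about 10 KB per page, which lines up with the 15-30 min range and stays under the 50 min limit.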

Not 100% sure what the plan is yet but I might try to write some myneu/TRACE scrapers with raw HTTP requests (request/cheerio) instead of incorporating selenium just to keep the scraping time down. Definitely interesting and useful to see how other people have gone about solving this challenge!
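
To illustrate the "parse the HTML directly instead of driving a browser" idea, here is a minimal sketch. It uses Python's standard-library `html.parser` as a stand-in for request/cheerio, and the sample HTML, the `LinkCollector` class and `extract_links` are all hypothetical; the real pages will look different.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from <a> tags without a browser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> tag with an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((text, self._current_href))
            self._current_href = None


# Made-up markup, only here so the sketch runs on its own.
SAMPLE_HTML = """
<table>
  <tr><td>Example Course 101</td><td><a href="/report?id=1">View</a></td></tr>
  <tr><td>Example Course 102</td><td><a href="/report?id=2">View</a></td></tr>
</table>
"""


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


for text, href in extract_links(SAMPLE_HTML):
    print(text, href)
```

Parsing a downloaded page like this is just string processing, so it avoids the CPU and RAM cost of running a full browser for every page.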

exploiting a security vulnerability ... I can show you how to generate one for your own husky email.

Definitely interested in the details behind this, but it sounds like we should chat privately. Shoot me a FB request at https://www.facebook.com/ryan.hughes.35 !

dajinchu commented 5 years ago

I don't mean to resurrect such an old thread but I was wondering if the TRACE integration went anywhere? Was about to build myself a TRACE viewer when I checked to see if anyone already did it, and found this thread.

ryanhugh commented 5 years ago

TBH this hasn't gone anywhere for a while. If you are interested in building one I'd be happy to give some pointers.

dajinchu commented 5 years ago

That would be really helpful. Do you know if the only way to scrape TRACE is through the security vulnerability Julian mentioned?

edward-shen commented 5 years ago

If anything, you won't get anything by asking nicely:

Hello Edward,

Sorry for the delay in response, I was gathering information on our privacy policy in regards to your request. Unfortunately, we cannot give you access to the information you are seeking for this project due to privacy restrictions. Student use is regulated to seeing instructors past evaluations so they can be able to know which class they’d like to take with who. Analyzing the data the way you are inquiring is only allowed for a limited few and cannot be extended to students.

I’m sorry I cannot be of more help. I wish you luck in your studies.

I suppose someone could theoretically just run a web driver that logs in to TRACE with their credentials, theoretically scrape the "View" link for the 3 sp GET parameters for each row, and theoretically get the Excel file for that class from /eval/new/showreport/excel?r=2&c=xxxxx&i=xxxxx&t=xxxxx&d=false, replacing the c, i, and t values with the 3 sp values respectively, and theoretically parse the Excel sheet since it has formatted data.

Of course, I haven't tried it out myself nor would I ever condone doing something that breaks their AUP/TOS, but theoretically one could perform the above and get the results that you want.
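
For the URL template quoted above, the query string can be filled in with standard encoding rather than string concatenation. This is a minimal sketch, not something anyone in this thread has run: `build_report_url` and `base_url` are hypothetical names, the host is deliberately left out, and the values below are placeholders.

```python
from urllib.parse import urlencode

# Path and fixed parameters (r=2, d=false) as quoted in the comment above.
REPORT_PATH = "/eval/new/showreport/excel"


def build_report_url(c, i, t, base_url=""):
    """Fill the c, i and t query parameters into the quoted URL template.

    base_url is a hypothetical parameter for the site's origin; it defaults
    to an empty string so the result is a relative URL.
    """
    query = urlencode({"r": 2, "c": c, "i": i, "t": t, "d": "false"})
    return f"{base_url}{REPORT_PATH}?{query}"


# Placeholder values only.
print(build_report_url("xxxxx", "xxxxx", "xxxxx"))
```

Using `urlencode` keeps the parameters in the same order as the template and escapes any characters that would otherwise break the query string.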

ryanhugh commented 5 years ago

One of the big hurdles with scraping TRACE info is that students have to log in to view it. If you write scrapers to solve this, your scrapers are going to need access to a student's username and password to log in and access the TRACE surveys.

I would suggest having the scrapers start from the first login page, log in, and then scrape the TRACE surveys every time, instead of trying to save the login tokens (we don't know how long the tokens last, we don't know if they will be revoked some time in the future, etc.).

Julian's method allowed him to skip a few steps with the login process, but the source code would still have to have access to a token associated with a NEU login account. I'm not sure how reliable this trick will be in the future so I'm not too sure I would recommend using it for a long-term project.

If you do do this, please take care and don't post the output of the scrapers publicly. Also please be mindful of NEU's policies. Feel free to message us over FB for more details.

https://m.me/ryanhughez

dajinchu commented 5 years ago

Thanks for the info Eddy. I definitely won't try that out.

Ryan, I figured the logins would be an issue... Do Northeastern policies mean that we will never see a searchneu with TRACE integrated?

edward-shen commented 5 years ago

Not without requiring the users to sign in with NEU's SSO, which might just be OAuth2 or might also be something else.

ryanhugh commented 5 years ago

We would have to get their approval before we gather any pieces of information that students have to log in to view (stuff that is inside of MyNEU) and add it to Search NEU. There is no guarantee that we will be able to get this, but this is something I can work on if you want to work on scraping stuff!

dajinchu commented 5 years ago

Built this last weekend at HackDartmouth: https://trace.dajinchu.now.sh/ Development is ongoing and it might be of interest to y'all.

edward-shen commented 5 years ago

Nice! I had something in the works as well for TRACE (edward-shen/tea) but development stopped just before I implemented searching.

ryanhugh commented 5 years ago

Super cool - great work!! I'm trading a few emails with NEU ITS now, would love to get approval from NEU so we can post this stuff publicly.