Figure out how to automatically scrape

domluna commented 10 years ago

The YorkU website is literally a clusterfuck for scraping, but it would be really awesome if we could do it automatically. I'm not even sure this is completely possible, given the absurd HTML layout and the fact that the URLs don't make any sense.

Accounting - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7

Biology - https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo/2Ut0tG0DUArPP653ACehWw/1.1.10.7

Notice they're the same URL! WTF!

Also, I think it's putting cookies in the URL, because these URLs expire after a short while.

Anyway, the HTML soup can be dealt with; it's the URL structure not making any sense that worries me. The structure we would want would be something like

https://www.yorku.ca/courses/2014-15/{Term}/{Subject}

but I guess that would make too much sense.

rajiteh commented 9 years ago

@domluna Hello! I think the issue here is that York uses Apple WebObjects, a framework that saw its last release like 6 years ago. x(

After investigating the URL structure and reading some ancient documentation, I found that the apparent garbage in the URL seems to contain a wosid (WebObjects session ID) that uniquely identifies each user and their context, i.e., the current page. This token seems to be generated only when requesting the root of the app, and it expires pretty fast.

I was able to get a proof of concept of fully automated parsing by improving @mlisbit's native parser script. See: #2
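To make the URL layout concrete, here is a minimal Python sketch that splits a URL like the ones above into its parts. It assumes the path segment right after `/wo/` is the wosid and the trailing dotted number identifies the page element; the names `WOUrl` and `parse_wo_url` are made up for illustration and are not from the parser script.

```python
import re
from typing import NamedTuple, Optional


class WOUrl(NamedTuple):
    app: str         # application bundle, e.g. "cdm.woa"
    session_id: str  # assumed to be the wosid
    element_id: str  # dotted page/element identifier, e.g. "1.1.10.7"


# Assumed layout: .../<app>.woa/<instance>/wo/<session id>/<element id>
_WO_PATTERN = re.compile(
    r"/(?P<app>[^/]+\.woa)/(?:[^/]+/)?wo/(?P<sid>[^/]+)/(?P<elem>[\d.]+)$"
)


def parse_wo_url(url: str) -> Optional[WOUrl]:
    """Return the pieces of a WebObjects-style URL, or None if it does not match."""
    match = _WO_PATTERN.search(url)
    if match is None:
        return None
    return WOUrl(match["app"], match["sid"], match["elem"])
```

Running it on the Accounting and Biology links above would give the same session ID and the same element ID for both, which fits the idea that the subject is not encoded in the URL at all.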

Cheers!

domluna commented 9 years ago

Geez, 6 years! That's way before my time. If you found a way around all of that, well, that's just wonderful.

So then, is there a way to uniquely identify a page consistently?

As an example, say it encodes the anthropology page as 10.2.3 or something like that. Does it always give back 10.2.3?

If it has this property, we can make a mapping from the course type to the weird encoding and mine the pages that way.
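Roughly what I have in mind, as a Python sketch. Everything here is hypothetical: `10.2.3` is just the made-up encoding from my example, `ELEMENT_IDS` and `page_url` are illustrative names, and the base path is copied from the links above on the assumption that only the session ID and the trailing encoding change.

```python
# Base path copied from the example links; assumed to stay fixed between sessions.
BASE = "https://w2prod.sis.yorku.ca/Apps/WebObjects/cdm.woa/20/wo"

# Hypothetical mapping from subject to its page encoding (placeholder value).
ELEMENT_IDS = {
    "anthropology": "10.2.3",
}


def page_url(session_id: str, subject: str) -> str:
    """Build the page URL for a subject using a freshly obtained session ID."""
    element_id = ELEMENT_IDS[subject.lower()]
    return f"{BASE}/{session_id}/{element_id}"
```

This only works if the encoding for a given page is stable across sessions, which is exactly the question.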


rajiteh commented 9 years ago

@domluna Those seem to be consistent; however, there is no guarantee that they will stay the same in the future.

For example, the endpoint '1.1.10.7' seems to be a method accepting two POST variables, sessionPopUp and subjectPopUp, that define the semester and subject category. This information by itself should be enough to get a working parser, at least given the way the site is structured at the moment.
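As a rough illustration, such a request could be put together in Python like this. It is only a sketch: the function name `build_subject_request` is made up, the values the two form fields actually expect are unknown to me (the arguments are placeholders), and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_subject_request(
    base_url: str,
    session_id: str,
    session_value: str,
    subject_value: str,
    element_id: str = "1.1.10.7",
) -> Request:
    """Prepare (but do not send) the POST that selects a semester and subject.

    base_url is everything up to and including ".../wo"; session_id is the
    wosid obtained by first requesting the root of the app.
    """
    url = f"{base_url}/{session_id}/{element_id}"
    body = urlencode(
        {"sessionPopUp": session_value, "subjectPopUp": subject_value}
    ).encode("ascii")
    return Request(url, data=body, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would perform the request; since the wosid expires quickly, it would have to be fetched right before each batch of calls.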

Cheers!