jkeesh / scpd-scraper

download and convert SCPD lectures easily

re.search returns None #12

Open danfrankj opened 10 years ago

danfrankj commented 10 years ago

I'm currently trying to scrape Natural Language Processing and scrape.py fails when it executes re.search near line 105. I'll continue to try to debug but am pretty new at this.

jkeesh commented 10 years ago

Hm, yeah, I've accepted a lot of pull requests recently and haven't tested it myself for a while. There are others who have contributed. If you find a bug and a fix, just send a pull request.

danfrankj commented 10 years ago

I think it's more than a simple bug. I think something changed with scpd. I'll try to debug...

djoeman84 commented 10 years ago

I'll get to it this weekend

adotey commented 10 years ago

It looks like previously, when reaching the course page, the script searched the WMP links and parsed out a useful part (a link?) that it could simply open. Now, the WMP links are calls to Javascript functions that generate the correct URL. It's easy to construct the correct URL the same way the Javascript does, except for an authentication parameter ("slp") that's passed. If we could figure that out, this issue could be fixed.

Example HTML link:

<a href='javascript:openSL("509d37b5-f858-474c-9876-daa31c7346bb","CS221","cab421d9-1581-4f8c-989a-9cc3fdb2833d","130923","","WA","&wmp=true");'>WMP</a>

openSL (copied from chrome web inspector):

//need to go update openSL to only use one param
function openSL(collGuid, courseName, coGuid, lectureName, lectureDesc, desiredAuthType, playerType) {
    reqObj = 'coll=' + collGuid + '&course=' + courseName + '&co=' + coGuid + '&lecture=' + lectureName;
    // CourseGUIDStr + MyCollection.Name + co.GUID + co.Name + lectureType + desiredAuthType_PARAM
    if (lectureDesc == "problem session")
        reqObj += '&lectureType=ps';
    reqObj += '&authtype=' + desiredAuthType;
    PageMethods.playSLVideo(collGuid, coGuid, desiredAuthType, function (slphash) {
        if (slphash != null) {
            reqObj += '&slp=' + slphash + playerType;
            var win = window.open('http://myvideosv.stanford.edu/' + 'player/slplayer.aspx?' + reqObj);
            win.focus
        } // End if
    }  //End PageMethodsParameter
    );  //End PageMethods
}   //End OpenSL

The correct URL: http://myvideosv.stanford.edu/player/slplayer.aspx?coll=509d37b5-f858-474c-9876-daa31c7346bb&course=CS221&co=cab421d9-1581-4f8c-989a-9cc3fdb2833d&lecture=130923&authtype=WA&slp=Tt1WYJGAboTOWN9TCu6QwEY%2bmNI%3d&wmp=true

The value of slp changes every time you click the (Javascript) link.
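For anyone who wants to port this to scrape.py, here is a rough Python 3 sketch of the part that is already understood: pulling the openSL arguments out of the link and assembling the URL the same way the Javascript does. The names parse_opensl_args and build_player_url are made up for illustration and are not in the repo. The slp hash still has to come from somewhere else, so it is just passed in as a string here.

```python
import re

PLAYER_URL = "http://myvideosv.stanford.edu/player/slplayer.aspx"

# Matches the argument list of an openSL(...) call inside an href.
OPENSL_RE = re.compile(r"openSL\((.*?)\);")


def parse_opensl_args(href):
    """Return the quoted openSL arguments as a list, or None if the
    href is not an openSL call (re.search returns None in that case)."""
    match = OPENSL_RE.search(href)
    if match is None:
        return None
    return re.findall(r'"([^"]*)"', match.group(1))


def build_player_url(args, slp_hash):
    """Mirror the string concatenation in openSL.

    slp_hash is assumed to be obtained separately and to be already
    URL-encoded, as in the example URL above.
    """
    coll, course, co, lecture, lecture_desc, auth_type, player_type = args
    query = "coll=" + coll + "&course=" + course + "&co=" + co + "&lecture=" + lecture
    if lecture_desc == "problem session":
        query += "&lectureType=ps"
    query += "&authtype=" + auth_type
    query += "&slp=" + slp_hash + player_type
    return PLAYER_URL + "?" + query


if __name__ == "__main__":
    href = ('javascript:openSL("509d37b5-f858-474c-9876-daa31c7346bb","CS221",'
            '"cab421d9-1581-4f8c-989a-9cc3fdb2833d","130923","","WA","&wmp=true");')
    args = parse_opensl_args(href)
    print(build_player_url(args, "Tt1WYJGAboTOWN9TCu6QwEY%2bmNI%3d"))
```

Checking the match for None before using it also avoids the crash this issue was opened for when a link does not have the expected shape.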

jkeesh commented 10 years ago

Ah, hm. Yes, that would break it. If you have a fix, I'll merge it in. Seems like this project still gets a good amount of usage.

adotey commented 10 years ago

Unfortunately I don't. There's some authentication hash (called "slphash" in the openSL code) being generated and inserted into the URL as a required parameter (slp), but I don't know how it's being generated. Hopefully it's something you or someone else could figure out.

jkeesh commented 10 years ago

Yeah, I don't use it anymore, but I help merge pull requests since a bunch of people are still using it.

— Jeremy

djoeman84 commented 10 years ago

This looks tricky. We could try another library that allows JS calls, but I think that, although it's complicated, this JS can be picked apart and replaced with Python code.

adotey commented 10 years ago

Someone else has a working (Ruby) script: https://github.com/dennybritz/scpd-downloader

He gets the slp hash by issuing a json request for it.

djoeman84 commented 10 years ago

I made a quick bookmarklet to get the video URL, if any of you are interested: http://joon-tech.blogspot.com/2013/10/in-order-to-help-with-issues-with-scpd.html
There is also an accompanying Gist: https://gist.github.com/djoeman84/7140185

osdiab commented 10 years ago

Any update on this? There haven't been many recent commits. Is anyone working on this?