niklasb / dryscrape

[not actively maintained] A lightweight Python library that uses Webkit to enable easy scraping of dynamic, Javascript-heavy web pages
http://dryscrape.readthedocs.io/
MIT License

Strange behavior on opening complicated url #47

Closed SKART1 closed 8 years ago

SKART1 commented 8 years ago

I use dryscrape v1.0 as a tool to download stack traces from Google Play. I have downloaded the whole crash report for a period of time and now want to download the pages with the stack traces.

I have run into strange behavior. When the URL is:

https://play.google.com/apps/publish?dev_acc=18149679673077794436#ErrorClusterDetailsPlace:p=com.android&lr=LAST_6_MONTHS&sh=false&s=new_status_desc&ed=1454285015339&et=CRASH&ecn=java.lang.NullPointerException&tf=Uri.java&tc=android.net.Uri$StringUri&tm=%3Cinit%3E

It opens https://play.google.com/apps/publish/?dev_acc=18149679673077794436#AppListPlace instead of the URL above.

But if the URL is:

https://play.google.com/apps/publish/?dev_acc=18149679673077794436#ErrorClusterDetailsPlace:p=com.android&lr=LAST_6_MONTHS&sh=false&s=new_status_desc&ed=1454285015339&et=CRASH&ecn=java.lang.NullPointerException&tf=Uri.java&tc=android.net.Uri$StringUri&tm=%3Cinit%3E

(it differs only by the slash before the question mark), everything works normally.

The code is:

import time

def downloadStackTraceByLink(link, session, i):
    # Workaround (disabled): add the slash after "publish" if it is missing.
    #if link.find("publish/") == -1:
    #   link = link.replace("publish", "publish/")

    session.visit(link)

    # Sleep a bit to give the page a chance to load.
    # This is ugly; it would be better to find something
    # on the resulting page that we can wait for.
    time.sleep(10)

    # Compare the URL we ended up on with the one we requested.
    if link != session.url():
        print("WTF DUDE! Current link is: " + session.url() + "\n but was " + link)
    else:
        print("Ok " + str(i))

    session.driver.render('screenshot ' + str(i) + '.jpg')
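
The comment above asks for something better than a fixed `time.sleep(10)`. A minimal sketch of one option is a generic polling helper. The name `wait_until` and its parameters are hypothetical and not part of dryscrape; it only assumes that the caller can supply a function reporting whether the page is ready.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout expires.

    Hypothetical helper, not part of dryscrape. Returns True as soon as
    the predicate succeeds and False if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the session from the code above, the sleep could then become something like `wait_until(lambda: "ErrorClusterDetailsPlace" in session.url())`, which returns as soon as the expected place shows up in the URL and gives up after the timeout otherwise.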

The login code is:

import dryscrape

class SessionGoogle:
    def __init__(self, url_login, login, passwd):
        self.ses = dryscrape.Session()
        self.ses.visit(url_login)

        # Fill in the credentials and submit the login form.
        self.ses.at_xpath('//*[@id="Email"]').set(login)
        self.ses.at_xpath('//*[@id="Passwd"]').set(passwd)
        self.ses.at_xpath('//*[@id="signIn"]').click()

        # Save a screenshot to check the result of the login.
        self.ses.driver.render('login_result.png')

    def getSes(self):
        return self.ses

url_login = "https://accounts.google.com/ServiceLogin"

niklasb commented 8 years ago

Are you sure that is not just an HTTP redirect? You can use a proxy such as Burp Suite or Fiddler to check the exact response.

SKART1 commented 8 years ago

I will check and report back, but adding just that one slash changes everything.

niklasb commented 8 years ago

Yes, but that might be due to what the web server does with your request. It doesn't look like a bug in dryscrape.
SKART1 commented 8 years ago

I have used the same URL in "adult" browsers (Firefox and Chromium) and everything works; no redirect was detected (visually, I stayed on the desired page)...

Maybe this is not dryscrape's fault, but let me test with a proxy.

SKART1 commented 8 years ago

Yes, you were right: it is a redirect. The response body is:

<HTML>
  <HEAD>
    <TITLE>Moved Temporarily</TITLE>
  </HEAD>
  <BODY BGCOLOR="#FFFFFF" TEXT="#000000">
    <H1>Moved Temporarily</H1>
    The document has moved
    <A HREF="https://play.google.com/apps/publish/?dev_acc=18149679673077794436">here</A>
    .</BODY>
</HTML>

But it is very strange that after the redirect I receive a 403 Forbidden error! I can send you the proxy dumps if you are interested in this problem.

If I go directly to the redirect target, everything is OK...
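
Since requesting the slash-terminated path directly works, one possible workaround is to normalize each link before calling `session.visit`. The sketch below is a more careful version of the commented-out `replace` from the first post: it touches only the path component and leaves the query string and fragment alone. The function name `ensure_trailing_slash` is hypothetical, and the approach assumes every link points at a directory-style path such as `/apps/publish`; it would be wrong for paths that name a file.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_trailing_slash(url):
    """Return url with a slash appended to its path if one is missing.

    Hypothetical helper for this specific case. The query string and
    the fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    path = parts.path
    if not path.endswith("/"):
        path += "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

Calling `session.visit(ensure_trailing_slash(link))` would then skip the redirect step, and a link that already has the slash passes through unchanged.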

niklasb commented 8 years ago

Maybe you are not logged in properly? The page you linked does not seem to be publicly accessible. Anyway, since this has nothing to do with dryscrape, I'm closing this issue.

SKART1 commented 8 years ago

No, I am logged in, because I can open the same URL that differs only by one slash.