trthanhquang / wayback-data-collector

Software Data Collection Using Wayback Machine
Apache License 2.0

Browser does not support frames #5

Open bilun167 opened 10 years ago

bilun167 commented 10 years ago

The crawled HTML does not contain the desired information. Instead, the crawler captures a different HTML document containing the text "Browser does not support frames".

Example: itemID = 2791, URL: http://web.archive.org/web/20080615155441/http://www.limagito.com/

drtagkim commented 10 years ago

In the page source, you can find the following tags: FRAME and NOFRAMES.

Because the NOFRAMES fallback is handled first during rendering, the crawler captures that fallback text instead of the actual page, whose content is inside the FRAME.

The solution is simple: create an algorithm that first checks for FRAME tags and reads each one's NAME attribute. You can then use the value of NAME to switch into the frame, as follows:

pb = PhantomBrowser()
...
pb.driver.switch_to_frame("mainwindow")  # if the frame's name is "mainwindow"

To check the result, save the page source:

pb.page_source_save("c:/wayback_issue.html")
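The FRAME/NAME check described above could be sketched roughly as follows. This is only an illustration using Python's standard `html.parser`, not the project's actual code: the names `FrameNameCollector`, `find_frame_names`, and `SAMPLE` are hypothetical, and the sample markup is a made-up stand-in for a frameset page.

```python
from html.parser import HTMLParser


class FrameNameCollector(HTMLParser):
    """Collects the NAME attribute of every FRAME/IFRAME tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.frame_names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <FRAME NAME=...> matches too.
        if tag in ("frame", "iframe"):
            name = dict(attrs).get("name")
            if name:
                self.frame_names.append(name)


def find_frame_names(html):
    """Return the frame names found in html, in document order."""
    collector = FrameNameCollector()
    collector.feed(html)
    collector.close()
    return collector.frame_names


# Made-up frameset page, used only to illustrate the check.
SAMPLE = """<HTML><FRAMESET ROWS="*">
<FRAME NAME="mainwindow" SRC="main.html">
<NOFRAMES><BODY>Browser does not support frames</BODY></NOFRAMES>
</FRAMESET></HTML>"""

if __name__ == "__main__":
    for name in find_frame_names(SAMPLE):
        print(name)
```

Each name found this way could then be passed to `pb.driver.switch_to_frame(name)` before saving the page source. An empty result would suggest the page has no named frames and can be captured as it is.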

drtagkim commented 10 years ago

A new version of Web3 has been uploaded. Usage:

pb = PhantomBrowser()
pb.goto(url, frame_switch=True)