trthanhquang / wayback-data-collector

Software Data Collection Using Wayback Machine
Apache License 2.0
2 stars, 0 forks

Option 2: Collect main pages yearly #2

Open wyrmmm opened 10 years ago

wyrmmm commented 10 years ago

Hi Prof,

I'm not sure what Option 2 ("Collect main pages yearly") does. I ran it and gave it an input file path, but what output is it supposed to produce?

kyanrong commented 10 years ago

I got a .wayback file for each year in the CSV file. (Somehow some .wayback files are missing: the number of .wayback files does not match the number of lines in the CSV file.)

raeyeap commented 10 years ago

What is this .wayback file? How do we see what it contains?

drtagkim commented 10 years ago

Dear all: The module "waybackmachine.py" contains several data classes, such as PageNode and PageData. A file with the ".wayback" extension is a pickle file (i.e., a serialized binary object). Find the PageData class and examine its function "constructNodeExportPickle()". To examine a .wayback file, you may want to write something like the following:

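    import pickle
    # waybackmachine.py must be importable so pickle can rebuild the object.
    from waybackmachine import PageNode

    file_name = "any.wayback"
    # Pickle data is binary, so open the file in "rb" mode.
    with open(file_name, "rb") as f:
        anyWaybackPageNode = pickle.load(f)
    print(anyWaybackPageNode.url)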
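If you do not yet know which attributes a loaded object carries, a small round trip may help. The following is only a sketch under assumptions: DemoNode is a hypothetical stand-in, not the real PageNode, and its "year" attribute and the file name "demo.wayback" are made up for illustration. It writes a pickle file, reads it back in binary mode, and lists the instance attributes with vars().

```python
import os
import pickle
import tempfile


class DemoNode(object):
    """Hypothetical stand-in for PageNode; the real class lives in waybackmachine.py."""

    def __init__(self, url, year):
        self.url = url
        self.year = year  # assumed attribute, for illustration only


def load_pickle(path):
    """Load one pickled object from a file opened in binary mode."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj):
    """Return a copy of the object's instance attributes as a plain dict."""
    return dict(vars(obj))


# Round trip: write a demo object to a temporary file, then read it back.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wayback")
with open(demo_path, "wb") as f:
    pickle.dump(DemoNode("http://example.com", 2005), f)

loaded = load_pickle(demo_path)
attributes = describe(loaded)

print(type(loaded).__name__)
for name in sorted(attributes):
    print("%s = %r" % (name, attributes[name]))
```

With a real .wayback file, you would presumably call load_pickle() on its path (with waybackmachine.py importable) and pass the result to describe(). Only unpickle files you trust, since pickle.load can execute arbitrary code from a malicious file.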

Do not rush. Take your time examining the source code. I expect you guys to fix several issues and errors in this half-baked library. Also, the console application (haha) is merely an example for testing waybackmachine.py. You may learn about the class relationships from it.