trthanhquang / wayback-data-collector

Software Data Collection Using Wayback Machine
Apache License 2.0

What I expect from you #4

Open drtagkim opened 10 years ago

drtagkim commented 10 years ago

Hello all:

For the past two weeks I have received lots of messages from you. Thanks, I am glad to hear from you. But most of those messages were far from what I had expected.

As I told you when we met, waybackmachine.py has many problems. That code was only used to test basic concepts and to work out a strategy for scraping data from the Wayback Machine. STOP TESTING IT. I will complete it later.

The provided code is meant to help you learn the basic concepts and procedures for interacting with the Wayback Machine when you collect price information. I believe you can create your own version to accomplish your mission, but for now, just adopt or adapt some of the algorithms inside waybackmachine.py. I also uploaded web3.py; please look into it. If you do not know BeautifulSoup and CSS, spend some time googling them.
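For illustration only (this is not one of the uploaded scripts), extracting a price with BeautifulSoup and a CSS selector looks roughly like the sketch below; the selector `span.price` is just a placeholder for whatever the target site actually uses:

```python
import requests
from bs4 import BeautifulSoup

def extract_price(snapshot_url):
    """Fetch one archived page and pull out the price text, if present."""
    html = requests.get(snapshot_url).text
    soup = BeautifulSoup(html, "html.parser")
    # "span.price" is a hypothetical selector; inspect the real page to find it.
    matches = soup.select("span.price")
    return matches[0].get_text().strip() if matches else None
```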

Based on waybackmachine.py, I wrote a small class (see wm2.py). Please stop playing with my toy and focus on your original objective. If you have constructive questions, I will reply as soon as possible. Think wisely, and good luck.

wyrmmm commented 10 years ago

Some websites are archived multiple times a day. Visiting the product page for every archived capture seems impractical, since a product's price probably won't change multiple times a day. What is the minimum number of snapshots we should visit to collect the information? Can we do it once a month? Is that enough?

bilun167 commented 10 years ago

As I recall, the requirement is to collect one snapshot per day (if a snapshot is available). Please correct me if I am wrong.

bilun167 commented 10 years ago

@chubbychubs Dr. Kim suggested the following approach: first crawl snapshots yearly, then compare them to spot any difference in content. If there is a difference, crawl another snapshot within that time range; if there is no change, there is no need to extract any snapshot between those two.
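To make that concrete, here is a rough sketch of the refinement idea (not Dr. Kim's actual code; the `(timestamp, url)` pairs and the `fetch_content` helper are assumptions):

```python
def collect_changes(snapshots, fetch_content, lo=0, hi=None, out=None):
    """Recursively fetch only the snapshots needed to locate content changes.

    `snapshots` is a time-sorted list of (timestamp, url) pairs and
    `fetch_content(url)` is a hypothetical helper returning the piece of
    content we care about (e.g. the extracted price) for one snapshot.
    """
    if out is None:
        out = {}
    if hi is None:
        hi = len(snapshots) - 1
    if hi <= lo:
        return out
    for ts, url in (snapshots[lo], snapshots[hi]):
        if ts not in out:
            out[ts] = fetch_content(url)
    # Only drill into the interval if the endpoints differ and there is
    # at least one snapshot in between; otherwise skip everything inside.
    if out[snapshots[lo][0]] != out[snapshots[hi][0]] and hi - lo > 1:
        mid = (lo + hi) // 2
        collect_changes(snapshots, fetch_content, lo, mid, out)
        collect_changes(snapshots, fetch_content, mid, hi, out)
    return out
```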

wyrmmm commented 10 years ago

@bilun167 The thing is, I intend to crawl archive.org and compile all the snapshots first, before running the program to look for the products and prices in those snapshots. That seems easier to code. What do you think?
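Roughly, I am thinking of listing the captures through the Wayback Machine CDX API first, something like the sketch below (the target URL and date range are placeholders, and daily collapsing is just one option):

```python
import requests

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def list_snapshots(target_url, year_from="2010", year_to="2014"):
    """Return (timestamp, archived_url) pairs, at most one capture per day."""
    params = {
        "url": target_url,
        "from": year_from,
        "to": year_to,
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:8",  # first 8 digits = YYYYMMDD, i.e. daily
    }
    resp = requests.get(CDX_ENDPOINT, params=params)
    rows = resp.json() if resp.text.strip() else []
    if len(rows) < 2:  # first row is the header, so <2 means no captures
        return []
    header, entries = rows[0], rows[1:]
    ts, original = header.index("timestamp"), header.index("original")
    return [(e[ts], "http://web.archive.org/web/%s/%s" % (e[ts], e[original]))
            for e in entries]
```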

bilun167 commented 10 years ago

@chubbychubs Same approach on my side. I've checked parts of your code; it is pretty similar in idea. I suggest that you create an intermediate database to store your crawled data (so that you have a backup once you start experimenting with extracting product features and prices).
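For the intermediate store, something as simple as a sqlite database from the standard library would do; a minimal sketch (the table layout is just an example, not anything in the repo):

```python
import sqlite3

def open_store(path="snapshots.db"):
    """Open (or create) a local store keyed by page URL and capture timestamp."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS snapshots (
                        url TEXT,
                        timestamp TEXT,
                        html TEXT,
                        PRIMARY KEY (url, timestamp))""")
    return conn

def save_snapshot(conn, url, timestamp, html):
    """Insert or overwrite one raw snapshot so extraction can be rerun later."""
    conn.execute("INSERT OR REPLACE INTO snapshots VALUES (?, ?, ?)",
                 (url, timestamp, html))
    conn.commit()
```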