oduwsdl / CarbonDate

Estimating the age of web resources
MIT License
91 stars, 11 forks

Bulk carbon-dating feature #31

Open cjer opened 5 years ago

cjer commented 5 years ago

I am looking to carbon-date a list of tens or hundreds of thousands of URIs. I am currently running a short (and admittedly inefficient) script I wrote that runs main.py once per URI, reading the URIs from a line-separated text file. Is there something in this repository or elsewhere that already does this, or something similar, that I am missing?

Thanks!

ibnesayeed commented 5 years ago

The tool is designed to handle one URI at a time in CLI mode, which keeps the code logic and the response structure simple. Besides, we don't see any performance benefit in having the tool accept multiple URIs or an input file as a parameter: each URI is processed independently and takes quite a long time, so the time needed to boot the script is negligible in comparison.

anwala commented 5 years ago

I think another option is to run it in server mode and then make parallel requests against the server, e.g., with 5 threads, depending on the capabilities of your machine. You may want to check from time to time that the server is still alive and restart it if it has gone down. I assume you're saving each response to its own file rather than all in one file, so that you can restart without losing data.
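
A rough sketch of what that could look like, in case it helps. Everything here is an assumption rather than something the tool ships: the `SERVER_TEMPLATE` URL is a placeholder you would adjust to your own server's address and route, and `fetch_from_server`, `output_path`, and `carbon_date_all` are hypothetical helper names. Each response goes to its own file (named by a hash of the URI), and URIs that already have a file are skipped, so a rerun picks up where the last one stopped.

```python
import hashlib
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

# Assumption: placeholder address and route; change this to match how your
# local server is actually started and which URL pattern it exposes.
SERVER_TEMPLATE = "http://localhost:8888/cd/{uri}"


def fetch_from_server(uri, timeout=300):
    """Request one URI from the local server and return the raw response body."""
    url = SERVER_TEMPLATE.format(uri=uri)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def output_path(out_dir, uri):
    """One file per URI, named by a hash so any URI maps to a safe filename."""
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()
    return Path(out_dir) / f"{digest}.out"


def carbon_date_all(uris, out_dir, fetch=fetch_from_server, workers=5):
    """Fetch every URI that has no saved file yet; return {uri: error} for failures."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pending = [u for u in uris if not output_path(out_dir, u).exists()]
    failures = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, u): u for u in pending}
        for future in as_completed(futures):
            uri = futures[future]
            try:
                body = future.result()
            except Exception as exc:  # e.g. the server is down or timed out
                failures[uri] = repr(exc)
                continue
            # Written only on success, so failed URIs are retried on the next run.
            output_path(out_dir, uri).write_text(body, encoding="utf-8")
    return failures
```

If the returned dictionary starts filling up with connection errors, that is probably the cue to check whether the server is still running, restart it, and call `carbon_date_all` again with the same list.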

ibnesayeed commented 5 years ago

Parallel processing is possible in both server and one-off modes. However, keep in mind that this is a network-intensive task, not a processor-intensive one. This means that many parallel requests to the various upstream services might trigger rate limiting.
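
One way to reduce that risk is to space out request starts across all workers. The sketch below is only an illustration: `Throttle` is a hypothetical helper, not part of the tool, and the right interval is a guess you would have to tune, since the upstream services' actual limits are not stated here.

```python
import threading
import time


class Throttle:
    """Let at most one wait() call through per `interval` seconds, across threads."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = None  # earliest time the next request may start

    def wait(self):
        # Reserve the next free slot while holding the lock, then sleep outside
        # it so other threads can reserve later slots in the meantime.
        with self._lock:
            now = self._clock()
            slot = now if self._next_slot is None else max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Create a single shared instance, e.g. `throttle = Throttle(2.0)`, and call `throttle.wait()` immediately before each request in every worker thread. The workers still overlap while waiting on the network, but new requests start no more often than once per interval.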