internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

How to connect db entries from the table "sites" to a belonging warc-file? #156

Open mxnx1 opened 5 years ago

mxnx1 commented 5 years ago

Hi brozzler-team,

I want to export the database entries that belong to a specific WARC file from the tables jobs, sites and pages. I know how to connect those tables to each other, but I couldn't find a connection to the table captures or directly to the corresponding WARC file.

Is the link made via the "WARC-Date" in the warcinfo record of the WARC file and "last_claimed" in the table sites?

A hint would be great. Thanks.

nlevitt commented 5 years ago

You can set the warc prefix using warcprox-meta as shown here: https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#using-warcprox-meta

If you don't, captures from all your jobs and sites will be mixed together in the same warcs.

mxnx1 commented 5 years ago

Thank you for your reply. I do use the warc-prefix, but I have several WARC files with the same prefix. They differ only in a timestamp and an id, both of which are generated automatically.

Example WARC file names:

Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz
Chile_Google_Search_Countries-20190526122808538-7aek1ud9-00000.warc.gz
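The file names above appear to follow a `prefix-timestamp-token-serial.warc.gz` layout, with a 17-digit timestamp (date, time and milliseconds). The following is a minimal sketch for splitting such a name into its parts, assuming that layout holds and that the timestamp is UTC (the WARC-Date in the warcinfo record further down agrees with it to the second). The function name `parse_warc_name` and the regular expression are illustrative, not part of brozzler or warcprox.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: <prefix>-<17-digit timestamp>-<token>-<serial>.warc.gz
WARC_NAME = re.compile(
    r"^(?P<prefix>.+)-(?P<ts>\d{17})-(?P<token>[a-z0-9]+)-(?P<serial>\d+)\.warc\.gz$"
)


def parse_warc_name(name):
    """Split a WARC file name into prefix, start time, token and serial."""
    m = WARC_NAME.match(name)
    if m is None:
        raise ValueError("unexpected WARC file name: " + name)
    ts = m.group("ts")
    # First 14 digits are YYYYMMDDHHMMSS, the last 3 are milliseconds.
    started = datetime.strptime(ts[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    started += timedelta(milliseconds=int(ts[14:]))
    return {
        "prefix": m.group("prefix"),
        "started": started,  # assumed to be UTC
        "token": m.group("token"),
        "serial": int(m.group("serial")),
    }


info = parse_warc_name(
    "Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz"
)
print(info["prefix"], info["started"].isoformat(), info["serial"])
```

The prefix alone only identifies the job or site configuration; the parsed start time is the part that could be compared against timestamps stored in the database.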


On the brozzler dashboard, navigation to the captured content goes via Jobs - Sites - Pages - Wayback, so the table entries are evidently connected to the corresponding WARC files.

I understand how the tables jobs, sites and pages are connected: via job_id and site_id. But I am wondering how brozzler connects the WARC files to its table entries (jobs, sites, pages).

I need this connection to export the information in jobs, sites and pages that belongs to each WARC file.

Can you tell me how brozzler connects the WARC files to its table entries in jobs, sites and pages?


Part of a sites entry:

"active_brozzling_time": 31.814205646514893 ,
"claimed": false ,
"cookie_db": <binary, 20.0KB, "53 51 4c 69 74 65..."> ,
"id": "7133eeeb-9e57-4ccf-837d-08e427c1a4fa" ,
"ignore_robots": true ,
"job_id": "google_search_countries_09062019" ,
"last_claimed": Sun Jun 09 2019 20:26:49 GMT+00:00 ,                
"last_claimed_by": "xxxxxxx" ,
"last_disclaimed": Sun Jun 09 2019 20:27:21 GMT+00:00 , 
....

"warcprox_meta": {
"warc-prefix": "Chile_Google_Search_Countries"                      
}

Example of a warcinfo record:

WARC/1.0
WARC-Record-ID:
WARC-Type: warcinfo
WARC-Filename: Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz
WARC-Date: 2019-06-09T20:26:50Z
Content-Type: application/warc-fields
Content-Length: 99

software: warcprox 2.4b6
hostname: xxxxxxx
ip: xxxxxxxx
format: WARC File Format 1.0
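To make the idea from my first comment concrete: in the sample data, the WARC-Date (20:26:50) falls between the site's last_claimed (20:26:49) and last_disclaimed (20:27:21). Below is a rough sketch of that time-window heuristic. It is only a guess at a workaround, not how brozzler itself links the two, and it would miss sites that were claimed more than once or WARC files that were started before the claim. The function names, the `slack` parameter and the second site in the example are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_warc_date(value):
    """Parse a WARC-Date such as 2019-06-09T20:26:50Z into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def candidate_sites(sites, warc_prefix, warc_date, slack=timedelta(seconds=5)):
    """Return ids of sites whose prefix matches and whose last claim window
    (widened by `slack`) contains the WARC start time. Heuristic only."""
    matches = []
    for site in sites:
        prefix = site.get("warcprox_meta", {}).get("warc-prefix")
        if prefix != warc_prefix:
            continue
        if site["last_claimed"] - slack <= warc_date <= site["last_disclaimed"] + slack:
            matches.append(site["id"])
    return matches


utc = timezone.utc
sites = [
    {   # values taken from the sites entry above
        "id": "7133eeeb-9e57-4ccf-837d-08e427c1a4fa",
        "last_claimed": datetime(2019, 6, 9, 20, 26, 49, tzinfo=utc),
        "last_disclaimed": datetime(2019, 6, 9, 20, 27, 21, tzinfo=utc),
        "warcprox_meta": {"warc-prefix": "Chile_Google_Search_Countries"},
    },
    {   # hypothetical second site, values invented for illustration
        "id": "hypothetical-site-id",
        "last_claimed": datetime(2019, 5, 26, 12, 28, 7, tzinfo=utc),
        "last_disclaimed": datetime(2019, 5, 26, 12, 28, 40, tzinfo=utc),
        "warcprox_meta": {"warc-prefix": "Chile_Google_Search_Countries"},
    },
]

warc_date = parse_warc_date("2019-06-09T20:26:50Z")
matched = candidate_sites(sites, "Chile_Google_Search_Countries", warc_date)
print(matched)
```

If there is a field in the database that records the WARC file name directly, that would of course be more reliable than matching on timestamps.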