Open mxnx1 opened 5 years ago
You can set the warc prefix using warcprox-meta as shown here: https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#using-warcprox-meta
If you don't, captures from all your jobs and sites will be mixed together in the same warcs.
Thank you for your reply. I do use the warc-prefix, but I end up with several WARC files that share the same prefix and differ only in a timestamp and an ID, both of which are generated automatically.
Example WARC file names:
Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz
Chile_Google_Search_Countries-20190526122808538-7aek1ud9-00000.warc.gz
On the brozzler dashboard, navigation to the captured content goes via Jobs - Sites - Pages - Wayback, so the table entries are clearly connected to the corresponding WARC files.
I understand how the jobs, sites and pages tables are connected (via job_id and site_id), but I am wondering how brozzler connects the WARC files to its table entries (jobs, sites, pages).
I can imagine an improvised connection, but it is not very explicit: combine the warc-prefix with the "last_claimed" date from the sites table and find the matching WARC file via its filename or its WARC-Date. However, the WARC-Date (warcinfo record) and last_claimed (sites table) are not identical; they differ by one second.
I am missing an explicit corresponding field.
I need this connection to export the information in jobs, sites and pages that belongs to each WARC file from the database.
Can you tell me how brozzler connects the WARC files to its table entries in jobs, sites and pages?
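To make the improvised idea concrete, here is a minimal Python sketch of the filename-based matching. It is only a heuristic under stated assumptions, not how brozzler itself links anything: the filename layout (prefix, 17-digit timestamp, random ID, serial) is inferred from the two example names alone, the timestamp is assumed to be UTC, and the names `parse_warc_filename`, `likely_same_crawl` and the 5-second tolerance are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import NamedTuple

# Assumed layout, inferred only from the example file names:
#   <prefix>-<17-digit timestamp>-<random id>-<serial>.warc.gz
WARC_NAME = re.compile(
    r"(?P<prefix>.+)-(?P<ts>\d{17})-(?P<rand>[a-z0-9]+)-(?P<serial>\d+)\.warc\.gz"
)


class WarcName(NamedTuple):
    prefix: str
    created: datetime
    rand: str
    serial: int


def parse_warc_filename(name: str) -> WarcName:
    """Split a WARC file name into prefix, creation time, random id and serial."""
    m = WARC_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected WARC file name: {name!r}")
    ts = m["ts"]
    # First 14 digits are YYYYmmddHHMMSS, the last 3 are milliseconds.
    # Treating the timestamp as UTC is an assumption.
    created = datetime.strptime(ts[:14], "%Y%m%d%H%M%S").replace(
        microsecond=int(ts[14:]) * 1000, tzinfo=timezone.utc
    )
    return WarcName(m["prefix"], created, m["rand"], int(m["serial"]))


def likely_same_crawl(
    name: str,
    warc_prefix: str,
    last_claimed: datetime,
    tolerance: timedelta = timedelta(seconds=5),  # hypothetical tolerance
) -> bool:
    """Heuristic: same prefix and a file timestamp close to last_claimed."""
    try:
        parsed = parse_warc_filename(name)
    except ValueError:
        return False
    return parsed.prefix == warc_prefix and abs(parsed.created - last_claimed) <= tolerance
```

With the values from this thread, the first example file would match the site claimed at 20:26:49 and the second would not. The match may break down when a site is claimed more than once or when one crawl rolls over into several WARC files, which is why an explicit field would be preferable.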
part of sites entry:
"active_brozzling_time": 31.814205646514893 ,
"claimed": false ,
"cookie_db": <binary, 20.0KB, "53 51 4c 69 74 65..."> ,
"id": "7133eeeb-9e57-4ccf-837d-08e427c1a4fa" ,
"ignore_robots": true ,
"job_id": "google_search_countries_09062019" ,
"last_claimed": Sun Jun 09 2019 20:26:49 GMT+00:00 ,
"last_claimed_by": "xxxxxxx" ,
"last_disclaimed": Sun Jun 09 2019 20:27:21 GMT+00:00 ,
....
"warcprox_meta": {
"warc-prefix": "Chile_Google_Search_Countries"
}
example of warcinfo record:
WARC/1.0
WARC-Record-ID:
WARC-Date: 2019-06-09T20:26:50Z
Content-Type: application/warc-fields
Content-Length: 99
software: warcprox 2.4b6
hostname: xxxxxxx
ip: xxxxxxxx
format: WARC File Format 1.0
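For the WARC-Date side of the comparison, the date can be read from the first record of a .warc.gz file with the Python standard library alone. This is a rough sketch, assuming the first record in the file is the warcinfo record (as in the example above) and that WARC-Date has one-second resolution; the function names `first_record_headers` and `warc_date` are hypothetical.

```python
import gzip
from datetime import datetime, timezone


def first_record_headers(path: str) -> dict:
    """Return the header fields of the first record in a gzipped WARC file."""
    headers = {}
    with gzip.open(path, "rb") as f:
        version = f.readline().strip()
        if not version.startswith(b"WARC/"):
            raise ValueError(f"{path} does not start with a WARC record")
        for raw in f:
            line = raw.strip()
            if not line:  # a blank line ends the record header block
                break
            name, _, value = line.partition(b":")
            headers[name.decode("utf-8").lower()] = value.strip().decode("utf-8")
    return headers


def warc_date(path: str) -> datetime:
    """Parse WARC-Date of the first record (assumed to be the warcinfo record)."""
    value = first_record_headers(path)["warc-date"]
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
```

The returned datetime can then be compared to last_claimed from the sites table with a small tolerance, since the two values above differ by one second.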
Hi brozzler-team,
I want to export the database entries from the jobs, sites and pages tables that belong to a specific WARC file. I know how to connect those tables to each other, but I couldn't find a connection to the captures table or directly to the corresponding WARC file.
Does it work via the "WARC-Date" in the warcinfo record of the WARC file and "last_claimed" in the sites table?
A hint would be great. Thanks.