Closed machawk1 closed 3 years ago
This may indicate that the logic of detecting whether a job has been built is not representative.
The above is incorrect -- the logic to detect whether a job has been built (get_current_stats()) is sound.
However, with just a crawler-beans.cxml, selecting Rebuild crawl job in the context menu for the job in the WAIL GUI does not, in fact, cause Heritrix to rebuild the job. This is apparent when selecting Rebuild crawl job and then examining the job's directory: no files or folders exist beyond the .cxml.
It appears that the "build" occurs for the Heritrix job but nothing is logged and no files are created. The launch button is also available after the build command is invoked from WAIL. How do we read whether a job has been built? Looking at the job's directory contents appears to be insufficient.
https://heritrix.readthedocs.io/en/latest/api.html says that the build API creates Java objects but there is no indication in the file system or job log.
Rather than looking to the file system for an indication of the job's status, we can again query the API, e.g., when unbuilt:
% curl -k -u lorem:ipsum --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/1619799533
<?xml version="1.0" standalone='yes'?>
<job>
<shortName>1619799533</shortName>
<statusDescription>Unbuilt</statusDescription>
<availableActions>
<value>build</value>
<value>launch</value>
</availableActions>
<launchCount>0</launchCount>
<lastLaunch>2021-04-30T16:19:11.908Z</lastLaunch>
<isProfile>false</isProfile>
<primaryConfig>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/crawler-beans.cxml</primaryConfig>
<primaryConfigUrl>https://localhost:8443/engine/job/1619799533/jobdir/crawler-beans.cxml</primaryConfigUrl>
<url>https://localhost:8443/engine/job/1619799533/job/1619799533</url>
<jobLogTail></jobLogTail>
<uriTotalsReport/>
<sizeTotalsReport>
<dupByHash>0</dupByHash>
<dupByHashCount>0</dupByHashCount>
<novel>0</novel>
<novelCount>0</novelCount>
<notModified>0</notModified>
<notModifiedCount>0</notModifiedCount>
<total>0</total>
<totalCount>0</totalCount>
</sizeTotalsReport>
<rateReport/>
<loadReport/>
<elapsedReport/>
<threadReport/>
<frontierReport/>
<crawlLogTail></crawlLogTail>
<configFiles></configFiles>
<isLaunchInfoPartial>false</isLaunchInfoPartial>
<isRunning>false</isRunning>
<isLaunchable>true</isLaunchable>
<hasApplicationContext>false</hasApplicationContext>
<alertCount>0</alertCount>
<checkpointFiles></checkpointFiles>
<reports></reports>
<heapReport>
<usedBytes>42298072</usedBytes>
<totalBytes>125304832</totalBytes>
<maxBytes>238551040</maxBytes>
</heapReport>
</job>
Once built (the build action is sent as a POST; switch the curl command back to a GET to query the status):
<?xml version="1.0" standalone='yes'?>
<job>
<shortName>1619799533</shortName>
<crawlControllerState>NASCENT</crawlControllerState>
<statusDescription>Ready</statusDescription>
<availableActions>
<value>launch</value>
<value>teardown</value>
</availableActions>
<launchCount>0</launchCount>
<lastLaunch>2021-04-30T16:19:11.908Z</lastLaunch>
<isProfile>false</isProfile>
<primaryConfig>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/crawler-beans.cxml</primaryConfig>
<primaryConfigUrl>https://localhost:8443/engine/job/1619799533/jobdir/crawler-beans.cxml</primaryConfigUrl>
<url>https://localhost:8443/engine/job/1619799533/job/1619799533</url>
<jobLogTail></jobLogTail>
<uriTotalsReport>
<downloadedUriCount>0</downloadedUriCount>
<queuedUriCount>0</queuedUriCount>
<totalUriCount>0</totalUriCount>
<futureUriCount>0</futureUriCount>
</uriTotalsReport>
<sizeTotalsReport>
<dupByHash>0</dupByHash>
<dupByHashCount>0</dupByHashCount>
<notModified>0</notModified>
<notModifiedCount>0</notModifiedCount>
<novel>0</novel>
<novelCount>0</novelCount>
<total>0</total>
<totalCount>0</totalCount>
</sizeTotalsReport>
<rateReport>
<currentDocsPerSecond>0.0</currentDocsPerSecond>
<averageDocsPerSecond>NaN</averageDocsPerSecond>
<currentKiBPerSec>0</currentKiBPerSec>
<averageKiBPerSec>0</averageKiBPerSec>
</rateReport>
<loadReport>
<busyThreads>0</busyThreads>
<totalThreads>0</totalThreads>
<congestionRatio>0.0</congestionRatio>
<averageQueueDepth>0</averageQueueDepth>
<deepestQueueDepth>-1</deepestQueueDepth>
</loadReport>
<elapsedReport>
<elapsedMilliseconds>0</elapsedMilliseconds>
<elapsedPretty>0ms</elapsedPretty>
</elapsedReport>
<threadReport/>
<frontierReport/>
<crawlLogTail></crawlLogTail>
<configFiles>
<value>
<key>loggerModule.crawlLogPath</key>
<name>crawl.log</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/crawl.log</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/crawl.log</url>
<editable>false</editable>
</value>
<value>
<key>actionDirectory.actionDir</key>
<name>ActionDirectory source directory</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/action</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/action</url>
<editable>false</editable>
</value>
<value>
<key>loggerModule.path</key>
<name>logs subdirectory</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs</url>
<editable>false</editable>
</value>
<value>
<key>loggerModule.nonfatalErrorsLogPath</key>
<name>nonfatal-errors.log</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/nonfatal-errors.log</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/nonfatal-errors.log</url>
<editable>false</editable>
</value>
<value>
<key>statisticsTracker.reportsDir</key>
<name>reports subdirectory</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/reports</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/reports</url>
<editable>false</editable>
</value>
<value>
<key>warcWriter.storePaths[0]</key>
<name>warcWriter.storePaths[0]</name>
<path>/Applications/WAIL.app/archives</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/archives</url>
<editable>false</editable>
</value>
<value>
<key>checkpointService.checkpointsDir</key>
<name>checkpoints subdirectory</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/checkpoints</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/checkpoints</url>
<editable>false</editable>
</value>
<value>
<key>warcWriter.defaultStorePaths[0]</key>
<name>warcs default store path</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/warcs</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/warcs</url>
<editable>false</editable>
</value>
<value>
<key>actionDirectory.doneDir</key>
<name>ActionDirectory done directory</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/actions-done</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/actions-done</url>
<editable>false</editable>
</value>
<value>
<key>crawlController.scratchDir</key>
<name>scratch subdirectory</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/scratch</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/scratch</url>
<editable>false</editable>
</value>
<value>
<key>warcWriter.directory</key>
<name>writer base path</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}</url>
<editable>false</editable>
</value>
<value>
<key>loggerModule.uriErrorsLogPath</key>
<name>uri-errors.log</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/uri-errors.log</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/uri-errors.log</url>
<editable>false</editable>
</value>
<value>
<key>bdb.dir</key>
<name>bdbmodule subdirectory</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/state</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/state</url>
<editable>false</editable>
</value>
<value>
<key>org.archive.modules.deciderules.surt.SurtPrefixedDecideRule#655490cd.surtsDumpFile</key>
<name>surtsDumpFile</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/negative-surts.dump</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/negative-surts.dump</url>
<editable>false</editable>
</value>
<value>
<key>loggerModule.progressLogPath</key>
<name>progress-statistics.log</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/progress-statistics.log</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/progress-statistics.log</url>
<editable>false</editable>
</value>
<value>
<key>org.archive.modules.deciderules.surt.SurtPrefixedDecideRule#4991a0c.surtsDumpFile</key>
<name>surtsDumpFile</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/surts.dump</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/surts.dump</url>
<editable>false</editable>
</value>
<value>
<key>loggerModule.alertsLogPath</key>
<name>alerts.log</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/alerts.log</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/alerts.log</url>
<editable>false</editable>
</value>
<value>
<key>loggerModule.runtimeErrorsLogPath</key>
<name>runtime-errors.log</name>
<path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/runtime-errors.log</path>
<url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/runtime-errors.log</url>
<editable>false</editable>
</value>
</configFiles>
<isLaunchInfoPartial>false</isLaunchInfoPartial>
<isRunning>false</isRunning>
<isLaunchable>true</isLaunchable>
<hasApplicationContext>true</hasApplicationContext>
<alertCount>0</alertCount>
<checkpointFiles></checkpointFiles>
<alertLogFilePath>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/./jobs/1619799533/${launchId}/logs/alerts.log</alertLogFilePath>
<reports>
<value>
<className>CrawlSummaryReport</className>
<shortName>CrawlSummary</shortName>
</value>
<value>
<className>SeedsReport</className>
<shortName>Seeds</shortName>
</value>
<value>
<className>HostsReport</className>
<shortName>Hosts</shortName>
</value>
<value>
<className>SourceTagsReport</className>
<shortName>SourceTags</shortName>
</value>
<value>
<className>MimetypesReport</className>
<shortName>Mimetypes</shortName>
</value>
<value>
<className>ResponseCodeReport</className>
<shortName>ResponseCode</shortName>
</value>
<value>
<className>ProcessorsReport</className>
<shortName>Processors</shortName>
</value>
<value>
<className>FrontierSummaryReport</className>
<shortName>FrontierSummary</shortName>
</value>
<value>
<className>ToeThreadsReport</className>
<shortName>ToeThreads</shortName>
</value>
</reports>
<heapReport>
<usedBytes>72155432</usedBytes>
<totalBytes>151519232</totalBytes>
<maxBytes>238551040</maxBytes>
</heapReport>
</job>
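The difference between the two responses above can be read programmatically. The following is a minimal sketch, not WAIL's actual implementation: the function names parse_job_status and job_is_built are hypothetical, the sample strings are trimmed copies of the responses above, and treating hasApplicationContext (with statusDescription as a fallback) as the "built" signal is an assumption based only on these two responses.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two API responses shown above (illustrative only).
UNBUILT_XML = """<?xml version="1.0" standalone='yes'?>
<job>
<shortName>1619799533</shortName>
<statusDescription>Unbuilt</statusDescription>
<hasApplicationContext>false</hasApplicationContext>
</job>"""

BUILT_XML = """<?xml version="1.0" standalone='yes'?>
<job>
<shortName>1619799533</shortName>
<crawlControllerState>NASCENT</crawlControllerState>
<statusDescription>Ready</statusDescription>
<hasApplicationContext>true</hasApplicationContext>
</job>"""


def parse_job_status(xml_text):
    """Return (statusDescription, hasApplicationContext) from a job XML response."""
    root = ET.fromstring(xml_text)
    # findtext returns None when the tag is absent, so fall back to "".
    status = (root.findtext("statusDescription") or "").strip()
    context = (root.findtext("hasApplicationContext") or "").strip().lower()
    return status, context == "true"


def job_is_built(xml_text):
    """Assumption: a job counts as built when it has an application context,
    or, failing that, when its status is present and is not "Unbuilt"."""
    status, has_context = parse_job_status(xml_text)
    return has_context or status not in ("", "Unbuilt")


if __name__ == "__main__":
    print(parse_job_status(UNBUILT_XML), job_is_built(UNBUILT_XML))
    print(parse_job_status(BUILT_XML), job_is_built(BUILT_XML))
```

Fetching the XML is deliberately left out of the sketch; the text passed in would be the body returned by the GET request shown in the curl command above.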
Tasks:
statusDescription tag. issue-492 branch
In Advanced>Heritrix, if a crawl job has been created but not yet built (crawler-beans.cxml is the sole file), and "Rebuild crawl job" is selected from the contextual menu, the information about the job in the right column still reflects that the job is NOT BUILT.
Viewing this job in the Heritrix UI shows that it was in fact built when the command was invoked from the WAIL UI.
Current main branch, macOS 11.1, Python 3.9, tested via built .app and from source.