machawk1 / wail

:whale2: Web Archiving Integration Layer: One-Click User Instigated Preservation
https://matkelly.com/wail
MIT License

Building a job via the GUI does not update adjacent status #492

Closed machawk1 closed 3 years ago

machawk1 commented 3 years ago

In Advanced>Heritrix, if a crawl job has been created but not yet built (crawler-beans.cxml is the sole file), and "Rebuild crawl job" is selected from the contextual menu, the information about the job in the right column still reflects that the job is NOT BUILT.

Viewing this job in the Heritrix UI shows that it was in fact built when the command was invoked from the WAIL UI.

Current main branch, macOS 11.1, Python 3.9, tested via built .app and from source.

machawk1 commented 3 years ago

This may indicate that the logic for detecting whether a job has been built does not accurately reflect the job's actual state.

machawk1 commented 3 years ago

The above is incorrect -- the logic to detect whether a job has been built (get_current_stats()) is sound.

However, with just a crawler-beans.cxml, selecting Rebuild crawl job in the context menu for the job in the WAIL GUI does not, in fact, cause Heritrix to rebuild the job. This is apparent when selecting Rebuild crawl job and then examining the job's directory: no files or folders exist beyond the .cxml.

machawk1 commented 3 years ago

It appears that the "build" occurs for the Heritrix job but nothing is logged and no files are created. The launch button is also available after the build command is invoked from WAIL. How do we read whether a job has been built? Looking at the job's directory contents appears to be insufficient.

machawk1 commented 3 years ago

https://heritrix.readthedocs.io/en/latest/api.html says that the build API creates Java objects but there is no indication in the file system or job log.

machawk1 commented 3 years ago

Rather than looking to the file system for an indication of the job's status, we can again query the API, e.g., when unbuilt:


% curl -k -u lorem:ipsum --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/1619799533
<?xml version="1.0" standalone='yes'?>

<job>
  <shortName>1619799533</shortName>
  <statusDescription>Unbuilt</statusDescription>
  <availableActions>
    <value>build</value>
    <value>launch</value>
  </availableActions>
  <launchCount>0</launchCount>
  <lastLaunch>2021-04-30T16:19:11.908Z</lastLaunch>
  <isProfile>false</isProfile>
  <primaryConfig>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/crawler-beans.cxml</primaryConfig>
  <primaryConfigUrl>https://localhost:8443/engine/job/1619799533/jobdir/crawler-beans.cxml</primaryConfigUrl>
  <url>https://localhost:8443/engine/job/1619799533/job/1619799533</url>
  <jobLogTail></jobLogTail>
  <uriTotalsReport/>
  <sizeTotalsReport>
    <dupByHash>0</dupByHash>
    <dupByHashCount>0</dupByHashCount>
    <novel>0</novel>
    <novelCount>0</novelCount>
    <notModified>0</notModified>
    <notModifiedCount>0</notModifiedCount>
    <total>0</total>
    <totalCount>0</totalCount>
  </sizeTotalsReport>
  <rateReport/>
  <loadReport/>
  <elapsedReport/>
  <threadReport/>
  <frontierReport/>
  <crawlLogTail></crawlLogTail>
  <configFiles></configFiles>
  <isLaunchInfoPartial>false</isLaunchInfoPartial>
  <isRunning>false</isRunning>
  <isLaunchable>true</isLaunchable>
  <hasApplicationContext>false</hasApplicationContext>
  <alertCount>0</alertCount>
  <checkpointFiles></checkpointFiles>
  <reports></reports>
  <heapReport>
    <usedBytes>42298072</usedBytes>
    <totalBytes>125304832</totalBytes>
    <maxBytes>238551040</maxBytes>
  </heapReport>
</job>

Once built (the build itself is triggered with a POST carrying action=build; the status is then read with the same GET as above):

<?xml version="1.0" standalone='yes'?>

<job>
  <shortName>1619799533</shortName>
  <crawlControllerState>NASCENT</crawlControllerState>
  <statusDescription>Ready</statusDescription>
  <availableActions>
    <value>launch</value>
    <value>teardown</value>
  </availableActions>
  <launchCount>0</launchCount>
  <lastLaunch>2021-04-30T16:19:11.908Z</lastLaunch>
  <isProfile>false</isProfile>
  <primaryConfig>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/crawler-beans.cxml</primaryConfig>
  <primaryConfigUrl>https://localhost:8443/engine/job/1619799533/jobdir/crawler-beans.cxml</primaryConfigUrl>
  <url>https://localhost:8443/engine/job/1619799533/job/1619799533</url>
  <jobLogTail></jobLogTail>
  <uriTotalsReport>
    <downloadedUriCount>0</downloadedUriCount>
    <queuedUriCount>0</queuedUriCount>
    <totalUriCount>0</totalUriCount>
    <futureUriCount>0</futureUriCount>
  </uriTotalsReport>
  <sizeTotalsReport>
    <dupByHash>0</dupByHash>
    <dupByHashCount>0</dupByHashCount>
    <notModified>0</notModified>
    <notModifiedCount>0</notModifiedCount>
    <novel>0</novel>
    <novelCount>0</novelCount>
    <total>0</total>
    <totalCount>0</totalCount>
  </sizeTotalsReport>
  <rateReport>
    <currentDocsPerSecond>0.0</currentDocsPerSecond>
    <averageDocsPerSecond>NaN</averageDocsPerSecond>
    <currentKiBPerSec>0</currentKiBPerSec>
    <averageKiBPerSec>0</averageKiBPerSec>
  </rateReport>
  <loadReport>
    <busyThreads>0</busyThreads>
    <totalThreads>0</totalThreads>
    <congestionRatio>0.0</congestionRatio>
    <averageQueueDepth>0</averageQueueDepth>
    <deepestQueueDepth>-1</deepestQueueDepth>
  </loadReport>
  <elapsedReport>
    <elapsedMilliseconds>0</elapsedMilliseconds>
    <elapsedPretty>0ms</elapsedPretty>
  </elapsedReport>
  <threadReport/>
  <frontierReport/>
  <crawlLogTail></crawlLogTail>
  <configFiles>
    <value>
      <key>loggerModule.crawlLogPath</key>
      <name>crawl.log</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/crawl.log</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/crawl.log</url>
      <editable>false</editable>
    </value>
    <value>
      <key>actionDirectory.actionDir</key>
      <name>ActionDirectory source directory</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/action</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/action</url>
      <editable>false</editable>
    </value>
    <value>
      <key>loggerModule.path</key>
      <name>logs subdirectory</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs</url>
      <editable>false</editable>
    </value>
    <value>
      <key>loggerModule.nonfatalErrorsLogPath</key>
      <name>nonfatal-errors.log</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/nonfatal-errors.log</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/nonfatal-errors.log</url>
      <editable>false</editable>
    </value>
    <value>
      <key>statisticsTracker.reportsDir</key>
      <name>reports subdirectory</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/reports</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/reports</url>
      <editable>false</editable>
    </value>
    <value>
      <key>warcWriter.storePaths[0]</key>
      <name>warcWriter.storePaths[0]</name>
      <path>/Applications/WAIL.app/archives</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/archives</url>
      <editable>false</editable>
    </value>
    <value>
      <key>checkpointService.checkpointsDir</key>
      <name>checkpoints subdirectory</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/checkpoints</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/checkpoints</url>
      <editable>false</editable>
    </value>
    <value>
      <key>warcWriter.defaultStorePaths[0]</key>
      <name>warcs default store path</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/warcs</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/warcs</url>
      <editable>false</editable>
    </value>
    <value>
      <key>actionDirectory.doneDir</key>
      <name>ActionDirectory done directory</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/actions-done</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/actions-done</url>
      <editable>false</editable>
    </value>
    <value>
      <key>crawlController.scratchDir</key>
      <name>scratch subdirectory</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/scratch</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/scratch</url>
      <editable>false</editable>
    </value>
    <value>
      <key>warcWriter.directory</key>
      <name>writer base path</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}</url>
      <editable>false</editable>
    </value>
    <value>
      <key>loggerModule.uriErrorsLogPath</key>
      <name>uri-errors.log</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/uri-errors.log</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/uri-errors.log</url>
      <editable>false</editable>
    </value>
    <value>
      <key>bdb.dir</key>
      <name>bdbmodule subdirectory</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/state</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/state</url>
      <editable>false</editable>
    </value>
    <value>
      <key>org.archive.modules.deciderules.surt.SurtPrefixedDecideRule#655490cd.surtsDumpFile</key>
      <name>surtsDumpFile</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/negative-surts.dump</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/negative-surts.dump</url>
      <editable>false</editable>
    </value>
    <value>
      <key>loggerModule.progressLogPath</key>
      <name>progress-statistics.log</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/progress-statistics.log</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/progress-statistics.log</url>
      <editable>false</editable>
    </value>
    <value>
      <key>org.archive.modules.deciderules.surt.SurtPrefixedDecideRule#4991a0c.surtsDumpFile</key>
      <name>surtsDumpFile</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/surts.dump</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/surts.dump</url>
      <editable>false</editable>
    </value>
    <value>
      <key>loggerModule.alertsLogPath</key>
      <name>alerts.log</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/alerts.log</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/alerts.log</url>
      <editable>false</editable>
    </value>
    <value>
      <key>loggerModule.runtimeErrorsLogPath</key>
      <name>runtime-errors.log</name>
      <path>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/runtime-errors.log</path>
      <url>https://localhost:8443/engine/job/1619799533/engine/anypath//Applications/WAIL.app/bundledApps/heritrix-3.2.0/jobs/1619799533/${launchId}/logs/runtime-errors.log</url>
      <editable>false</editable>
    </value>
  </configFiles>
  <isLaunchInfoPartial>false</isLaunchInfoPartial>
  <isRunning>false</isRunning>
  <isLaunchable>true</isLaunchable>
  <hasApplicationContext>true</hasApplicationContext>
  <alertCount>0</alertCount>
  <checkpointFiles></checkpointFiles>
  <alertLogFilePath>/Applications/WAIL.app/bundledApps/heritrix-3.2.0/./jobs/1619799533/${launchId}/logs/alerts.log</alertLogFilePath>
  <reports>
    <value>
      <className>CrawlSummaryReport</className>
      <shortName>CrawlSummary</shortName>
    </value>
    <value>
      <className>SeedsReport</className>
      <shortName>Seeds</shortName>
    </value>
    <value>
      <className>HostsReport</className>
      <shortName>Hosts</shortName>
    </value>
    <value>
      <className>SourceTagsReport</className>
      <shortName>SourceTags</shortName>
    </value>
    <value>
      <className>MimetypesReport</className>
      <shortName>Mimetypes</shortName>
    </value>
    <value>
      <className>ResponseCodeReport</className>
      <shortName>ResponseCode</shortName>
    </value>
    <value>
      <className>ProcessorsReport</className>
      <shortName>Processors</shortName>
    </value>
    <value>
      <className>FrontierSummaryReport</className>
      <shortName>FrontierSummary</shortName>
    </value>
    <value>
      <className>ToeThreadsReport</className>
      <shortName>ToeThreads</shortName>
    </value>
  </reports>
  <heapReport>
    <usedBytes>72155432</usedBytes>
    <totalBytes>151519232</totalBytes>
    <maxBytes>238551040</maxBytes>
  </heapReport>
</job>
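
A minimal sketch of how the two responses above could be told apart in Python, assuming the job XML has already been fetched from the API as a string. The function names parse_job_status and is_job_built are hypothetical and are not part of WAIL or Heritrix. The rule used (hasApplicationContext is true and statusDescription is not Unbuilt) is inferred only from the two responses shown here and may not cover every job state.

```python
import xml.etree.ElementTree as ET


def parse_job_status(job_xml):
    """Pull the build-related fields out of a Heritrix job XML response.

    Hypothetical helper: the element names come from the responses in
    this issue, not from a published schema.
    """
    root = ET.fromstring(job_xml)
    status = (root.findtext("statusDescription") or "").strip()
    has_context = (
        (root.findtext("hasApplicationContext") or "").strip().lower() == "true"
    )
    actions = [
        value.text.strip()
        for value in root.findall("./availableActions/value")
        if value.text
    ]
    return {"status": status, "has_context": has_context, "actions": actions}


def is_job_built(job_xml):
    """Return True when the response looks like a built job (assumed rule)."""
    info = parse_job_status(job_xml)
    return info["has_context"] and info["status"] != "Unbuilt"


# Trimmed-down versions of the two responses above, for illustration only.
UNBUILT_SAMPLE = """<job>
  <shortName>1619799533</shortName>
  <statusDescription>Unbuilt</statusDescription>
  <availableActions>
    <value>build</value>
    <value>launch</value>
  </availableActions>
  <hasApplicationContext>false</hasApplicationContext>
</job>"""

BUILT_SAMPLE = """<job>
  <shortName>1619799533</shortName>
  <crawlControllerState>NASCENT</crawlControllerState>
  <statusDescription>Ready</statusDescription>
  <availableActions>
    <value>launch</value>
    <value>teardown</value>
  </availableActions>
  <hasApplicationContext>true</hasApplicationContext>
</job>"""

if __name__ == "__main__":
    for label, sample in (("unbuilt", UNBUILT_SAMPLE), ("built", BUILT_SAMPLE)):
        print(label, parse_job_status(sample), is_job_built(sample))
```

Checking availableActions alone would not be enough here, since launch is offered in both responses; that matches the earlier observation that the launch button is available either way.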

Tasks: