internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

${launchId} is not being replaced (sometimes) #495

Open cgr71ii opened 1 year ago

cgr71ii commented 1 year ago

Hi,

I noticed in the code that the value "${launchId}" is expected to be replaced, though I'm not sure with what. While trying to understand the configuration file, I found that the disposition chain uses "${launchId}" as the value of the directory property. If I'm not mistaken, this should create a directory named after the substituted value. Instead, sometimes (not always), the job directory ends up containing a directory with the literal name "${launchId}". Is this expected? Other properties also use this value, but I haven't checked whether they are affected as well.

<!-- DISPOSITION CHAIN -->
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterChainProcessor">
  <!-- ... -->
  <!-- <property name="directory" value="${launchId}" /> -->
  <!-- ... -->
</bean>

I'm using the latest version of Heritrix (latest commit from master).

I think this has happened to me whenever the issue described in this comment occurred:

ls -la /home/cgarcia/Documentos/heritrix3/build_1661889017/heritrix-3.4.0-SNAPSHOT/jobs/clashroyale/\$\{launchId\}/

# total 12
# drwxrwxr-x 3 cgarcia cgarcia 4096 ago 31 11:31 .
# drwxrwxr-x 8 cgarcia cgarcia 4096 ago 31 19:16 ..
# drwxrwxr-x 2 cgarcia cgarcia 4096 ago 31 11:31 reports

ls -la /home/cgarcia/Documentos/heritrix3/build_1661889017/heritrix-3.4.0-SNAPSHOT/jobs/clashroyale/\$\{launchId\}/reports/

# drwxrwxr-x 2 cgarcia cgarcia 4096 ago 31 11:31 .
# drwxrwxr-x 3 cgarcia cgarcia 4096 ago 31 11:31 ..
# -rw-rw-r-- 1 cgarcia cgarcia  280 ago 31 11:31 crawl-report.txt
# -rw-rw-r-- 1 cgarcia cgarcia    0 ago 31 11:31 seeds-report.txt
# -rw-rw-r-- 1 cgarcia cgarcia   13 ago 31 11:31 threads-report.txt
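
To check whether other jobs are affected, a small shell sketch along these lines could list every directory that still carries the literal, unexpanded placeholder name. The function name `find_unexpanded_launch_dirs` is hypothetical, and the sketch only assumes that job directories live under a common parent directory, as in the listing above.

```shell
# Hypothetical helper: list directories under a jobs directory whose name
# is the literal, unexpanded placeholder ${launchId}.
find_unexpanded_launch_dirs() {
    jobs_dir="$1"
    # Single quotes stop the shell from expanding ${launchId};
    # find's -name pattern treats $, { and } as ordinary characters.
    find "$jobs_dir" -type d -name '${launchId}'
}
```

It would be called with the parent of the job directories, for example `find_unexpanded_launch_dirs jobs`. An empty result would suggest the placeholder was substituted in every launch found there.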

cgr71ii commented 1 year ago

It seems that the part of the configuration affected by this issue is:

<bean id="statisticsTracker" class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName">
  <!-- ... -->
  <!-- <property name="reportsDir" value="${launchId}/reports" /> -->
  <!-- ... -->
</bean>
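
To see which other properties reference the placeholder, a fixed-string search over the job's configuration file is one option. This is only a sketch: the function name `list_launch_id_refs` is hypothetical, and the file to pass in (typically the job's `crawler-beans.cxml`) is an assumption not stated above.

```shell
# Hypothetical helper: print each line (with its line number) of a
# configuration file that mentions the literal string ${launchId}.
list_launch_id_refs() {
    config_file="$1"
    # -F matches the pattern as a fixed string, so $, { and } are not
    # interpreted as regular-expression syntax; -n adds line numbers.
    grep -n -F '${launchId}' "$config_file"
}
```

Note that the matches include commented-out properties such as the two shown in this thread, so each hit still needs to be read in context.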