Time is not stopped when Disk Space Monitor is triggered and report files are removed

cgr71ii commented 2 years ago

Hi!

I'm crawling with the Disk Space Monitor enabled:

 <bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
   <property name="pauseThresholdMiB" value="5000" />
   <property name="monitorConfigPaths" value="true" />
   <property name="monitorPaths">
     <list>
       <value>/</value>
     </list>
   </property>
 </bean>

Unfortunately it was triggered, and I noticed that the crawl's elapsed time kept running. I also have this configuration to stop the crawl after 1 week:

 <bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer">
  <!-- <property name="maxBytesDownload" value="0" /> -->
  <!-- <property name="maxDocumentsDownload" value="0" /> -->
  <property name="maxTimeSeconds" value="604800" /> <!-- Crawl for a week -->
 </bean>

I guess that if the elapsed time keeps running, the crawl will stop once the week is reached (I haven't tested this further, so I'm not sure). Shouldn't the clock stop while the crawl is paused, so that the disk issue can be fixed, if possible, and the crawl can then carry on? Maybe I'm wrong, and the elapsed time shown in the web UI is only for statistics and isn't the same time that is used to stop the crawl. In that case, sorry for the misunderstanding :/

cgr71ii commented 2 years ago

By the way, when this happens, I've noticed that hosts-report.txt, seeds-report.txt, ... are empty. I'm just guessing, but I suppose that when the Disk Space Monitor is triggered it stops all writes, which might interrupt write operations and leave these files unfinished. Shouldn't the Disk Space Monitor let these files finish writing so that the statistics aren't lost?

When I try to open the hosts report in the web UI, the result is:

An error occurred
You may be able to recover and try something else by going back.
Cause: com.sleepycat.je.DiskLimitException: (JE 7.5.11) Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited: maxDiskLimit=0 freeDiskLimit=5,368,709,120 adjustedMaxDiskLimit=0 maxDiskOverage=0 freeDiskShortage=37,150,720 diskFreeSpace=5,331,558,400 availableLogSize=-37,150,720 totalLogSize=234,239,849,739 activeLogSize=234,239,849,739 reservedLogSize=0 protectedLogSize=0 protectedLogSizeMap={}

com.sleepycat.je.DiskLimitException: (JE 7.5.11) Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited: maxDiskLimit=0 freeDiskLimit=5,368,709,120 adjustedMaxDiskLimit=0 maxDiskOverage=0 freeDiskShortage=37,150,720 diskFreeSpace=5,331,558,400 availableLogSize=-37,150,720 totalLogSize=234,239,849,739 activeLogSize=234,239,849,739 reservedLogSize=0 protectedLogSize=0 protectedLogSizeMap={}
    at com.sleepycat.je.Cursor.checkUpdatesAllowed(Cursor.java:5337)
    at com.sleepycat.je.Cursor.checkUpdatesAllowed(Cursor.java:5314)
    at com.sleepycat.je.Cursor.putInternal(Cursor.java:2410)
    at com.sleepycat.je.Cursor.putInternal(Cursor.java:830)
    at com.sleepycat.je.Cursor.put(Cursor.java:787)
    at com.sleepycat.je.Cursor.put(Cursor.java:885)
    at com.sleepycat.util.keyrange.RangeCursor.put(RangeCursor.java:1055)
    at com.sleepycat.collections.DataCursor.put(DataCursor.java:802)
    at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:329)
    at com.sleepycat.collections.StoredMap.put(StoredMap.java:285)
    at org.archive.crawler.reporting.StatisticsTracker$2.execute(StatisticsTracker.java:866)
    at org.archive.modules.fetcher.DefaultServerCache.forAllHostsDo(DefaultServerCache.java:157)
    at org.archive.crawler.reporting.StatisticsTracker.calcReverseSortedHostsDistribution(StatisticsTracker.java:862)
    at org.archive.crawler.reporting.HostsReport.write(HostsReport.java:82)
    at org.archive.crawler.reporting.StatisticsTracker.writeReportFile(StatisticsTracker.java:898)
    at org.archive.crawler.reporting.StatisticsTracker.writeReportFile(StatisticsTracker.java:875)
    at org.archive.crawler.restlet.ReportGenResource.get(ReportGenResource.java:55)
    at org.restlet.resource.ServerResource.doHandle(ServerResource.java:603)
    at org.restlet.resource.ServerResource.doNegotiatedHandle(ServerResource.java:662)
    at org.restlet.resource.ServerResource.doConditionalHandle(ServerResource.java:348)
    at org.restlet.resource.ServerResource.handle(ServerResource.java:1020)
    at org.restlet.resource.Finder.handle(Finder.java:236)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.routing.Router.doHandle(Router.java:422)
    at org.restlet.routing.Router.handle(Router.java:641)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.engine.application.StatusFilter.doHandle(StatusFilter.java:140)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.engine.CompositeHelper.handle(CompositeHelper.java:202)
    at org.restlet.engine.application.ApplicationHelper.handle(ApplicationHelper.java:77)
    at org.restlet.Application.handle(Application.java:385)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.routing.Router.doHandle(Router.java:422)
    at org.restlet.routing.Router.handle(Router.java:641)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.routing.Router.doHandle(Router.java:422)
    at org.restlet.routing.Router.handle(Router.java:641)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.engine.application.StatusFilter.doHandle(StatusFilter.java:140)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.routing.Filter.doHandle(Filter.java:150)
    at org.restlet.routing.Filter.handle(Filter.java:197)
    at org.restlet.engine.CompositeHelper.handle(CompositeHelper.java:202)
    at org.restlet.Component.handle(Component.java:408)
    at org.restlet.Server.handle(Server.java:507)
    at org.restlet.engine.connector.ServerHelper.handle(ServerHelper.java:63)
    at org.restlet.engine.adapter.HttpServerHelper.handle(HttpServerHelper.java:143)
    at org.restlet.ext.jetty.JettyServerHelper$WrappedServer.handle(JettyServerHelper.java:237)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:279)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:540)
    at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:395)
    at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:161)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:779)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:911)
    at java.base/java.lang.Thread.run(Thread.java:829)

ato commented 2 years ago

I think the problem you've run into is that DiskSpaceMonitor is set to 5000 MiB, which is lower than the BDB je.freeDisk default of 5 GiB (5120 MiB). So I think BDB had already thrown a DiskLimitException before DiskSpaceMonitor could pause the crawl. Your trace is consistent with that: diskFreeSpace=5,331,558,400 bytes is below freeDiskLimit=5,368,709,120 but still above 5000 MiB (5,242,880,000 bytes). There's probably a lot of code in Heritrix that can't gracefully recover from a database exception.

It seems like a real gotcha that the default pause threshold is 500 MiB. Perhaps the BDB threshold changed at some point. To try to address this I've increased the default pause threshold to 8 GiB and added a note to the default job profile warning that you need to keep 5 GiB free for BDB.
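
For a job that sets the threshold explicitly, as in the configuration at the top of this issue, the same adjustment might look roughly like the sketch below. It reuses the bean from the original report unchanged except for pauseThresholdMiB; the value 8192 is only an assumption chosen to mirror the new 8 GiB default, and any value comfortably above 5120 MiB should have the same effect.

```xml
<bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
  <!-- Assumed value: 8192 MiB (8 GiB), above BDB's 5 GiB (5120 MiB) je.freeDisk limit,
       so the monitor can pause the crawl before BDB starts refusing writes. -->
  <property name="pauseThresholdMiB" value="8192" />
  <property name="monitorConfigPaths" value="true" />
  <property name="monitorPaths">
    <list>
      <value>/</value>
    </list>
  </property>
</bean>
```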

cgr71ii commented 2 years ago

Oh, so you're saying that there's a hard limit imposed by BDB's je.freeDisk that isn't configurable? And since this limit is 5 GiB, the DiskSpaceMonitor should be configured with a higher value so that it triggers before BDB throws its exception, right?

ato commented 2 years ago

Yes.

BDB itself (not the Heritrix job config) does have a mechanism for configuring it by editing a file, but the BDB documentation implies it's set at 5 GiB for a good reason. I haven't looked into it deeply myself, but there's some discussion in issue #340.

cgr71ii commented 2 years ago

Oh, ok! I hadn't understood that thread very well when I ran into it. Thank you for the explanation!

internetarchive / heritrix3

Time is not stopped when Disk Space Monitor is triggered and report files are removed #499