internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

dnsjava NIO selector thread stuck at 100% after terminating job #425

Closed jmvezic closed 2 years ago

jmvezic commented 3 years ago

With the latest version (20210803), and many versions before it, one CPU thread appears to be stuck at 100% after a job is terminated, even though nothing is running. It stays that way until I restart Heritrix.

For reference, this doesn't happen with version 20200304, for example. I haven't tried all versions, so I don't know when this problem started. There's also nothing in the logs that would indicate something is wrong.

To reproduce: use the default crawler-beans with an operator URL set and any seed you like.

jmvezic commented 2 years ago

Update: this bug was introduced in version 3.4.0-20210617 and is present in all later versions.

ato commented 2 years ago

Confirming I can reproduce this in 20210803. Hitting shift-H in top shows it's the dnsjava NIO selector thread. Here's the stack trace (from jstack <pid>):

"dnsjava NIO selector" #67 daemon prio=4 os_prio=0 cpu=1014221.96ms elapsed=1060.26s tid=0x00007fd000011800 nid=0x90e27 runnable  [0x00007fd0e07bf000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPoll.wait(java.base@11.0.12/Native Method)
    at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.12/EPollSelectorImpl.java:120)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.12/SelectorImpl.java:124)
    - locked <0x00000000f41af7b8> (a sun.nio.ch.Util$2)
    - locked <0x00000000f41af558> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(java.base@11.0.12/SelectorImpl.java:136)
    at org.xbill.DNS.Client.runSelector(Client.java:67)
    at org.xbill.DNS.Client$$Lambda$308/0x00000001004ec840.run(Unknown Source)
    at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)

ato commented 2 years ago

Poking at this with a debugger a bit, it appears select() returns immediately because the thread was interrupted. dnsjava's runSelector() code never clears the interrupted flag, so it just busy-loops calling select(). It looks like the dnsjava NIO selector thread ends up in ToePool.getToes(), which presumably means ToePool.shutdown() is interrupting it.
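
The busy loop can be reproduced without dnsjava or Heritrix. The sketch below is a standalone illustration, not dnsjava's code: the class and method names are hypothetical, and the 100 ms timeout and call count are arbitrary. It times a few select(timeout) calls with the interrupt flag set, then again after clearing the flag with Thread.interrupted().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;

// Hypothetical demo class; not part of dnsjava or Heritrix.
class InterruptedSelectDemo {

    /** Calls select(timeoutMs) `calls` times and returns the total elapsed milliseconds. */
    static long timeSelects(Selector selector, int calls, long timeoutMs) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            selector.select(timeoutMs);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Returns {elapsed ms with interrupt flag set, elapsed ms after clearing it}. */
    static long[] measure() {
        try (Selector selector = Selector.open()) {
            // Set the interrupt flag on the current thread. select() does not clear it,
            // so every following call returns without waiting for the timeout.
            Thread.currentThread().interrupt();
            long whileInterrupted = timeSelects(selector, 5, 100);

            // Thread.interrupted() reads AND clears the flag; select() blocks again.
            Thread.interrupted();
            long afterClearing = timeSelects(selector, 5, 100);

            return new long[] { whileInterrupted, afterClearing };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] ms = measure();
        System.out.println("5 selects with interrupt flag set: " + ms[0] + " ms");
        System.out.println("5 selects after clearing the flag: " + ms[1] + " ms");
    }
}
```

With the flag set, the five calls should finish almost instantly. After clearing it, each call should wait out its timeout. A loop that never clears the flag therefore spins, which matches the RUNNABLE stack trace above.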

One workaround might be to have ToePool check the thread name and exclude it from interrupting.
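
A minimal sketch of that name-based exclusion, assuming the threads to interrupt are found by enumerating a ThreadGroup. This is not Heritrix's actual ToePool code: SelectiveInterrupt and interruptAllExcept are hypothetical names, and the thread name is the one shown in the jstack output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper; Heritrix's ToePool may be structured differently.
class SelectiveInterrupt {

    // Thread name as it appears in the stack trace above.
    static final String DNSJAVA_SELECTOR_NAME = "dnsjava NIO selector";

    /**
     * Interrupts every live thread in the group except those with the excluded name.
     * Returns the names of the threads that were interrupted.
     */
    static List<String> interruptAllExcept(ThreadGroup group, String excludedName) {
        // activeCount() is only an estimate, so leave some slack in the array.
        Thread[] threads = new Thread[group.activeCount() + 8];
        int count = group.enumerate(threads, true);
        List<String> interrupted = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Thread t = threads[i];
            if (excludedName.equals(t.getName())) {
                continue; // leave the shared selector thread alone
            }
            t.interrupt();
            interrupted.add(t.getName());
        }
        return interrupted;
    }

    public static void main(String[] args) {
        ThreadGroup group = new ThreadGroup("toe-pool-like");
        CountDownLatch done = new CountDownLatch(1);
        Runnable waiter = () -> {
            try {
                done.await();
            } catch (InterruptedException e) {
                // interrupted: exit the thread
            }
        };
        new Thread(group, waiter, "toe-1").start();
        new Thread(group, waiter, DNSJAVA_SELECTOR_NAME).start();

        System.out.println("interrupted: " + interruptAllExcept(group, DNSJAVA_SELECTOR_NAME));
        done.countDown(); // release the thread that was skipped
    }
}
```

Matching on a thread name is brittle, since it would silently stop working if dnsjava ever renamed the thread.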

As the dnsjava selector thread is global per process it seems wrong that it ends up in the ToePool thread group at all. So perhaps it'd be better to prevent it from being assigned to the group in the first place. I guess one way to do this would be to do a dummy lookup on startup from a thread that's not in a group.
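
For background on why the selector thread can land in that group: a Java thread created without an explicit ThreadGroup normally joins the group of the thread that creates it (when no security manager overrides this). The sketch below shows only that JVM behaviour. The class and thread names are hypothetical, and it is an assumption here that dnsjava starts its selector thread lazily on the first lookup, which would put it in the group of whichever thread performs that lookup.

```java
// Hypothetical demo class; it illustrates ThreadGroup inheritance only.
class ThreadGroupInheritanceDemo {

    /**
     * Runs a "creator" thread inside creatorGroup, has it construct a helper
     * thread without naming a group, and returns the helper's group name.
     */
    static String groupOfThreadCreatedFrom(ThreadGroup creatorGroup) {
        String[] result = new String[1];
        Thread creator = new Thread(creatorGroup, () -> {
            // No ThreadGroup argument: the helper inherits the creator's group.
            Thread helper = new Thread(() -> { }, "lazy-helper");
            result[0] = helper.getThreadGroup().getName();
        }, "creator");
        creator.start();
        try {
            creator.join(); // join() also makes result[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        // Helper created from a thread inside a pool-like group.
        System.out.println("created from pool thread: "
                + groupOfThreadCreatedFrom(new ThreadGroup("toe-pool-like")));
        // Helper created from the current thread's own group, as a startup
        // "dummy lookup" outside the pool would do.
        System.out.println("created at startup:       "
                + groupOfThreadCreatedFrom(Thread.currentThread().getThreadGroup()));
    }
}
```

If the assumption about lazy creation holds, triggering the first lookup from a thread outside the ToePool group would keep the selector thread out of the set that shutdown interrupts.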