Whenever the content of a WARC resource exceeds the in-memory threshold, it is flushed to temporary storage before being passed to Tika in the form of a `RandomAccessFile` (RAF). When processing of the resource has finished, the temporary file is deleted.
The problem is that the RAF is never closed. On most Linux file systems a file can be marked as deleted while its data stays on storage for as long as something still references it, and the open RAF is such a reference. The `WARCIndexer` only holds a single RAF reference at a time, so the previous one is left for the garbage collector, which eventually releases the connection to the cache file and lets its storage be freed. There is no guarantee as to when this happens, though, so for us this caused a build-up of "deleted" temporary files that filled `/tmp`. Running the tool `lsof` while indexing a pseudo-random WARC file gave us a list of open temporary files
which kept growing. Only the 5 non-`(deleted)` files in the `lsof` output are active. The rest are waiting for the garbage collector to come by.

The fix is simple: close the RAF before the cache file is marked as deleted. This pull request does that.
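
The close-before-delete ordering can be illustrated with a minimal, self-contained sketch. This is not the actual `WARCIndexer` code: the class `TempCacheSketch`, the method `processViaTempFile`, and the temp file prefix are hypothetical names chosen for illustration, and reading the file length stands in for handing the content to a parser.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;

public class TempCacheSketch {

    /**
     * Hypothetical helper: spills a payload to a temporary file, reads it
     * through a RandomAccessFile, and removes the file afterwards.
     */
    public static long processViaTempFile(byte[] payload) throws IOException {
        File cacheFile = File.createTempFile("warc-cache-", ".tmp");
        try {
            Files.write(cacheFile.toPath(), payload);
            // try-with-resources closes the RAF, and with it the file
            // descriptor, as soon as this inner block exits.
            try (RandomAccessFile raf = new RandomAccessFile(cacheFile, "r")) {
                return raf.length(); // stand-in for the real processing step
            }
        } finally {
            // Runs only after the RAF has been closed, so no open handle
            // keeps the deleted file's data alive on storage.
            Files.deleteIfExists(cacheFile.toPath());
        }
    }

    public static void main(String[] args) throws IOException {
        long size = processViaTempFile(new byte[4096]);
        System.out.println("processed " + size + " bytes");
    }
}
```

Closing explicitly, rather than relying on the garbage collector, makes the release of the file descriptor deterministic, so the space in `/tmp` is freed as soon as each resource has been processed.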