ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Close RandomAccessFile connection to the cacheFile upon cleanup #172

Closed tokee closed 6 years ago

tokee commented 6 years ago

Whenever the content of a WARC resource exceeds the in-memory threshold, it is flushed to a temporary file, which is then passed to Tika as a RandomAccessFile (RAF). When processing of the resource has finished, the temporary file is deleted.

The problem here is that the RAF is not closed. On most Linux file systems a file can be marked as deleted, but its data stays on storage for as long as something still holds a reference to it, and the open RAF is such a reference. The WARCIndexer only holds a single RAF reference at a time, leaving the old one to be garbage collected; only then is the connection to the cache file severed and its space on storage freed. There is no guarantee as to when this happens, so for us this caused a build-up of "deleted" temporary files, filling /tmp. Running the tool lsof while indexing a pseudo-random WARC file gave us

java    48381  dsc   22r   REG              253,2   9013962         206 /tmp/warc-indexer3311138792711259771.cache (deleted)                     
java    48381  dsc   24r   REG              253,2   1421443         207 /tmp/warc-indexer7696636648013753028.cache (deleted)                     
java    48381  dsc   25r   REG              253,2   2259122         208 /tmp/warc-indexer5142344726103965568.cache (deleted)                     
java    48381  dsc   26r   REG              253,2   3831800         209 /tmp/warc-indexer2012538349600633244.cache (deleted)                     
java    48381  dsc   27r   REG              253,2   9074195         211 /tmp/warc-indexer2536124259283745078.cache (deleted)                     
java    48381  dsc   28r   REG              253,2   1057565         212 /tmp/warc-indexer7581816373325684343.cache (deleted)                     
java    48381  dsc   29r   REG              253,2   2480226         213 /tmp/warc-indexer2709051687788986517.cache (deleted)                     
java    48381  dsc   30r   REG              253,2   2747973         214 /tmp/warc-indexer5638164968725676591.cache (deleted)                     
java    48381  dsc   31r   REG              253,2   1857586         215 /tmp/warc-indexer3689572671063310333.cache (deleted)                     
java    48381  dsc   32r   REG              253,2   2583860         216 /tmp/warc-indexer6082754424301938833.cache (deleted)                     
java    48381  dsc   33r   REG              253,2   1124955         217 /tmp/warc-indexer4388487660065495653.cache (deleted)                     
java    48381  dsc   34r   REG              253,2   1276131         219 /tmp/warc-indexer939663438074756813.cache (deleted)                      
java    48381  dsc   35r   REG              253,2   1136015         220 /tmp/warc-indexer2547763440606143316.cache (deleted)                     
java    48381  dsc   36r   REG              253,2   1630251         221 /tmp/warc-indexer5284361522561130084.cache (deleted)                     
java    48381  dsc   37u   REG              253,2        29         222 /tmp/imageio5964330596753495984.tmp                                      
java    48381  dsc   38r   REG              253,2   2150520         223 /tmp/warc-indexer1480110357313680149.cache (deleted)                     
java    48381  dsc   39r   REG              253,2   5147494         224 /tmp/warc-indexer7441772170941942840.cache (deleted)                     
java    48381  dsc   40r   REG              253,2   1362166         225 /tmp/warc-indexer8491318496537278415.cache (deleted)                     
java    48381  dsc   41r   REG              253,2   3427866         226 /tmp/warc-indexer8175151172074937632.cache (deleted)                     
java    48381  dsc   42u   REG              253,2        29         227 /tmp/imageio7765671169274784112.tmp                                      
java    48381  dsc   43r   REG              253,2   2304176         228 /tmp/warc-indexer1455632582755460317.cache (deleted)                     
java    48381  dsc   44r   REG              253,2   1240147         229 /tmp/warc-indexer139955523139767505.cache (deleted)                      
java    48381  dsc   45u   REG              253,2        29         230 /tmp/imageio8225065622355706934.tmp                                      
java    48381  dsc   46u   REG              253,2        29         231 /tmp/imageio2017012681062907779.tmp                                      
java    48381  dsc   47u   REG              253,2        29         232 /tmp/imageio2797082385182266845.tmp                                      
java    48381  dsc   48u   REG              253,2        29         233 /tmp/imageio1900827043635493676.tmp                                      
java    48381  dsc   49u   REG              253,2        29         234 /tmp/imageio9038476157304415191.tmp                   

which kept growing. Only the 7 entries not marked (deleted) in the output above are active. The rest are waiting for the garbage collector to come by.

The fix is simple: Close the RAF before the cache file is marked as deleted. This pull request does that.
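The actual change is in the pull request itself. As a minimal sketch of the idea only, the ordering can be illustrated like this; the class `CacheFileCleanup` and the helper `readThenDelete` are hypothetical names made up for this example and are not taken from webarchive-discovery:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class CacheFileCleanup {

    // Hypothetical helper: reads the whole cache file through a
    // RandomAccessFile, closes it, and only then deletes the file.
    static long readThenDelete(File cacheFile) throws IOException {
        try {
            // try-with-resources closes the RAF when the block exits,
            // which releases the file descriptor held by the JVM.
            try (RandomAccessFile raf = new RandomAccessFile(cacheFile, "r")) {
                byte[] buffer = new byte[(int) raf.length()];
                raf.readFully(buffer);
                return buffer.length;
            }
        } finally {
            // Runs after the RAF has been closed, so no open descriptor
            // keeps the deleted file's data alive on storage.
            Files.deleteIfExists(cacheFile.toPath());
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a resource that was spilled to temporary storage.
        File cacheFile = File.createTempFile("warc-indexer", ".cache");
        Files.write(cacheFile.toPath(), "example payload".getBytes(StandardCharsets.UTF_8));

        long bytesRead = readThenDelete(cacheFile);
        System.out.println("bytes read: " + bytesRead);
        System.out.println("cache file still exists: " + cacheFile.exists());
    }
}
```

With this ordering the descriptor is released deterministically rather than whenever the garbage collector gets to it, so the file should no longer show up as (deleted) in lsof.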

anjackson commented 6 years ago

Thanks, guess we happened to avoid this by the way we Map-Reduce.