ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Clean up temporary files underway #252

Open tokee opened 3 years ago

tokee commented 3 years ago

It seems that calling warc-indexer with thousands of WARC files causes the tmp folder to fill up (possibly due to DROID temporary files). It should be possible to clean up temporary files while indexing is underway.
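One possible stopgap, sketched here in Python, is an external sweep that deletes files in the tmp folder once they are older than some threshold. This is not something warc-indexer does itself: the function name `remove_stale_files` and the age-based rule are assumptions made for illustration, and the threshold would need to be comfortably longer than the time spent on any single record so that files still in use are not removed.

```python
import os
import tempfile
import time
from pathlib import Path


def remove_stale_files(tmp_dir, max_age_seconds, now=None):
    """Delete regular files directly inside tmp_dir that were last modified
    more than max_age_seconds ago. Subdirectories are left untouched.

    Hypothetical helper: assumes leaked temporary files can be told apart
    from live ones by age alone.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in Path(tmp_dir).iterdir():
        try:
            if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
                entry.unlink()
                removed.append(entry)
        except FileNotFoundError:
            # The owning process deleted the file first; nothing to do.
            continue
    return removed


if __name__ == "__main__":
    # Small demonstration in a throwaway directory.
    with tempfile.TemporaryDirectory() as demo_dir:
        stale = Path(demo_dir) / "stale.tmp"
        fresh = Path(demo_dir) / "fresh.tmp"
        stale.write_text("left behind")
        fresh.write_text("still in use")
        # Backdate the stale file by two hours.
        two_hours_ago = time.time() - 2 * 60 * 60
        os.utime(stale, (two_hours_ago, two_hours_ago))
        removed = remove_stale_files(demo_dir, max_age_seconds=60 * 60)
        print([p.name for p in removed])
```

Running such a sweep periodically alongside a long indexing job would only mask the leak, so it is a workaround rather than a fix for whatever is leaving the files behind.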

anjackson commented 3 years ago

I think this likely relates to this issue: https://github.com/openpreserve/nanite/pull/36

Unfortunately, the pull was full of whitespace changes and I couldn't work out what was happening. I'll have to try and fix it up.

anjackson commented 3 years ago

Hm, also https://github.com/openpreserve/nanite/pull/40. This part of the code seems to be a bit of a mess, as those two pulls were a bit out of sync, so I'll try to tidy it up.

anjackson commented 3 years ago

Well, that was messy, but I think the Nanite code is better now. Just released 1.4.1-97 and will update this project when it becomes available.

anjackson commented 3 years ago

Actually, let's leave this open until we've proved the Nanite update resolved the issue.

anjackson commented 1 year ago

Note that Tika < 1.25 has also been reported as generating a lot of tmp files (https://issues.apache.org/jira/browse/TIKA-3203) so that might also be the issue. I've updated to 1.28.5 and I'm looking at getting to Tika 2.7.