drh-stanford closed this issue 7 years ago
The majority of the space appears to have been taken up by a failed process?
-rw-r--r-- 1 lyberadmin lyberteam 84M Dec 6 14:20 pf139tj8228_one_step_u_merged.cdx_3
-rw-r--r-- 1 lyberadmin lyberteam 106G Dec 6 17:33 pf139tj8228_sorted_index.cdx_1
-rw-r--r-- 1 lyberadmin lyberteam 106G Dec 8 22:56 pf139tj8228_sorted_index.cdx_2
-rw-r--r-- 1 lyberadmin lyberteam 106G Dec 9 02:08 pf139tj8228_sorted_index.cdx_3
The pf139tj8228_ files were from my statistics gathering and can be deleted.
from #21, @drh-stanford says:
The CDXMergeSortPublishService does not handle error conditions when doing the sort/merge/publish. If a sort fails, the cdx_working/ folder does not get cleaned up. On success, the individual CDX files are moved into cdx_backup/ but are never cleaned up.
Since these files are strictly derivatives and we can re-generate them from the WARC files, I suggest that we have a cronjob that goes through these folders and removes any that are "old" (a month?). Otherwise, the space requirements double for the indices folders.
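A rough sketch of what such a cleanup job might look like, in Python. The function name `remove_old_files`, the 30-day default, the dry-run flag, and the relative folder paths are all assumptions for illustration; nothing here is part of the existing service.

```python
import os
import time
from pathlib import Path


def remove_old_files(folder, max_age_days=30, now=None, dry_run=True):
    """Find regular files under `folder` last modified more than
    `max_age_days` ago. Deletes them unless `dry_run` is True.
    Returns the list of matching paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    matched = []
    for path in sorted(Path(folder).rglob("*")):
        # Skip directories and symlinks; only plain files are candidates.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            matched.append(path)
    return matched


if __name__ == "__main__":
    # Hypothetical locations; the real paths depend on the deployment.
    for folder in ("cdx_working", "cdx_backup"):
        if os.path.isdir(folder):
            for p in remove_old_files(folder, max_age_days=30, dry_run=True):
                print("would remove", p)
```

The default is a dry run, so the candidates can be reviewed before anything is deleted. Run from cron with `dry_run=False`, it would cover both the leftover cdx_working/ files from failed sorts and the accumulated cdx_backup/ files.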
closing in favor of #21
Also cdx_backup?