sul-dlss / web-archiving

placeholder for web archiving work

Should we be deleting files in web-archiving-stacks/data/indicies/cdx_working? #18

Closed. drh-stanford closed this issue 7 years ago.

drh-stanford commented 7 years ago

Also cdx_backup?

```
[lyberadmin@wayback-prod indices]$ du -hs *
113G    cdx
110G    cdx_backup
319G    cdx_working
35M     path
4.0K    path_working
```
drh-stanford commented 7 years ago

Also see https://github.com/sul-dlss/was_robot_suite/issues/21

drh-stanford commented 7 years ago

The majority of the space appears to have come from a failed process:

```
-rw-r--r-- 1 lyberadmin lyberteam  84M Dec  6 14:20 pf139tj8228_one_step_u_merged.cdx_3
-rw-r--r-- 1 lyberadmin lyberteam 106G Dec  6 17:33 pf139tj8228_sorted_index.cdx_1
-rw-r--r-- 1 lyberadmin lyberteam 106G Dec  8 22:56 pf139tj8228_sorted_index.cdx_2
-rw-r--r-- 1 lyberadmin lyberteam 106G Dec  9 02:08 pf139tj8228_sorted_index.cdx_3
```
ndushay commented 7 years ago

The pf139tj8228_* files were from my statistics gathering and can be deleted.
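A one-off cleanup along these lines would remove those leftovers. This is only a sketch: the helper name is invented here, and the default directory is an assumption drawn from the listings above.

```shell
# Hypothetical helper: delete leftover working files for one druid prefix.
# The default path is an assumption based on the du/ls output above.
purge_druid_leftovers() {
  druid="$1"
  working_dir="${2:-/web-archiving-stacks/data/indicies/cdx_working}"
  # -f so an already-clean directory is not an error; -v logs removals.
  rm -f -v "$working_dir/${druid}"_*
}
```

It could then be run as, e.g., `purge_druid_leftovers pf139tj8228` (or with an explicit second argument for a different working directory).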

ndushay commented 7 years ago

From #21, @drh-stanford says:

> The CDXMergeSortPublishService does not handle error conditions when doing the sort/merge/publish. On error of the sorts, the cdx_working/ folder does not get cleaned up. And on success, the individual CDX files are moved into cdx_backup/ but are never cleaned up.
>
> Since these files are strictly derivatives and we can re-generate them from the WARC files, I suggest that we have a cronjob that goes through these folders and removes any that are "old" (a month?). Otherwise, the space requirements double for the indices folders.
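A minimal sketch of that cron job could look like the following. The function name, the indices path, and the 30-day threshold are all assumptions ("a month?" above is a suggestion, not settled policy):

```shell
# Hypothetical cleanup for the suggestion above: remove regular files
# older than a given number of days from cdx_working/ and cdx_backup/.
cleanup_cdx_dirs() {
  indices_dir="$1"   # e.g. /web-archiving-stacks/data/indicies (assumed path)
  max_age_days="$2"  # the "old" threshold, in days
  for dir in cdx_working cdx_backup; do
    # -mtime +N matches files last modified more than N days ago.
    find "$indices_dir/$dir" -type f -mtime +"$max_age_days" -delete
  done
}

# A crontab entry could then run it nightly, e.g. (illustrative only):
#   0 3 * * * /bin/sh -c '. /usr/local/lib/cleanup_cdx.sh; cleanup_cdx_dirs /web-archiving-stacks/data/indicies 30'
```

Using `find -delete` keeps the job safe against filenames with spaces, and restricting it to `-type f` leaves the directory structure itself in place.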

ndushay commented 7 years ago

Closing in favor of #21.