Closed qome closed 1 year ago
The first thing you could try is setting up your crawl to use some form of de-duplication, where binary-identical content is recognised and stored as a small WARC revisit record rather than being stored again in full.
Having said that, I'm not sure how clear the documentation is. We may need to dig out some examples.
I have now set up CDX-based de-duplication, and it has worked out great. I am using wget rather than Heritrix at the moment because I want single pages instead of long crawls, and the de-duplication is saving me anywhere from 33% to 75% depending on how static-heavy the pages are. I apologize for not having a direct answer to the Heritrix question, but CDX-based de-duplication may be possible in Heritrix too in some form, and I would encourage anyone with this question to look into it.
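For anyone wanting to try this, here is a minimal sketch of what my setup looks like with GNU Wget (1.14 or later, which added WARC support). The forum URL and file names are placeholders, not my actual crawl:

```shell
# First crawl: write a WARC file plus a CDX index of every record stored.
# (https://example.org/forum/ and the file prefixes are placeholders.)
wget --warc-file=crawl-001 --warc-cdx \
     --page-requisites --no-verbose \
     https://example.org/forum/

# Later crawls: point --warc-dedup at the CDX from the earlier crawl.
# Responses whose payload digest matches a record already listed in the
# CDX are written as small revisit records instead of full responses.
wget --warc-file=crawl-002 --warc-cdx --warc-dedup=crawl-001.cdx \
     --page-requisites --no-verbose \
     https://example.org/forum/
```

Between runs you can concatenate the new CDX output into one growing index and feed that to the next `--warc-dedup`, so each crawl de-duplicates against everything seen so far.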
What are some strategies for reducing disk usage? I am crawling a specific forum, starting at its home page and restarting the crawl through the REST API once per hour. I had hoped the hourly restarts would cut down on re-archiving old, unchanged pages and posts, but I still want to keep exploring ways to reduce disk usage.
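For context, the hourly restart is just a cron-driven call against the Heritrix 3 engine API, roughly like the sketch below. The job name `forumcrawl` and the default `admin:admin` credentials are placeholders for whatever your installation uses:

```shell
# The Heritrix 3 engine API takes a POSTed "action" parameter
# (launch, pause, unpause, checkpoint, terminate, teardown).
# -k accepts Heritrix's self-signed certificate; adjust host/port as needed.

# Stop and clean up the previous run of the job...
curl -k -u admin:admin --anyauth --location \
     -d "action=terminate" https://localhost:8443/engine/job/forumcrawl
curl -k -u admin:admin --anyauth --location \
     -d "action=teardown" https://localhost:8443/engine/job/forumcrawl

# ...then build and launch it again.
curl -k -u admin:admin --anyauth --location \
     -d "action=build" https://localhost:8443/engine/job/forumcrawl
curl -k -u admin:admin --anyauth --location \
     -d "action=launch" https://localhost:8443/engine/job/forumcrawl
```

Restarting this way re-fetches everything from the seed each hour, which is why unchanged pages keep getting stored; de-duplication would address that on the storage side.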