internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Disk Space Usage #291

Closed: qome closed this issue 1 year ago

qome commented 4 years ago

What are some strategies to reduce disk usage? I am crawling a specific forum, starting at its home page, and re-starting the crawl through the REST API once per hour. I had hoped this would cut down on the number of old, unchanged pages and posts I was archiving, but I still want to keep exploring ways to reduce disk usage.
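For context, the hourly restart is just a scripted call to the Heritrix engine's REST API, roughly like the sketch below (job name "forum-crawl", admin:admin credentials, and port 8443 are placeholders/defaults; see the REST API documentation for the exact action sequence):

```bash
# Rough sketch of the hourly restart (placeholder job name and credentials).
JOB_URL="https://localhost:8443/engine/job/forum-crawl"

# Tear down the previous run, rebuild the job, then launch it again.
for action in teardown build launch; do
  curl -s -k -u admin:admin --anyauth --location -d "action=$action" "$JOB_URL"
done
```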

anjackson commented 4 years ago

The first thing you could try is to set up your crawl to use some form of de-duplication, where binary-identical content is recognised and recorded as WARC revisit records rather than stored in full multiple times.

Having said that, I'm not sure how clear the documentation is. We may need to dig out some examples.
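For reference, a minimal sketch of what the duplication-reduction setup in crawler-beans.cxml can look like, assuming the recrawl processors described in the Heritrix documentation (bean ids are arbitrary, and the chain wiring depends on your existing configuration; in the default profile the chains are named `fetchProcessors` and `dispositionProcessors`):

```xml
<!-- Sketch only: duplication-reduction processors; wiring must match
     your own crawler-beans.cxml. -->

<!-- Loads any previously persisted fetch history for each URI
     (goes in the fetch chain, before the fetchers). -->
<bean id="persistLoadProcessor"
      class="org.archive.modules.recrawl.PersistLoadProcessor"/>

<!-- Compares the fresh fetch against that history so binary-identical
     payloads can be written as revisit records (goes after the fetchers). -->
<bean id="fetchHistoryProcessor"
      class="org.archive.modules.recrawl.FetchHistoryProcessor"/>

<!-- Persists the updated fetch history for later crawls
     (goes in the disposition chain). -->
<bean id="persistStoreProcessor"
      class="org.archive.modules.recrawl.PersistStoreProcessor"/>

<!-- Then add <ref bean="persistLoadProcessor"/> and
     <ref bean="fetchHistoryProcessor"/> to the fetchProcessors list, and
     <ref bean="persistStoreProcessor"/> to the dispositionProcessors list. -->
```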

qome commented 4 years ago

I have now set up CDX-based de-duplication, which has worked out great. I am using wget to grab the data now because I want single pages instead of long crawls, and de-duplication is saving me anywhere from 33% to 75% of disk space, depending on how static-heavy the pages are. I apologize for not having a direct answer to the Heritrix question, but CDX-based de-duplication may be possible in Heritrix in some form too, and I would encourage people who have this question to look into it.
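In case it helps anyone else, the wget setup is roughly the following (URL and file names are placeholders): `--warc-cdx` writes a CDX index alongside each WARC, and `--warc-dedup` takes a CDX from an earlier run so unchanged payloads are written as revisit records instead of full copies.

```bash
# First grab: write a WARC plus a CDX index of what was captured.
wget --page-requisites \
     --warc-file=forum-run1 --warc-cdx \
     "https://forum.example.com/"

# Later grabs: feed the earlier CDX back in so binary-identical content
# becomes a revisit record instead of a full copy.
wget --page-requisites \
     --warc-file=forum-run2 --warc-cdx \
     --warc-dedup=forum-run1.cdx \
     "https://forum.example.com/"
```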