Open justinlittman opened 8 years ago
From @dchud:
- Big Heritrix crawls, or perhaps not deduping, or both, are a bottleneck. There are surely tweets with attached media and linked content in my timeline collections that will be gone by the time Heritrix catches up. Adding a big list of accounts up front always takes a long while, so later incremental follow-ups also take a while to process. This might be a result of running user timelines every hour; I will have to play with that. But a second Heritrix process for recent/small batches might help.
@liuqingli also encountered problems with not being able to scale the web harvesters.
Every web harvester container must have a Heritrix container. This is currently done by simple linking. However, this probably won't work well with `docker-compose scale`, as the 1:1 pairing won't occur.
Possible approaches to fixing: