internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Question about memory usage #462

Closed naveen17797 closed 2 years ago

naveen17797 commented 2 years ago

I have around 2-3 jobs running every day. When the jobs run, the heap usage increases, and once the jobs finish, the heap usage stays at the increased value. I am using a fork of Heritrix, so I wanted to understand whether this is the default behaviour: should the memory be freed after a job completes, or is it normal for the memory usage to be retained?


anjackson commented 2 years ago

This is the expected behaviour for any Java program, as the memory will only be freed when garbage collection runs. What should happen is that if you press the blue "run garbage collector" button, the heap drops back down to close to the original value. There are cases where not all memory can be freed, so it might not go back to the exact same value, but it should be fairly close.

If you find this is not the case, feel free to reopen this issue.
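
The behaviour described above can be seen outside Heritrix with a small, hedged sketch in plain Java. It is not Heritrix code: the class name `HeapAfterGc` and the helper `usedHeapBytes` are made up for illustration, and `System.gc()` is only a request that the JVM may ignore, so the exact numbers will vary between runs and JVMs.

```java
public class HeapAfterGc {

    /** Heap currently in use, in bytes: allocated heap minus the free part of it. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Simulate a job that allocates memory while it runs (32 x 1 MiB).
        byte[][] jobData = new byte[32][];
        for (int i = 0; i < jobData.length; i++) {
            jobData[i] = new byte[1024 * 1024];
        }
        long duringJob = usedHeapBytes();

        // The "job" is finished: drop the reference. The memory is now
        // unreachable, but it stays counted as used until a collection runs.
        jobData = null;
        long afterJob = usedHeapBytes();

        // Ask the JVM to collect. This is a hint, not a guarantee.
        System.gc();
        long afterGc = usedHeapBytes();

        System.out.printf("before job:  %d KiB%n", before / 1024);
        System.out.printf("during job:  %d KiB%n", duringJob / 1024);
        System.out.printf("job done:    %d KiB%n", afterJob / 1024);
        System.out.printf("after gc():  %d KiB%n", afterGc / 1024);
    }
}
```

On a typical JVM the "job done" figure stays near the "during job" figure, and only the "after gc()" figure falls back towards the starting value, which matches what the heap graph shows before and after pressing the garbage collector button.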

naveen17797 commented 2 years ago

Hi @anjackson, thanks for the answer. What I wanted to understand is whether Heritrix has built-in job scheduling that takes heap usage into account. Right now, when I try to run multiple jobs at once, some jobs get stopped with an out-of-memory exception. Will they be restarted automatically by Heritrix, or is there a configuration option in Heritrix where I can specify how many jobs to run concurrently?

anjackson commented 9 months ago

@naveen17797 Heritrix has no in-built method for job scheduling at all. The number of jobs running and their resource usage are expected to be managed by the operator. Sorry for the late delivery of bad news!
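
Since the limit has to be enforced by the operator, one possible approach is a small wrapper that never lets more than a fixed number of jobs run at once. The sketch below is an assumption-laden illustration, not part of Heritrix: `JobLimiter`, `runAll` and `runCrawlJob` are hypothetical names, and `runCrawlJob` only sleeps where a real wrapper would launch a crawl job and wait for it to finish.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class JobLimiter {

    private static final AtomicInteger running = new AtomicInteger();
    private static final AtomicInteger maxObserved = new AtomicInteger();

    /** Hypothetical stand-in for "launch a crawl job and block until it ends". */
    static void runCrawlJob(String jobName) throws InterruptedException {
        int now = running.incrementAndGet();
        maxObserved.accumulateAndGet(now, Math::max); // record peak concurrency
        try {
            Thread.sleep(50); // placeholder for the real work
        } finally {
            running.decrementAndGet();
        }
    }

    /**
     * Runs every job, at most maxConcurrent at a time, and returns the
     * highest number of jobs that were observed running simultaneously.
     */
    static int runAll(List<String> jobNames, int maxConcurrent) {
        running.set(0);
        maxObserved.set(0);
        // A fixed-size pool queues extra jobs until a worker thread is free.
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        for (String name : jobNames) {
            pool.submit(() -> {
                runCrawlJob(name);
                return null;
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return maxObserved.get();
    }

    public static void main(String[] args) {
        int peak = runAll(List.of("job-a", "job-b", "job-c", "job-d"), 2);
        System.out.println("peak concurrent jobs: " + peak);
    }
}
```

A fixed-size thread pool is used here because it queues surplus jobs instead of rejecting them. It caps the job count only; it does not look at heap usage, so the cap still has to be chosen to fit the memory given to the JVM.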