jungjonghun / crawler4j

Automatically exported from code.google.com/p/crawler4j

Resuming Enabled Large Seed List Takes Forever #95

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

I have a text file containing 1.2 million URLs, and I am simply trying to inject 
them via controller.addSeed(). With resumable crawling disabled this works quickly 
and my threads start crawling the Internet after a couple of seconds, but with 
resume enabled it has been injecting URLs for the past 20 minutes. Is there any 
way to speed up the Berkeley DB (.jdb) creation process? It's not a CPU/RAM 
bottleneck, as both are barely being used.

Original issue reported on code.google.com by chrstah...@gmail.com on 18 Nov 2011 at 5:01
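
For context, the injection described above looks roughly like this minimal sketch. It uses crawler4j's later CrawlConfig-style API (the 2011 version configured things differently); the storage folder, seed file name, and crawler count are placeholder assumptions. setResumableCrawling(true) is what routes every addSeed() call through the transactional Berkeley DB frontier:

```java
import java.io.BufferedReader;
import java.io.FileReader;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class SeedInjection {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl/root"); // placeholder path
        config.setResumableCrawling(true);                // the slow path from this issue

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        // With resumable crawling on, every addSeed() is written through to
        // the transactional Berkeley DB frontier on disk, which is why
        // 1.2 million seeds can take hours.
        try (BufferedReader reader = new BufferedReader(new FileReader("seeds.txt"))) {
            String url;
            while ((url = reader.readLine()) != null) {
                controller.addSeed(url.trim());
            }
        }

        controller.start(WebCrawler.class, 10); // no-op crawler; normally a subclass
    }
}
```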

GoogleCodeExporter commented 9 years ago
I estimate it will take 3-4 hours to inject all the URLs.

Original comment by chrstah...@gmail.com on 18 Nov 2011 at 6:18

GoogleCodeExporter commented 9 years ago
I got it down from 3 hours to 1 hour 40 minutes by switching to no-sync commits, 
but any advice on getting it faster would be helpful.

Original comment by chrstah...@gmail.com on 21 Nov 2011 at 8:06
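
The "no-sync commits" change above most likely corresponds to Berkeley DB JE's durability setting. A minimal sketch of that configuration, assuming you patch the EnvironmentConfig that crawler4j builds for its frontier (where that lives differs across versions, so treat this as illustrative):

```java
import java.io.File;

import com.sleepycat.je.Durability;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class NoSyncFrontierEnv {
    // Opens a BDB JE environment whose commits skip the fsync, trading
    // crash durability for much faster bulk seed injection. The directory
    // is a placeholder for crawler4j's frontier folder.
    static Environment openFrontierEnvironment(File frontierDir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);
        envConfig.setDurability(Durability.COMMIT_NO_SYNC);
        return new Environment(frontierDir, envConfig);
    }
}
```

Durability.COMMIT_WRITE_NO_SYNC is a middle ground if losing un-synced commits on an OS crash is a concern; batching many seed inserts into a single transaction is the other common lever, but that would require deeper changes to crawler4j's frontier code.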

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:10