GoogleCodeExporter closed 9 years ago
If you're crawling a single domain, I suggest doing this: for each URL that you
receive in shouldVisit, extract the initial part and generate a CRC32 checksum
for it. Keep the checksums in a HashSet, and for every URL that you get in
shouldVisit, make sure its checksum has not been seen before. This way you don't
need to change the crawler4j code.
-Yasser
Original comment by ganjisaffar@gmail.com
on 12 Mar 2012 at 3:11
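A minimal sketch of the idea above, not code from crawler4j itself. The class name UrlPrefixFilter and its methods are hypothetical, and "initial part" is assumed here to mean everything before the first '?' or '#', which the comment does not actually specify. Only the JDK's java.util.zip.CRC32 is used; the set is a concurrent one on the assumption that several crawler threads call it at once.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.CRC32;

// Hypothetical helper, not part of crawler4j.
final class UrlPrefixFilter {
    // Checksums seen so far; concurrent because crawler threads may share it.
    private static final Set<Long> SEEN = ConcurrentHashMap.newKeySet();

    // Assumption: the "initial part" is the URL up to the first '?' or '#'.
    static String initialPart(String url) {
        int end = url.length();
        int query = url.indexOf('?');
        int fragment = url.indexOf('#');
        if (query >= 0) end = Math.min(end, query);
        if (fragment >= 0) end = Math.min(end, fragment);
        return url.substring(0, end);
    }

    // CRC32 checksum of the UTF-8 bytes of the given string.
    static long checksum(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Returns true only the first time a given initial part is seen.
    static boolean isNew(String url) {
        return SEEN.add(checksum(initialPart(url)));
    }
}
```

In your crawler's shouldVisit you would then return false whenever UrlPrefixFilter.isNew(url.getURL()) is false, in addition to your usual domain check. Keep in mind that CRC32 is only 32 bits wide, so two different URLs can occasionally share a checksum and one of them would be skipped; storing the extracted string itself in the set avoids that at the cost of more memory.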
Sorry to resurrect an old post. Yasser, can you please give an example of what
you're talking about? I have the same issue and need a solution.
Wouldn't it be better to checksum the whole URL rather than just the initial part?
Original comment by ldrive...@gmail.com
on 17 Feb 2013 at 7:32
Original issue reported on code.google.com by
hiren2hi...@gmail.com
on 11 Mar 2012 at 5:36