sujit-kr / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawler crawls javascript and css files with ? at the end of the url #80

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
I ran into an issue with the crawler. It works really well and is easy to 
learn, but I am not able to stop it from crawling JavaScript or CSS files whose 
URLs are extended with additional variables and parameters. For example:

this works fine
url/folder/blabla.css

but not this one
url/folder/blabla.css?additional=true

Is there an easy way to keep those pages from being crawled?

Anyway, thanks for this awesome program and for sharing your work with us.

Greetings Frank

Original issue reported on code.google.com by frank.ro...@gmail.com on 17 Sep 2011 at 6:47

GoogleCodeExporter commented 9 years ago
I think this is very easy to prevent. In your shouldVisit function, you can 
parse the URL to remove the query string and then check whether it ends with 
.css or .js, ...

-Yasser

Original comment by ganjisaffar@gmail.com on 18 Sep 2011 at 6:59
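
The suggestion above (strip the query string, then test the extension) might look like the following minimal sketch. It is plain Java with no crawler4j dependency, and the class name `QueryAwareExtensionFilter`, the helper names, and the list of skipped extensions are all hypothetical, chosen only for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not part of crawler4j.
class QueryAwareExtensionFilter {
    // Assumed list of extensions to skip; adjust as needed.
    private static final String[] SKIPPED = {".css", ".js"};

    // Returns the URL without its query string ("?...") or fragment ("#...").
    static String stripQueryAndFragment(String url) {
        int end = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            end = query;
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0 && fragment < end) {
            end = fragment;
        }
        return url.substring(0, end);
    }

    // True if the path part of the URL ends with one of the skipped extensions.
    static boolean hasSkippedExtension(String url) {
        String path = stripQueryAndFragment(url).toLowerCase(Locale.ROOT);
        for (String ext : SKIPPED) {
            if (path.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside shouldVisit, the crawler would then return false whenever `hasSkippedExtension` returns true for the URL string being checked. Because only the part before the `?` is tested, a URL such as `url/folder/page.html?file=style.css` is not skipped.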

GoogleCodeExporter commented 9 years ago
    String str_url = "url/folder/blabla.css?additional=true";

    // Cut off the query string, if there is one.
    // indexOf returns -1 when the URL contains no "?".
    int idx_query = str_url.indexOf("?");
    if (idx_query >= 0)
    {
      str_url = str_url.substring(0, idx_query);
    }
    String lower_url = str_url.toLowerCase();

    if (lower_url.endsWith(".css") || lower_url.endsWith(".js") || lower_url.endsWith(".xml"))
    {
      // do nothing (skip this URL)
    }
    else
    {
      // write code you want below...

    }

Original comment by skwdm...@gmail.com on 18 Sep 2011 at 6:53

GoogleCodeExporter commented 9 years ago
Thanks for your fast reply. I solved the problem in the following way.

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz)(|\\?.*))$");

As you can see, I just added (|\\?.*) and it works for me. Thanks anyway for 
your help and suggestions!

Original comment by frank.ro...@gmail.com on 19 Sep 2011 at 2:01
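
To illustrate how the added `(|\\?.*)` group behaves, here is a small self-contained sketch that applies the pattern above to a few URLs. The class name `ExtensionFilterDemo` and the method `isFiltered` are hypothetical, and lowercasing the URL before matching is an assumption, not something stated in this thread.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of crawler4j.
class ExtensionFilterDemo {
    // Same pattern as in the comment above: a listed extension,
    // followed by either nothing or a "?" and anything after it.
    static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz)(|\\?.*))$");

    // True if the whole URL matches the filter, i.e. it should not be crawled.
    static boolean isFiltered(String url) {
        return FILTERS.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }
}
```

With this pattern, `isFiltered` is true for both `url/folder/blabla.css` and `url/folder/blabla.css?additional=true`, and false for `url/folder/index.html`. One thing to be aware of: because the leading `.*` can also span a `?`, a URL such as `url/folder/page.html?x=a.css` matches as well, so pages whose query string happens to end in a listed extension would be skipped too.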