lidoapps / crawler4j

Automatically exported from code.google.com/p/crawler4j

url with "@" won't be crawled #204

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I have a few pages that link to URLs containing the "@" symbol. Due to a bug in
edu.uci.ics.crawler4j.parser.Parser.java, those links are filtered out.

Example link:

http://www.abs.gov.au/AUSSTATS/abs@.nsf/ProductsbyCatalogue/01776D8D7F87E4E2CA25709900222520?OpenDocument

The code says:

    if (!hrefWithoutProtocol.contains("javascript:")
            && !hrefWithoutProtocol.contains("mailto:")
            && !hrefWithoutProtocol.contains("@"))
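
To make the effect of that condition concrete, here is a minimal standalone sketch. It is not the actual crawler4j code: the class name LinkFilterSketch and both method names are hypothetical, and it assumes hrefWithoutProtocol holds the link with the leading "http://" removed. The first method reproduces the quoted condition. The second is one possible alternative, offered only as an assumption about what a fix might look like, that rejects links by scheme prefix instead of by the presence of "@" anywhere in the string.

```java
import java.util.Locale;

// Hypothetical illustration class; not part of crawler4j.
final class LinkFilterSketch {

    // Reproduces the quoted condition: a link is kept only if it contains
    // none of "javascript:", "mailto:" or "@" anywhere in the string.
    static boolean acceptedByQuotedCheck(String hrefWithoutProtocol) {
        return !hrefWithoutProtocol.contains("javascript:")
                && !hrefWithoutProtocol.contains("mailto:")
                && !hrefWithoutProtocol.contains("@");
    }

    // Assumed alternative: reject only links whose scheme is javascript: or
    // mailto:, so an "@" in the path no longer matters.
    static boolean acceptedBySchemeCheck(String href) {
        String lower = href.trim().toLowerCase(Locale.ROOT);
        return !lower.startsWith("javascript:") && !lower.startsWith("mailto:");
    }

    public static void main(String[] args) {
        // The example link from this report, with "http://" removed.
        String href = "www.abs.gov.au/AUSSTATS/abs@.nsf/ProductsbyCatalogue/"
                + "01776D8D7F87E4E2CA25709900222520?OpenDocument";

        System.out.println(acceptedByQuotedCheck(href));   // false: dropped because of "@"
        System.out.println(acceptedBySchemeCheck(href));   // true: kept
        System.out.println(acceptedBySchemeCheck("mailto:someone@example.com")); // false
    }
}
```

One trade-off of the prefix-based variant: an href that is a bare e-mail address without a mailto: prefix would no longer be rejected, so a real fix may need an extra check for that case.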

Original issue reported on code.google.com by huanghui...@gmail.com on 12 Mar 2013 at 5:56

GoogleCodeExporter commented 9 years ago
Will fix this in the next release.

-Yasser

Original comment by ganjisaffar@gmail.com on 12 Mar 2013 at 6:07

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 11 Aug 2014 at 2:33

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:34