Sitemaps with content-type text/xml are ignored

smasher125354 / crawler4j

Automatically exported from code.google.com/p/crawler4j


Sitemaps with content-type text/xml are ignored #316

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Create a simple crawler.
2. Point it to http://dx.com/sitemap.xml (the HTTP Content-Type header is text/xml).
3. Start the crawler.

What is the expected output? What do you see instead?
Expected: the sitemap.xml is parsed and its links are crawled. Instead, the sitemap is ignored.

What version of the product are you using?
latest from master/trunk

Please provide any additional information below.
not needed

Original issue reported on code.google.com by panthro....@gmail.com on 16 Nov 2014 at 5:11

GoogleCodeExporter commented 9 years ago

Class edu.uci.ics.crawler4j.util.Util#hasPlainTextContent checks for 
"text/plain" when it should actually check for any "text/" type (other than html).

I'd suggest replacing the contains with a matches call for performance.

In my current version I have replaced the line 82 with:

if (typeStr.contains("text/plain") || typeStr.contains("text/xml")) {

for simplicity's sake.

Original comment by panthro....@gmail.com on 16 Nov 2014 at 5:14

GoogleCodeExporter commented 9 years ago

Thank you Rafael.

If you look at Wikipedia for the list of text media types, you will find many 
of those.

So, in order to support all of them as well as future text media types, 
I am using "contains".

The current line of code is:
typeStr.contains("text") && !typeStr.contains("html")
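
A minimal sketch of how that check behaves for a few content types. Only the `contains` expression comes from the line quoted above; the class name, the null handling, and the lower-casing are assumptions made for illustration and are not taken from the crawler4j source.

```java
import java.util.Locale;

// Hypothetical stand-in for the check discussed above; not the crawler4j source.
final class ContentTypeCheck {

    // True for any "text" media type except HTML, e.g. text/plain or text/xml.
    static boolean hasPlainTextContent(String contentType) {
        if (contentType == null) {
            return false; // assumption: a missing header counts as non-text
        }
        String typeStr = contentType.toLowerCase(Locale.ROOT);
        return typeStr.contains("text") && !typeStr.contains("html");
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/xml", "text/plain; charset=UTF-8", "text/html", "application/pdf"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + hasPlainTextContent(sample));
        }
    }
}
```

With this check, text/xml and text/plain are accepted while text/html and application/pdf are rejected.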

Original comment by avrah...@gmail.com on 16 Nov 2014 at 5:23

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

As this method is called very often, for performance I'd suggest changing 
to a 
matches("text/(?!html).*");

you can see the test in here: http://www.regexr.com/39tnl
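
A rough sketch of the regex idea in Java, under a few assumptions. `String.matches` compiles its pattern on every call and only succeeds when the whole string matches, so this version compiles the pattern once and uses `find()` with a leading anchor. The class and method names are hypothetical, and whether this is actually faster than two `contains` calls would need to be measured.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not part of crawler4j.
final class TextTypeMatcher {

    // Compiled once and reused: "text/" at the start, not followed by "html".
    private static final Pattern TEXT_NOT_HTML =
            Pattern.compile("^text/(?!html)", Pattern.CASE_INSENSITIVE);

    static boolean isNonHtmlText(String contentType) {
        // find() is enough here because the pattern is anchored with ^.
        return contentType != null
                && TEXT_NOT_HTML.matcher(contentType.trim()).find();
    }

    public static void main(String[] args) {
        System.out.println(isNonHtmlText("text/xml"));
        System.out.println(isNonHtmlText("text/html; charset=UTF-8"));
    }
}
```

Unlike the `contains` check, this only accepts values that start with "text/", so a type that merely contains the word "text" elsewhere would be rejected.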

Original comment by panthro....@gmail.com on 16 Nov 2014 at 5:35

GoogleCodeExporter commented 9 years ago

Tested it and it works.

The DX sitemap is getting crawled.

The default FILTERS pattern isn't allowing the crawler to crawl XML,
so I removed that one from the shouldVisit and it crawls the XML nicely.
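
To illustrate the effect of such a filter, here is a standalone sketch. The extension lists, the class name, and the simplified string-based `shouldVisit` signature are all assumptions for illustration; they are not crawler4j's actual FILTERS pattern or method signature.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of an extension filter; not the crawler4j implementation.
final class UrlFilterSketch {

    // Assumed filter list that blocks URLs ending in .xml.
    static final Pattern FILTERS_WITH_XML =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|pdf|zip|gz|xml))$");

    // The same list with the xml entry removed.
    static final Pattern FILTERS_WITHOUT_XML =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|pdf|zip|gz))$");

    // A URL is visited only when it does not match the filter pattern.
    static boolean shouldVisit(String url, Pattern filters) {
        String href = url.toLowerCase(Locale.ROOT);
        return !filters.matcher(href).matches();
    }

    public static void main(String[] args) {
        String sitemap = "http://dx.com/sitemap.xml";
        System.out.println(shouldVisit(sitemap, FILTERS_WITH_XML));
        System.out.println(shouldVisit(sitemap, FILTERS_WITHOUT_XML));
    }
}
```

With the xml entry in the list the sitemap URL is skipped before its content type is ever checked; without it the URL is visited.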

Can you please recheck? What do you think is the exact problem there?

Original comment by avrah...@gmail.com on 16 Nov 2014 at 5:48

GoogleCodeExporter commented 9 years ago

Yeah, I just saw that the line in my copy of the code is different from 
master.

I must have messed it up while testing/debugging the real error, my bad.

Original comment by panthro....@gmail.com on 16 Nov 2014 at 5:56

GoogleCodeExporter commented 9 years ago

You can invalidate it, sorry for wasting your time.

Original comment by panthro....@gmail.com on 16 Nov 2014 at 5:57

GoogleCodeExporter commented 9 years ago

No problem, you brought up several issues here and did great work.

Keep the issues coming!

Original comment by avrah...@gmail.com on 16 Nov 2014 at 5:59

Changed state: Invalid