Closed GoogleCodeExporter closed 9 years ago
The method edu.uci.ics.crawler4j.util.Util#hasPlainTextContent checks for
"text/plain" when it should actually check for any "text/" type other than HTML.
I'd suggest replacing the contains with a matches call for performance.
In my current version I have replaced line 82 with:
if (typeStr.contains("text/plain") || typeStr.contains("text/xml")) {
for simplicity's sake.
Original comment by panthro....@gmail.com
on 16 Nov 2014 at 5:14
Thank you, Rafael.
If you look at Wikipedia for the list of text media types, you will find many
of them.
So in order to support all of them, as well as future text media types,
I am using "contains".
The current line of code is:
typeStr.contains("text") && !typeStr.contains("html")
Original comment by avrah...@gmail.com
on 16 Nov 2014 at 5:23
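For readers following along, here is a minimal sketch of how the check quoted above behaves on a few media types. The class name, the null handling, and the lowercasing are assumptions made for illustration; only the final boolean expression is taken from the comment, so this is not necessarily the library's actual source.

```java
import java.util.Locale;

// Hypothetical stand-in for Util#hasPlainTextContent (illustration only).
final class ContentTypeCheck {

    static boolean hasPlainTextContent(String contentType) {
        // Assumption: a missing header is treated as "not plain text".
        if (contentType == null) {
            return false;
        }
        // Assumption: the value is lowercased before matching.
        String typeStr = contentType.toLowerCase(Locale.ROOT);
        // The expression quoted in the comment above.
        return typeStr.contains("text") && !typeStr.contains("html");
    }

    public static void main(String[] args) {
        String[] samples = {"text/plain", "text/xml; charset=UTF-8", "text/html", "application/pdf"};
        for (String s : samples) {
            System.out.println(s + " -> " + hasPlainTextContent(s));
        }
    }
}
```

With this expression, any type containing "text" is accepted unless it also contains "html", so text/xml and text/csv pass while text/html does not.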
As this method is called very often, for performance I'd suggest changing
to a
matches("text\/(?!html)");
You can see the test here: http://www.regexr.com/39tnl
Original comment by panthro....@gmail.com
on 16 Nov 2014 at 5:35
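A note on the regex suggestion above, with a sketch. In Java, String.matches succeeds only if the whole string matches the pattern, and "\/" is not a valid escape in a Java string literal, so the expression would need adjusting before it could replace the contains check. The version below is one possible adaptation using a precompiled Pattern and find(); the class and method names are hypothetical, and whether it is actually faster than two contains calls would have to be measured.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regex-based variant (not library code).
final class RegexContentTypeCheck {

    // Anchored at the start: "text/" not immediately followed by "html".
    // Compiled once so the pattern is not rebuilt on every call.
    private static final Pattern TEXT_NOT_HTML = Pattern.compile("^text/(?!html)");

    static boolean hasPlainTextContent(String contentType) {
        if (contentType == null) {
            return false;
        }
        String typeStr = contentType.toLowerCase(Locale.ROOT);
        // find() looks for the pattern at the start of the string;
        // matches() would also require the rest of the string to be consumed.
        return TEXT_NOT_HTML.matcher(typeStr).find();
    }

    // Shows why a bare String.matches call with this pattern does not work.
    static boolean naiveMatches(String contentType) {
        return contentType.matches("text/(?!html)");
    }
}
```

Note that naiveMatches("text/plain") is false, because the lookahead consumes nothing and "plain" is left unmatched.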
Tested it and it works.
The DX sitemap is getting crawled.
The default FILTERS pattern doesn't allow the crawler to crawl XML,
so I removed that one from shouldVisit and it crawls the XML nicely.
Can you please recheck? What do you think the exact problem is there?
Original comment by avrah...@gmail.com
on 16 Nov 2014 at 5:48
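To illustrate the point about FILTERS, here is a self-contained sketch of how an extension filter in a shouldVisit-style method can block a sitemap URL. The extension lists, the class name, and the example URLs are assumptions for illustration; the real default pattern and the shouldVisit signature depend on the crawler4j version and on your own crawler class.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical, standalone illustration of an extension-based URL filter.
final class VisitFilterSketch {

    // Assumed filter that also rejects .xml URLs.
    static final Pattern FILTERS_WITH_XML =
            Pattern.compile(".*(\\.(css|js|gif|jpg|png|zip|gz|xml))$");

    // Same filter with the xml entry removed.
    static final Pattern FILTERS_WITHOUT_XML =
            Pattern.compile(".*(\\.(css|js|gif|jpg|png|zip|gz))$");

    // Returns true when the URL should be crawled, i.e. the filter does not match.
    static boolean shouldVisit(String url, Pattern filters) {
        String href = url.toLowerCase(Locale.ROOT);
        return !filters.matcher(href).matches();
    }
}
```

With the xml entry present, shouldVisit rejects a URL ending in sitemap.xml before the content type is ever inspected, which is consistent with the behavior described in the comment above.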
Yeah, I just saw that the line in my code is different from the one in
master.
I must have messed it up while testing/debugging the real error, my bad.
Original comment by panthro....@gmail.com
on 16 Nov 2014 at 5:56
Yeah, I just saw that the line in my code is different from the one in
master.
I must have messed it up while testing/debugging the real error, my bad.
You can invalidate it, sorry for wasting your time.
Original comment by panthro....@gmail.com
on 16 Nov 2014 at 5:57
No problem, you brought several issues to light here and did great work.
Keep the issues coming!
Original comment by avrah...@gmail.com
on 16 Nov 2014 at 5:59
Original issue reported on code.google.com by
panthro....@gmail.com
on 16 Nov 2014 at 5:11