mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Unexpected behavior of URLCanonicalizer.getCanonicalURL(href, context) #150

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
What is the expected output? What do you see instead?
1.
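import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.MatcherAssert.assertThat;

import org.junit.Test;

import edu.uci.ics.crawler4j.url.URLCanonicalizer;

public class URLCanonicalizerIntegrationTest
{
    @Test
    public void givenHrefAndContext_whenGetCanonicalURL_thenReturnCorrectURL()
    {
        // The href contains a space between "path3A" and "path3B".
        String url = URLCanonicalizer.getCanonicalURL("/path1/path2/path3A path3B/", "http://www.example.com/path1/path2/");
        assertThat(url, equalTo("http://www.example.com/path1/path2/path3A path3B/"));
    }
}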

What version of the product are you using?
    <dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>3.3</version>
    </dependency>

Please provide any additional information below.

Original issue reported on code.google.com by weicheng...@gmail.com on 10 May 2012 at 4:00

GoogleCodeExporter commented 9 years ago
Expected output: http://www.example.com/path1/path2/path3A path3B/, but the actual result was null.

Original comment by weicheng...@gmail.com on 10 May 2012 at 4:13
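One possible explanation for the null, offered as a guess rather than a confirmed diagnosis: if the canonicalizer hands the path to java.net.URI's single-argument constructor, the unencoded space would raise a URISyntaxException, and a catch block could turn that into null. The sketch below uses only the JDK and does not call crawler4j at all. The class name SpaceInPathSketch and its two helper methods are hypothetical names made up for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class for illustration only; not part of crawler4j.
public class SpaceInPathSketch {

    // Returns true if the strict single-argument URI constructor rejects the input.
    public static boolean strictParseFails(String raw) {
        try {
            new URI(raw);
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    // Percent-encodes illegal characters (such as spaces) in a path by going
    // through the multi-argument constructor URI(scheme, host, path, fragment),
    // which quotes illegal characters instead of rejecting them.
    public static String quotePath(String path) {
        try {
            return new URI(null, null, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Path could not be quoted: " + path, e);
        }
    }

    public static void main(String[] args) {
        String path = "/path1/path2/path3A path3B/";
        System.out.println(strictParseFails(path)); // true
        System.out.println(quotePath(path));        // /path1/path2/path3A%20path3B/
    }
}
```

If this is indeed the cause, encoding the space as %20 before the strict parse would avoid the exception. Whether the canonical form should then keep %20 or the raw space is a separate design decision for the library.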

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:18

GoogleCodeExporter commented 9 years ago
I'm seeing the same issue. Outgoing links with spaces in the file name are 
missing from the list returned by 

List<WebURL> urls = htmlParseData.getOutgoingUrls();

Original comment by cratervo...@gmail.com on 7 Nov 2014 at 9:19

GoogleCodeExporter commented 9 years ago
A good example that reproduces this scenario would help me a lot.

Ideally, point me to a real site that shows this behaviour.

Original comment by avrah...@gmail.com on 9 Nov 2014 at 8:22