serp-spider / core

:spider: The PHP SERP Spider - A search engine scraper
https://serp-spider.github.io/
Other
89 stars 44 forks source link

Implementation of RFC-3986 algorithm for the URL resolve #14

Closed RubtsovAV closed 7 years ago

RubtsovAV commented 8 years ago

Improve of algorithm the URL Resolve. Now is based on the RFC-3986 section 5.2.

RubtsovAV commented 8 years ago

Why failed? I checked it before create commit. Maybe it because I changed the test suit?

How I can see why it failed?

gsouf commented 8 years ago

Click on the detail link: https://travis-ci.org/serp-spider/core/jobs/134313100#L245

About the commit, the method removePathDotSegments is not desired here. The goal of resolve is to offer the new url as it is.

For instance :

resolve should not transform the url. This would be the job of some realpath-like method.

RubtsovAV commented 8 years ago

I wrote resolve method based on RFC. If you think it not right, can you fix it?

(And probably need to create the issue in RFC? Or I misunderstood it?)

gsouf commented 8 years ago

I'm still thinking. Though it is not common, it can be confusing in extreme cases to remove dots (./ and ../), but in most of cases it wont give a difference. I can't imagine a case it would be a problem, but in some case you might want to keep the full path (with dots)

Well I'm thinking of adding a third parameter that would control simplification of the path (removing dot links) and that would be enabled by default .

RubtsovAV commented 8 years ago

I think it will be a excess, but if you think it necessary, I can do it.

gsouf commented 8 years ago

What I'm talking about is the optional normalization as described here: https://tools.ietf.org/html/rfc3986#section-6.2.2

RubtsovAV commented 8 years ago

Or we can add this option if it is really necessary?

gsouf commented 8 years ago

Well, you are right, I dont like the idea of a third parameter as well.

I think it could be very helpful, for url comparison, to have an additional method (normalize) that would do that.

In a regular workflow, if you want to resolve and normalize an url, that would give:

$resolvedAndNormalizedUrl = $url->resolve("someUrl")->normalize();

In this case resolve would never normalize the url, and things remains clear for everyone

Opening an issue for that.

gsouf commented 7 years ago

I'm closing the PR because it's done differently in #22 I took a few inputs from your PR, thanks for your contribution