serp-spider / core

:spider: The PHP SERP Spider - A search engine scraper
https://serp-spider.github.io/
Other
89 stars 44 forks source link

Url normalization #15

Closed gsouf closed 7 years ago

gsouf commented 8 years ago

As talked in #14 it would be useful to have url normalization as described here: https://tools.ietf.org/html/rfc3986#section-6.2.2

The following resources look perfect for that:

@RubtsovAV You were talking about zend uri, but I told that using a full uri abstraction library had too many drawbacks and limitations for SERPS. Please check at sabre/uri, it does not provide a full uri implementation, instead it gives some tools to manipulate uri. Looks perfect for that.

RubtsovAV commented 8 years ago

I'm not sure understand you want. You have decided to replace the URL class?

gsouf commented 8 years ago

Not replacing, that's an addition to make its management easier

RubtsovAV commented 8 years ago

Or you want to copy of some parts from this class? The method removeDotSegment()? Or you want full normalize from that class?

gsouf commented 8 years ago

I will probably add it as a dependency

RubtsovAV commented 8 years ago

So, and what about the test suite?

gsouf commented 8 years ago

What's the relation ?

RubtsovAV commented 8 years ago

Will we need the test from PR https://github.com/serp-spider/core/pull/14 ?

gsouf commented 8 years ago

Yes tests from #14 will be needed to make sure everything is fine, but most of the implementation you proposed will be replaced by this

RubtsovAV commented 8 years ago

And we will need to the split it test on two: resolve and normalize, right?

gsouf commented 8 years ago

Yes, exactly!

gsouf commented 8 years ago

Well, after analysing how sabre/uri::normalize works, it will be lighter in term of performances to extract the code from it to integrate it in serps. The reason is sabre/uri::normalize parses the uri in an array of items, then it normalizes each part, then it build back the url into a string. SERPS actually only needs the intermediate part (normalizing) because everything is already split in parts, and we need to parse it again after, making double processing for the final output

gsouf commented 8 years ago

sabre/uri also removes .. and . when resolving uri

RubtsovAV commented 8 years ago

And I think be better if we remove the second argument of the resolve method and place it behavior in the method buildUrl. How you think?

gsouf commented 8 years ago

That's an other story, but no, because resolve is intensively used internally, and it's here for performance purposes

RubtsovAV commented 8 years ago

It's strange, because this not provide performance benefits. This method is now performs double work and it would be logical to bring the behavior to a more appropriate place for this.

RubtsovAV commented 8 years ago

Besides, you said it was no problem to change the behavior of the method, because library yet not released.

gsouf commented 8 years ago

I didnt place it by accident, believe me (and as i said, it's not the same topic as the issue's)

gsouf commented 7 years ago

Implemented in master for v0.2.0