returneksibir / yakala

a humble web crawler framework

Invalid extracted links #22

Closed (rimbi closed this issue 12 years ago)

rimbi commented 12 years ago

I encountered this error in the Idefix spider output: some of the extracted links are incorrect. Suppose the valid link is as follows:

http://idefix.com/....X4I

but the application somehow keeps it in lower case, as follows:

http://idefix.com/....x4i

Because of that, the link becomes invalid.

I'm not sure whether the link is already invalid on the page or whether we make it invalid by converting it to lower case. The latter is possible because links might be lowercased while checking for already visited links, so that the comparison does not fail on a case difference.
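
If the lowercasing does come from the visited-link check, one way to avoid the problem is to lowercase only a comparison key and keep the link exactly as it was extracted. The following is a minimal sketch under that assumption. `VisitedLinks` and the `example.com` URLs are hypothetical and are not taken from the yakala code base.

```python
class VisitedLinks:
    """Tracks visited links by a lowercased key while keeping each
    link's original spelling (hypothetical helper, not yakala code)."""

    def __init__(self):
        # Maps comparison key -> link as originally extracted.
        self._seen = {}

    def add(self, link):
        """Return True if the link is new, False if already visited."""
        key = link.lower()  # used only for comparison, never emitted
        if key in self._seen:
            return False
        self._seen[key] = link
        return True

    def links(self):
        """Return the visited links with their original case intact."""
        return list(self._seen.values())


visited = VisitedLinks()
visited.add("http://example.com/item/X4I")   # new link, stored as extracted
visited.add("http://example.com/item/x4i")   # treated as already visited
print(visited.links())
```

Lowercasing the whole link for comparison can still merge two pages that differ only in path case, so this sketch trades a few possibly missed pages for never emitting a broken link.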

rimbi commented 12 years ago

I know, I know. But there seems to be something wrong with the conversion to lower case. When I clicked the invalid links, I saw that there was no such page on the idefix.com site, whereas the original link, the one before the conversion, works fine.

On Fri, Oct 14, 2011 at 8:30 AM, Sinan Nalkaya <reply@reply.github.com> wrote:

Cemo, this doesn't make sense at all.

It doesn't matter whether domain names are written in lower case, upper case, or mixed case. They are all the same.
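
The point about domain names applies to the scheme and host only. The path and query of a URL are case-sensitive in general, which would explain why the lowercased link stops working. A minimal sketch of a normalization that respects this, using only Python's standard library, is shown below. The name `normalize_link` and the `example.com` URL are assumptions for illustration, not part of yakala.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_link(link):
    """Lowercase only the case-insensitive parts of a URL.

    The scheme and host are lowercased; path, query, and fragment are
    returned unchanged. Simplification: the whole netloc is lowercased,
    so any user:password@ prefix would be lowercased too.
    """
    parts = urlsplit(link)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        parts.query,
        parts.fragment,
    ))


print(normalize_link("HTTP://Example.COM/item/X4I"))
```

Using the normalized form as the visited-link key would make host case irrelevant to the comparison while leaving the part of the link that the server distinguishes untouched.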

On Thu, Oct 13, 2011 at 9:33 PM, Cem Eliguzel <reply@reply.github.com> wrote the original report, which appears in full as the first comment of this issue.

Reply to this email directly or view it on GitHub: https://github.com/returneksibir/yakala/issues/22