trivio / common_crawl_index

Index URLs in Common Crawl
reverse hostname transformation breaks urls with username:password@domain.com #19

Open keiw opened 11 years ago

keiw commented 11 years ago

Right now CC contains some URLs with a username:password prefix in the authority part. During index creation these are transformed by the function reversehost(), which treats the username as the host, so the resulting keys are wrong and the URLs are not searchable.

reversehost('http://123456:654321@www.lesbo101.com/') 
== '123456/:654321@www.lesbo101.com:http'

reversehost('http://Dennis:Reggie@www.sanafey.com/members/index.shtml') 
== 'Dennis/members/index.shtml:Reggie@www.sanafey.com:http'
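
One possible direction, sketched here under assumptions rather than taken from the project's code: parse the URL with the standard library so the `user:password@` part is separated from the host before reversing. The function name `reverse_host_key` is hypothetical, the key layout (reversed host, then path, then optional port, then scheme) is only inferred from the outputs above, and dropping the credentials from the key is a design choice the maintainers may not want.

```python
from urllib.parse import urlsplit


def reverse_host_key(url):
    """Hypothetical replacement for reversehost(); not the project's code.

    Assumed key layout, inferred from the examples in this issue:
    <reversed host><path>[:<port>]:<scheme>
    """
    parts = urlsplit(url)
    # .hostname excludes any "user:password@" prefix and any ":port" suffix,
    # and is lowercased by urlsplit.
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))

    path = parts.path
    if parts.query:
        path += "?" + parts.query

    key = reversed_host + path
    # Assumption: an explicit port goes between the path and the scheme.
    if parts.port is not None:
        key += ":" + str(parts.port)
    return key + ":" + parts.scheme


# Placeholder URL with credentials; the host is reversed, the userinfo is dropped.
print(reverse_host_key("http://user:secret@www.example.com/members/index.shtml"))
```

With this approach the credentials no longer leak into the host part of the key, so a lookup by domain would find the entry. Whether the userinfo should be preserved elsewhere in the key is left open.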

tfmorris commented 10 years ago

This is basically the same as #12