trivio / common_crawl_index

Index URLs in Common Crawl

Docs should mention that URLs are stored in reverse hostname order. #5

Open srobertson opened 11 years ago

keiw commented 11 years ago

Speaking of reverse hostname order and the reversehost() function, how should a key like ua.com.book-hunter.www/book/view/231/page:16:http be interpreted?

Precisely, what does 16 mean here?

I ask because

reversehost('http://www.book-hunter.com.ua/book/view/231/page:16') 
== reversehost('http://www.book-hunter.com.ua:16/book/view/231/page')

which makes the reverse hostname order ambiguous and means reversehost() is not fully invertible back to the actual URL. Please correct me if I'm wrong or missing something obvious here.
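
To make the collision concrete, here is a minimal sketch. It assumes the key is built as reversed hostname, then path, then an optional `:port`, then `:scheme`. That layout is inferred only from the example key above, not taken from the project's code, and `reversehost_sketch` is a hypothetical stand-in for the real reversehost().

```python
from urllib.parse import urlsplit


def reversehost_sketch(url):
    """Hypothetical stand-in for reversehost(); the key layout is an assumption."""
    parts = urlsplit(url)
    # Reverse the dot-separated host labels: www.example.com -> com.example.www
    host = ".".join(reversed(parts.hostname.split(".")))
    key = host + parts.path
    # Assumed: a port, when present, is appended after the path as ":port"
    if parts.port is not None:
        key += ":" + str(parts.port)
    # Assumed: the scheme is appended last as ":scheme"
    return key + ":" + parts.scheme


a = reversehost_sketch("http://www.book-hunter.com.ua/book/view/231/page:16")
b = reversehost_sketch("http://www.book-hunter.com.ua:16/book/view/231/page")
print(a)
print(a == b)
```

Under that assumed layout, both URLs map to the same key, so a `:16` before the scheme could be either a port or the tail of the path.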

So should we file a separate issue for that? AFAICT, being able to recover an unambiguous URL makes perfect sense here and would be expected.

keiw commented 11 years ago

I filed a separate issue: #12.