sytelus / HackerNewsData

All stories and comments posted on Hacker News up to May 29, 2014
http://shitalshah.com/p/downloading-all-of-hacker-news-posts-and-comments/

641,071 IDs unaccounted for #4

Open dw opened 10 years ago

dw commented 10 years ago

Hi there,

Per my Reddit comment at http://uk.reddit.com/r/datasets/comments/26xqgs/downloading_all_of_hacker_news_posts_and_comments/ , there are 641k IDs that don't appear anywhere.

It looks like either your crawler or Algolia doesn't have a complete data set.

I manually checked some of the missing IDs. Some lead to deleted posts, but the vast majority appear to lead to legitimate comments/links.

If it's a problem with your script, I guess that is easiest to fix. If it is a problem with Algolia, then I guess we're out of luck :(
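For reference, a rough sketch of how a gap count like this could be computed, assuming HN item IDs are sequential integers starting at 1 and that every ID found in the dump has been collected into one iterable. The names `find_missing_ids`, `seen_ids` and `max_id` are made up for illustration and are not part of this repo.

```python
def find_missing_ids(seen_ids, max_id):
    """Return the sorted IDs in 1..max_id that never appear in seen_ids.

    seen_ids: iterable of integer item IDs found in the dump (assumed input).
    max_id:   highest item ID expected to exist at crawl time (assumed input).
    """
    seen = set(seen_ids)  # set lookup keeps the scan linear in max_id
    return [item_id for item_id in range(1, max_id + 1) if item_id not in seen]


if __name__ == "__main__":
    # Toy data only; the real dump has millions of IDs.
    print(find_missing_ids([1, 2, 4, 7], 7))
```

The length of the returned list is the "unaccounted for" count, and the list itself gives the IDs to spot-check by hand.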

dw commented 10 years ago

Examples:

https://news.ycombinator.com/item?id=90945
https://news.ycombinator.com/item?id=798357
https://news.ycombinator.com/item?id=777894
https://news.ycombinator.com/item?id=836304
https://news.ycombinator.com/item?id=992122

sytelus commented 10 years ago

Some of these may be missing from Algolia itself. Here is one example I got a chance to look at:

https://news.ycombinator.com/item?id=90945

I converted the story's post time to a Unix epoch timestamp (there are plenty of sites that will subtract days from a date and convert the result to a Unix timestamp), then ran the following API query to get stories by that author posted before this time:

https://hn.algolia.com/api/v1/search_by_date?tags=story,author_felipe&hitsPerPage=10&numericFilters=created_at_i%3C1199080820

The result does not include the above story, which indicates that Algolia itself doesn't have it.
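The same check can be done without a conversion site. Below is a minimal sketch that builds the query above from a UTC date; the helper names `to_epoch` and `search_by_date_url` are hypothetical, and only the endpoint and parameters already shown in the URL above are used.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

ALGOLIA_SEARCH_BY_DATE = "https://hn.algolia.com/api/v1/search_by_date"


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Convert a UTC date/time to a Unix timestamp in whole seconds."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


def search_by_date_url(author, before_epoch, hits_per_page=10):
    """Build a URL asking for stories by `author` created before `before_epoch`."""
    params = {
        "tags": f"story,author_{author}",
        "hitsPerPage": hits_per_page,
        "numericFilters": f"created_at_i<{before_epoch}",
    }
    # safe="," leaves the comma in `tags` unescaped; "<" is encoded as %3C.
    return ALGOLIA_SEARCH_BY_DATE + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    cutoff = to_epoch(2007, 12, 31, 6, 0, 20)
    print(cutoff)
    print(search_by_date_url("felipe", cutoff))
```

The timestamp 1199080820 in the query corresponds to 2007-12-31 06:00:20 UTC, so any story by that author posted before then should appear in the hits if Algolia has indexed it.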

I've filed an issue at Algolia's hn-search repo: https://github.com/algolia/hn-search/issues/33

dw commented 10 years ago

Looks like these IDs are available via the new HN API: https://github.com/HackerNews/API
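A minimal sketch of fetching one of the missing IDs from that API, assuming the item endpoint described in the linked repo (`https://hacker-news.firebaseio.com/v0/item/<id>.json`). The helper names `item_url` and `fetch_item` are made up for illustration, and there is no retry or rate limiting here.

```python
import json
from urllib.request import urlopen

# Item endpoint as documented in the HackerNews/API repo (assumed unchanged).
HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def item_url(item_id):
    """Return the HN API URL for a single item ID."""
    return HN_ITEM_URL.format(id=int(item_id))


def fetch_item(item_id, timeout=10):
    """Fetch one item and return the decoded JSON (None if the API returns null)."""
    with urlopen(item_url(item_id), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # 90945 is one of the example IDs listed earlier in this thread.
    print(fetch_item(90945))
```

Looping `fetch_item` over the list of missing IDs would be one way to backfill the dump, though for 641k IDs some throttling and error handling would be needed.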