oyiptong / up-headliner

Headliner is a JSON API that returns personalized content obtained from providers
Mozilla Public License 2.0

Randomize articles instead of sorting them by interest_rank,time #10

Closed Mardak closed 10 years ago

Mardak commented 10 years ago

Instead of just fetching articles and concatenating them by interest, we'll want to make sure they're sorted by time overall.

This is as opposed to the original suggestion Mardak/profile/issues/23 to randomize.

Mardak commented 10 years ago

It seems that the data store doesn't quite store/return things by most recent. I think it's just that it happens to get more recent items inserted before the others. For example currently:

https://headliner.mozillalabs.com/nytimes/mostpopular/Technology.json

Shows "url": "http://www.nytimes.com/2014/01/23/technology/personaltech/review-the-roomba-880-from-irobot.html?src=recmoz" then "url": "http://www.nytimes.com/2014/01/30/technology/personaltech/on-facebook-deciding-who-knows-youre-a-dog.html?src=recmoz"

And there's no explicit date/time field to sort by.
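The URLs quoted above do carry a date in their path (`/2014/01/23/`), so one possible workaround is to derive a sort key from the URL itself. The following is only a sketch under that assumption: `date_from_url` and `sort_by_url_date` are hypothetical helper names, not part of Headliner, and the date in the path is assumed to approximate the publication date.

```python
import re
from datetime import date
from typing import Optional

# Assumption: article URLs embed a date as /YYYY/MM/DD/ in the path.
_DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def date_from_url(url: str) -> Optional[date]:
    """Return the date embedded in the URL path, or None if there is none."""
    match = _DATE_IN_PATH.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a valid calendar date.
        return None


def sort_by_url_date(articles):
    """Sort article dicts newest first; articles without a date go last."""
    return sorted(
        articles,
        key=lambda article: date_from_url(article["url"]) or date.min,
        reverse=True,
    )
```

Applied to the two Technology URLs above, this would place the 2014/01/30 article ahead of the 2014/01/23 one.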

mzhilyaev commented 10 years ago

Here's an example of sub-optimal time ordering.

interests: '{"Programming":0.25,"Sports":0.25,"Autos":0.25,"Arts":1}'

List of suggested urls:

2014-01-30/technology/personaltech/on-facebook-deciding-who-knows-youre-a-dog.html
2014-01-27/sports/committing-to-play-for-a-college-then-starting-9th-grade.html
2014-01-27/automobiles/makers-pack-new-cars-with-technology-but-younger-buyers-shrug.html
2014-01-26/automobiles/autoreviews/the-ecstasy-of-excess-the-agony-of-the-sticker.html
2014-01-23/technology/personaltech/review-the-roomba-880-from-irobot.html
2013-12-05/sports/baseball/three-rings-erase-sting-of-losing-ellsbury.html
2013-11-28/arts/saul-leiter-photographer-with-a-palette-for-new-york-dies-at-89.html
2013-11-20/arts/monty-python-troupe-to-reunite-for-live-shows.html
2013-11-19/arts/syd-field-author-of-the-definitive-work-on-writing-screenplays-is-dead-at-77.html
2013-11-19/arts/barbara-park-author-of-junie-b-jones-series-dies-at-66.html

Note that all the Arts articles are pushed to the bottom, even though Arts has the highest interest weight.

This use case forced us to move back to randomization of interests.
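One way to picture randomization across interests is a weighted random interleave: repeatedly pick an interest at random, biased by its weight, and take that interest's next article. This is only an illustrative sketch, not the code used in Headliner; `interleave_by_interest` and its arguments are hypothetical names, and the per-interest lists are assumed to already be in the order the store returns them.

```python
import random


def interleave_by_interest(articles_by_interest, weights, rng=None):
    """Merge per-interest article lists in weighted random order.

    Within each interest the original order is preserved; only the choice
    of which interest supplies the next article is randomized.
    """
    if rng is None:
        rng = random.Random()

    # Copy the non-empty lists so the caller's data is left untouched.
    queues = {
        name: list(items)
        for name, items in articles_by_interest.items()
        if items
    }

    result = []
    while queues:
        names = list(queues)
        # Keep every weight strictly positive so random.choices accepts it.
        biases = [max(weights.get(name, 0.0), 1e-9) for name in names]
        chosen = rng.choices(names, weights=biases, k=1)[0]
        result.append(queues[chosen].pop(0))
        if not queues[chosen]:
            del queues[chosen]
    return result
```

With the weights from the example above (Arts at 1, the others at 0.25), Arts articles would tend to surface early instead of sinking to the bottom, although any single run can still differ.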

oyiptong commented 10 years ago

We can make sure the articles are sorted by date. I don't understand why we need to randomize.

The articles are stored by receipt time: https://github.com/oyiptong/up-headliner/blob/master/up/headliner/data.py#L34

This is to mimic the behavior that occurs when articles are received.

The reasoning is that news items that make the "most popular" list are those that are gaining momentum. We are returning results by their relevance to popular opinion, not by the time the article was published.

e.g. a moving profile written about Bill Clinton in 1998 suddenly comes to light.

oyiptong commented 10 years ago

That's how the "Most Popular" API works. The newest articles to make the list are not necessarily the most recently published ones.