scripting / feedBase

A project to get feeds into a base.
MIT License
33 stars 3 forks source link

Friday work #1

Open scripting opened 6 years ago

scripting commented 6 years ago

This project is currently in two incompatible pieces.

A database that contains subscriptions, and a UI that allows people to post OPML files to a server.

I started the latter project in 2016, but didn't feel it generated enough interest to continue development. However in 2018, I feel differently about it. I'll get into that some more as we go along.

I'm going to post questions to this project, and eventually post source code that implements the database, and possibly the UI.

scripting commented 6 years ago

Here's a screen shot of feedBase at this moment.

image

jystervinou commented 6 years ago

Are you using some kind of "lastfetched" date?

scripting commented 6 years ago

If I understand correctly -- yes, that's what is going on.

scripting commented 6 years ago

it should be whenupdated

scripting commented 6 years ago

The name of the database is feedbase.

There are two tables, feeds and subscriptions.

Here are the two commands that create the tables.

create table subscriptions (username varchar (255), listname varchar (255), feedurl varchar (512), whenupdated datetime, PRIMARY KEY (username, feedurl));

create table feeds (feedurl varchar (512), title varchar (255), htmlurl varchar (512), description varchar (512), whenupdated datetime, PRIMARY KEY (feedurl));

scripting commented 6 years ago

It's all comin along.

Here are the titles of the feeds currently in the database.

| title                                                     |
+-----------------------------------------------------------+
| 100 Proof                                                 |
| 10x14s @KenSmith                                          |
| The 6th Floor                                             |
| Andreessen Horowitz                                       |
| annethoughts                                              |
| Uploads from scriptingnews                                |
| Uploads from evekwiner                                    |
| Uploads from alan_sailer                                  |
| AwkwardFamilyPhotos.com                                   |
| The Baseline Scenario                                     |
| Bedford + Bowery                                          |
| Home Spun                                                 |
| BikeAthens                                                |
| Bike Snob NYC                                             |
| OPML Editor code updates                                  |
| Adam Curry's Weblog                                       |
| David Darias                                              |
| Digg Blog                                                 |
| easyDNS Blog                                              |
| Evernote Blog                                             |
| Flickr Blog                                               |
| Bootstrap Blog                                            |
| GlennLog                                                  |
| Jimmy Guterman's blog                                     |
| Stealthmode Blog                                          |
| The Old Reader: behind the scenes                         |
| SIN INC                                                   |
| Doc Searls Weblog                                         |
| John Palfrey                                              |
| Philip Greenspun's Weblog                                 |
| Casual Information                                        |
| Brendan Eich                                              |
| Built With Bootstrap                                      |
| camen design · blog                                       |
| cdixon blog                                               |
| Chao Lam                                                  |
| Citi Bike | Blog                                          |
| code4lib                                                  |
| Blog                                                      |
| Crimes of The Times                                       |
| Daniel Bachhuber                                          |
| Daring Fireball                                           |
| Dustin Curtis                                             |
| 0xDECAFBAD                                                |
| Dan Lewis                                                 |
| EdCone.com                                                |
| Embracing the Absurd                                      |
| Emily Bell(wether)                                        |
| Emma Winer                                                |
| Al Jazeera English                                        |
| Armed and Dangerous                                       |
| Eve's Peeves                                              |
| EV Grieve                                                 |
| Fargo Blog                                                |
| Dilbert Blog                                              |
| Dilbert Blog                                              |
| themodernscientist                                        |
| Bowery Boogie                                             |
| Engadget RSS Feed                                         |
| 24 ways                                                   |
| Rumproarious                                              |
| The Dish                                                  |
| AttentionMax                                              |
| Beat the Press                                            |
| Bike Hugger                                               |
| Bootswatch                                                |
| Both Sides of the Table - Medium                          |
| Brain Pickings                                            |
| Op-Eds & Columns                                          |
| Coding Horror                                             |
| best phone plans                                          |
| Curbed Hamptons -  All                                    |
| Cycling Uphill                                            |
| BooksChantcdCom                                           |
| Dan Reich                                                 |
| David Simon                                               |
| Design Shack                                              |
| Fast Company                                              |
| Fast Company                                              |
| Us Phone Plans                                            |
| Franklin Chen                                             |
| G Suite Updates Blog                                      |
| GregFalken.com                                            |
| Have I been pwned? latest breaches                        |
| HeadButler                                                |
| ILoggable                                                 |
| Improv Everywhere                                         |
| IPEVO Blog                                                |
| Domain Industry & Internet News                           |
| James Fallows | The Atlantic                              |
| Joi Ito's Web                                             |
| NY Mag                                                    |
| Joshua Best                                               |
| Marco.org                                                 |
| mattdiaz.org                                              |
| Mark Otto                                                 |
| MoJo Blogs and Articles | Mother Jones                    |
| Nathan Laan                                               |
| NYC Bike Maps                                             |
| Nieman Lab                                                |
| Nieman Journalism Lab: Fuego                              |
| Node.js Blog                                              |
| Nordquist Blog                                            |
| NYCwireless                                               |
| BooksChantcdCom                                           |
| On Deciding . . . Better 3.0                              |
| Patent – Patently-O                                       |
| Argan Oil                                                 |
| Political Wire                                            |
| Hit & Run                                                 |
| Pingdom Royal                                             |
| Ranjith Antony                                            |
| this is sippey.com                                        |
| Snook.ca                                                  |
| The Parlour at FG Press                                   |
| Sundance Institute | Digest                               |
| Editor's Blog                                             |
| The Atlantic                                              |
| The Brink of Chaos                                        |
| The Cloud                                                 |
| The Orange View                                           |
| Best Cell Phone Reviews                                   |
| Wirecutter: Reviews for the Real World                    |
| TorrentFreak                                              |
| Tweetage Wasteland                                        |
| Twitter Blog                                              |
| Seth Godin's Blog on marketing, tribes and respect        |
| Wir sprechen Online.                                      |
| WP Engine                                                 |
| kottke.org                                                |
| Real Lawyers                                              |
| MobileRead Forums                                         |
| Serial                                                    |
| Top News - MIT Technology Review                          |
| Thejesh GN                                                |
| This American Life                                        |
| On the Media                                              |
| Radiolab                                                  |
| The Brian Lehrer Show                                     |
| David Walsh Blog                                          |
| CentralPark.com RSS Feed                                  |
| ProgrammableWeb                                           |
| The Awl                                                   |
| The Future Buzz                                           |
| Central Park Events from CentralPark.com                  |
| FeaturesFeatures – FiveThirtyEight                        |
| FOSS Patents                                              |
| Marketplace Tech with Molly Wood                          |
| Garrick van Buren                                         |
| NULL                                                      |
| Not enough bits in this byte                              |
| George Lakoff                                             |
| GCZ Import HID Lights Mods                                |
| Latest Sandy Images and Videos (Updating Live)            |
| Glass                                                     |
| Global Voices                                             |
| Pressing Issues                                           |
| NBA.com | Hang Time Blog                                  |
| Hapgood                                                   |
| Harbin Hot Springs Blog                                   |
| Operating Partner, DFJ                                    |
| Hacker News Firehose                                      |
| Editor: Myself                                            |
| Holtz Communications + Technology | Blog                  |
| Northern Tier 2010                                        |
| Hugh Techno                                               |
| Brad Ideas                                                |
| inessential.com                                           |
| In These Times                                            |
| The Iron Yuppie                                           |
| Just Well Mixed                                           |
| Quote and Comment                                         |
| Newspaper Heroes on the Air                               |
| Jillian C. York                                           |
| Joe Hewitt                                                |
| Media, disrupted                                          |
| Jon Eisen's Blog                                          |
| A Blog by Kashif                                          |
| AllThingsD » Kara Swisher                                 |
| Baseball Nerd                                             |
| Laughing Squid                                            |
| In Memory of Leon Winer                                   |
| Dave Winer                                                |
| Little Card Editor                                        |
| Circa                                                     |
| 140-char Dave ????????                                    |
| My First Outline                                          |
| Buzz Andersen                                             |
| davewiner's RSS Feed                                      |
| Slate Daily Feed                                          |
| Mediagazer                                                |
| Memex 1.1                                                 |
| The Millikan Daily                                        |
| Mini-Microsoft                                            |
| MRA Dilbert                                               |
| NULL                                                      |
| Dave Winer                                                |
| Ralph Nader                                               |
| Neither more nor less                                     |
| Search Results for “wikileaks” – NewsFeed                 |
| The ephemera of Nicolas Gallagher                         |
| Curbed NY -  All                                          |
| The Rancid Honeytrap                                      |
| Olivier Lafleur                                           |
| Om Malik                                                  |
| One Thing Well                                            |
| Times Open - Medium                                       |
| owensd.io                                                 |
| Pee-wee's blog                                            |
| Penny Red                                                 |
| The World Outline Podcast                                 |
| PressThink                                                |
| Quartz                                                    |
| John Biesnecker                                           |
| Circa                                                     |
| Dave Winer                                                |
| Melody Joy Kramer                                         |
| Release Candidate One                                     |
| River2.org                                                |
| River4                                                    |
| Hakkarainen Clipping Service                              |
| Roger and Mike's Hypernet Blog                            |
| Forward Thinking                                          |
| My New Blog                                               |
| Robert Scoble's Augment Your Life                         |
| Scripting News                                            |
| Scripting News                                            |
| Setheag                                                   |
| silencematters                                            |
| Source to Nuts                                            |
| www.espn.com - NBA                                        |
| Active questions tagged rss - Stack Overflow              |
| Deeplinks                                                 |
| aegrumet's linkblog feed                                  |
| Dave Winer's linkblog feed                                |
| jasonpontin's linkblog feed                               |
| karin's radio2 feed                                       |
| karlmay's linkblog feed                                   |
| Rex Hammock's LinkBlog                                    |
| Worth Reading | Links from Jeremy Zilar                   |
| Vincent's Microbe News                                    |
| My test feed on Argentina                                 |
| Anton's sugarcubes                                        |
| Amazon Elastic Compute Cloud (N. Virginia) Service Status |
| Bob Stepno's Other Journalism                             |
| Stratechery by Ben Thompson                               |
+-----------------------------------------------------------+
jystervinou commented 6 years ago

What is whenupdated in the subscriptionstable, for?

scripting commented 6 years ago

Fair question.

Suppose you upload your OPML file on Tuesday.

Of course I read the file and add all the subscriptions to the database.

Then on Friday you re-upload it. Presumably that will cause me to re-add the subscription.

Not sure why it matters that it was updated, but technically it was.

jystervinou commented 6 years ago

OK, so that's when the subscription itself is updated (per user), and whenupdated in feeds is when the feed has last been http fetched by the crawler?

Do you need to know when the subscription has been created for the first time too?

scripting commented 6 years ago

Okay, let's table the date stuff for now, and move toward the first goal, a hotlist, just like the one from bingeworthy. I have this query, and it works.

SELECT feedurl, COUNT(feedurl) AS countSubs FROM subscriptions GROUP BY feedurl ORDER BY countSubs DESC LIMIT 100;

It gives me the top 100 list, but in a fairly useless form.

Here's a screen shot of the first few elements. Note that Brent's feed is first. We have to fix that of course. ;-)

image

Now I have this very nice table called feeds, indexed on feedurl, just like subscriptions, where I can get the title, htmlurl and description that will make the table nice and human readable.

The question I have is very basic. How do I query both tables in one command. I gather this is a JOIN of some kind.

scripting commented 6 years ago

BTW, here's a live version of that query --

http://rockaway.opml.org:1405/hotlist

Dave

scripting commented 6 years ago

BTW, JY -- if you can give me an idea of what to search for -- I'd love to actually learn how this works, rather than just by trial and error. There must be many good examples, this is what SQL is for, right??

jystervinou commented 6 years ago

Just checking in quickly, will come back later. Maybe this will help.

scripting commented 6 years ago

Thanks JY. I'm going to post this as a query to the braintrust on the Scripting News repo. Look for a pointer on the blog. ;-)

On Sat, Jan 27, 2018 at 11:13 AM, Jean-Yves Stervinou < notifications@github.com> wrote:

Just checking in quickly, will come back later. Maybe this https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/ will help.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/1#issuecomment-360995086, or mute the thread https://github.com/notifications/unsubscribe-auth/ABm9Ow5FQu6YJ-KEZ2VxU5W2iN0t9VM_ks5tO0s-gaJpZM4RuamY .

jystervinou commented 6 years ago
SELECT
  subscriptions.feedurl,
  feeds.title,
  COUNT(subscriptions.feedurl) AS countSubs
FROM subscriptions, feeds
WHERE subscriptions.feedurl = feeds.feedurl
GROUP BY feedurl
ORDER BY countSubs
DESC LIMIT 100;
scripting commented 6 years ago

Hey it worked!!

That is beautiful.

jystervinou commented 6 years ago

If we are searching only on the feedurl part of the subscriptions table primary key (username, feedurl), you must use, i think: PRIMARY KEY (feedurl, username) instead of PRIMARY KEY (username, feedurl). The order is important.

scripting commented 6 years ago

Not sure what that last comment means. But the report I'm looking at here is very close to what I want. I also want the htmlurl, but I expect I just have to add that to the query.

scripting commented 6 years ago

Here's the first few lines of the report.

| feedurl                                                | title                                  | countSubs |
+--------------------------------------------------------+----------------------------------------+-----------+
| http://inessential.com/xml/rss.xml                     | inessential.com                        |         9 |
| http://scripting.com/rss.xml                           | Scripting News                         |         8 |
| http://feeds.kottke.org/main                           | kottke.org                             |         7 |
| http://feeds.wnyc.org/radiolab                         | Radiolab                               |         6 |
| http://feeds.99percentinvisible.org/99percentinvisible | 99% Invisible                          |         6 |
| http://fivethirtyeight.com/features/feed/              | FeaturesFeatures – FiveThirtyEight     |         5 |
| http://www.npr.org/rss/podcast.php?id=510289           | Planet Money                           |         5 |
| http://jvns.ca/atom.xml                                | Julia Evans                            |         5 |
| http://feeds.gimletmedia.com/hearreplyall              | Reply All                              |         5 |
| http://daringfireball.net/feeds/main                   | Daring Fireball                        |         5 |
| http://www.asymco.com/feed/                            | Asymco                                 |         5 |
| http://www.marco.org/rss                               | Marco.org                              |         5 |
| http://daringfireball.net/index.xml                    | Daring Fireball                        |         4 |
| http://onethingwell.org/rss                            | One Thing Well                         |         4 |
| http://uncrunched.com/feed/                            | Uncrunched                             |         4 |
| http://www.theverge.com/rss/index.xml                  | The Verge -  All Posts                 |         4 |
jystervinou commented 6 years ago

@scripting if you define an index on two fields (a,b), you will have two indexes available : (a,b), but also (a).

As you are querying only on feedurl (not username for now), you must define the primary key as (feedurl, username) otherwise MySQL will do a full scan of the table, not using any index.

makes sense?

jystervinou commented 6 years ago

Kick Brent off this table like Apple News, come on ! ;-)

jystervinou commented 6 years ago

you can add EXPLAIN just before the command in any MySQL client, to see how MySQL is using the index (or not)

EXPLAIN SELECT etc...

scripting commented 6 years ago

You're recommending that I change the way the table is created to have the primary key be

(username, feedurl)

So that it will be possible to get back all the subscriptions for an individual user quickly.

Yes??

scripting commented 6 years ago

However I think I did just that:

create table subscriptions (username varchar (255), listname varchar (255), feedurl varchar (512), whenupdated datetime, PRIMARY KEY (username, feedurl));

jystervinou commented 6 years ago

No, the opposite : (feedurl, username)

scripting commented 6 years ago

OK, I'll change it and repopulate.

jystervinou commented 6 years ago

otherwise, there won't be any index on feedurl alone.

jystervinou commented 6 years ago

you can drop and recreate the index. (Sequel Pro is your future best friend ;-) )

scripting commented 6 years ago

Let's switch over to the Saturday thread. ;-)

On Sat, Jan 27, 2018 at 1:10 PM, Jean-Yves Stervinou < notifications@github.com> wrote:

you can drop and recreate the index. (Sequel Pro http://www.sequelpro.com is your future best friend ;-) )

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/1#issuecomment-361003125, or mute the thread https://github.com/notifications/unsubscribe-auth/ABm9O0OWTORndf82Y4Am80XJvGYwUhM-ks5tO2akgaJpZM4RuamY .