mkiser / wtfsource


Cache copies of linked resources #3

Closed by StevenBlack 7 years ago

StevenBlack commented 7 years ago

Great work so far, Matt!

A thought, since I've seen this movie before.

Fast-forward a few months, years perhaps, and key linked resources vanish because link rot happens.

This is frustrating and inevitable. Moreover, bum steers greatly diminish the body of work, which, in this case, has historical interest.

Have you thought about creating a text-version cache of most, if not all, of the articles and tweets linked here? This could be automated. Volunteers could help with that.

mkiser commented 7 years ago

Yep! That's exactly what I'm planning to do: a full API of each day's update with the associated links, plus a static copy of each link's text and metadata and a screenshot of the page at the time.
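
For illustration, a minimal sketch of what one day's record could look like under that plan; the JSON layout and field names are assumptions, not anything settled in this thread:

```python
# Illustrative sketch only: one day's update as a JSON record.
# Field names (date, links, cached_text, screenshot) are assumptions.
import json

day_record = {
    "date": "2017-01-25",
    "links": [
        {
            "url": "https://www.example.com/some-article",             # original source (placeholder)
            "title": "Example article title",
            "cached_text": "archive/2017-01-25/example-article.txt",   # static copy of the text/metadata
            "screenshot": "archive/2017-01-25/example-article.png",    # page capture at publish time
        }
    ],
}

print(json.dumps(day_record, indent=2))
```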

I'll put out a call for help in a future newsletter.

StevenBlack commented 7 years ago

Great.

Just spit-balling ideas here.

Instapaper has an API and it has a great UI in desktop, iOS, and Droid scenarios. It has a wide range of bookmarklets and good presence in share sheets. It also does a damn good job of extracting readable text from articles by most (all?) publishers.
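
For what it's worth, a rough sketch of pushing a link into Instapaper through its simple add endpoint; the endpoint and parameter names are from memory of the public docs, so verify them before building on this:

```python
# Rough sketch: save a link via Instapaper's simple "add" endpoint.
# Endpoint and parameter names are assumptions to check against Instapaper's docs.
import requests

def save_to_instapaper(url, username, password, title=None):
    params = {"username": username, "password": password, "url": url}
    if title:
        params["title"] = title
    resp = requests.post("https://www.instapaper.com/api/add", data=params)
    resp.raise_for_status()  # a 201 Created is the expected success response
    return resp.status_code

# Example (placeholder credentials):
# save_to_instapaper("https://www.example.com/article", "user@example.com", "secret")
```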

I mention this because process matters. I see two options here:

  1. Write first, then capture sources from the links that end up in the text; or
  2. Capture madly, then distill and write at leisure.

I suggest option 2, primarily because it optimizes harvesting and permits taking a day off once in a while without fear that something flammable will disappear before you've had a chance to process it.

mkiser commented 7 years ago

I work at a company that does microservices as a service. Shouldn't be much of a problem to use something like DiffBot to grab structured data from the post and add it to a JSON object as part of the Travis CI build process.
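
A rough sketch of what that build step might look like, assuming Diffbot's Article endpoint and response layout (both should be checked against Diffbot's docs); the output path and token variable are placeholders:

```python
# Sketch of a build-step helper: pull structured article data from Diffbot
# and append it to a JSON file. Endpoint, parameters, and response shape are
# assumptions based on Diffbot's public Article API; verify before use.
import json
import os
import requests

def capture_article(url, out_path="archive/links.json"):
    resp = requests.get(
        "https://api.diffbot.com/v3/article",
        params={"token": os.environ["DIFFBOT_TOKEN"], "url": url},  # DIFFBOT_TOKEN is a placeholder env var
    )
    resp.raise_for_status()
    article = resp.json()["objects"][0]  # assumed response layout

    records = []
    if os.path.exists(out_path):
        with open(out_path) as f:
            records = json.load(f)
    records.append({"url": url, "title": article.get("title"), "text": article.get("text")})
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)
```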

Good call on addressing this sooner rather than later.

StevenBlack commented 7 years ago

I did something like this between 2004 and 2008. Now, nine years later, all the newspaper articles are unreachable, and my city changed its website and its underlying document management system, so many links are dead. Saving a copy-and-paste of each article was smart, but I may never retrofit all those dead links because, man, that's a couple of thousand text files.

To be totally professional, I would say: dish out links to sources and, when those sources go dark, revert to cached copies. The key is to data-drive this to the max.

A service like Instapaper gives each capture an addressable ID. Carry that ID around and reverting to cache will be easy.
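
One way to data-drive that, sketched under the assumption that each link record carries its original URL, its capture ID, its cached path, and the source domain (all field names illustrative):

```python
# Illustrative sketch: resolve a link record to its live URL, falling back to
# the cached capture when the source has gone dark. Field names are assumptions.
import requests

def resolve(record):
    """Return the original URL if it still responds, else the cached capture."""
    try:
        resp = requests.head(record["url"], allow_redirects=True, timeout=10)
        if resp.status_code < 400:
            return record["url"]
    except requests.RequestException:
        pass
    # Fall back to the stored capture (a local text copy, or the Instapaper ID
    # carried in the record, whichever the archive keeps).
    return record["cache_path"]

# Example record (placeholder values):
# {"url": "https://www.washingtonpost.com/example", "cache_path": "archive/wapo-0001.txt",
#  "instapaper_id": "123456789", "domain": "washingtonpost.com"}
```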

Nice to know: link rot tends to happen systematically. For example, if the Washington Post someday changes systems or its archive access policy, you can remediate hundreds of references at once because you have the data.

mkiser commented 7 years ago

repo moved: https://github.com/mkiser/WTFJHT