StevenMaude opened this issue 11 years ago
Something like this might be nice for resolving these URLs (or even just requesting the expanded URL from Twitter), provided the demands of doing so aren't excessive for the platform. (unshort.me's API limit is "100,000 requests per day", which should comfortably outpace the demands of the Twitter follower tool, provided we generate a unique API key for each scraper.) It is a little flaky, though; I found it gave no result at all for a minority of valid t.co URLs.
I'm also trying this resolver; it seems more reliable, but in some cases it returns "status":"InvalidURL" together with a resolved URL that actually works. Since genuinely broken links also give the InvalidURL status, you can't reject broken links at this stage.
Edit: testing on 1000+ URLs, it's much more reliable. There is a minor issue that the 'end_url' is sometimes only a trailing fragment of the resolved URL, e.g. Homepage.aspx, but this is easy to fix (see the sketch below). I'll be modifying my resolver repo to use this.
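For reference, a minimal sketch of how that fix might look, assuming the resolver returns JSON with `end_url` and `status` fields as described above; the endpoint URL and parameter name here are placeholders, not the service's documented API:

```python
import requests

# Hypothetical endpoint and parameter name; not the resolver's documented API.
RESOLVER_ENDPOINT = "http://example-resolver/api/expand"

def expand(short_url):
    """Expand short_url via the resolver, falling back to requests if needed."""
    data = requests.get(RESOLVER_ENDPOINT, params={"url": short_url}).json()
    end_url = data.get("end_url", "")
    # Occasionally end_url is just a trailing fragment, e.g. "Homepage.aspx".
    # One easy fix: fall back to following the redirects ourselves.
    if not end_url.startswith(("http://", "https://")):
        end_url = requests.get(short_url, timeout=10).url
    # Note: "status" can be "InvalidURL" even when end_url works, so it
    # can't be used on its own to reject broken links at this stage.
    return end_url, data.get("status")
```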
(Aidan provided me a list of a potential customer's Twitter followers obtained from the platform and I now need to unshorten the URLs myself.)
Isn't there a native Python unshortening module? I doubt it would be much code!
I tried various things this afternoon/evening: using unshort.me, using expandurl.me, using requests, and running cURL as a subprocess...
From memory,
```python
import requests

def unshorten(short_url):
    # requests follows redirects by default, so r.url is the final URL.
    r = requests.get(short_url)
    return r.url
```
does work to some extent¹, though IIRC (I've got a little muddled about what worked with which approach) it was falling over on sites that it couldn't reach. Sites might be down permanently, but could also be just temporarily down; it depends how thorough the lookup should be, really. I wanted all the URLs for what I was doing, but it probably doesn't matter too much if one or two slip through that may well be broken anyway. I'm not sure at what point Twitter might start blocking t.co requests, though.
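A minimal sketch of how the failure handling might look; the timeout value and the choice to return None on failure are my assumptions, not what I actually ran:

```python
import requests

def unshorten(short_url, timeout=10):
    """Follow redirects to the final URL; return None if the site can't be reached."""
    try:
        r = requests.get(short_url, timeout=timeout)
        return r.url
    except requests.RequestException:
        # Covers connection errors, timeouts, too many redirects, etc.
        # The site may be permanently gone or only temporarily down;
        # telling the two apart would need a retry at a later time.
        return None
```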
I did try this urlclean module too, but it didn't handle broken redirects/sites very gracefully either.
(Incidentally, another nice thing about the expandurl API I used in the end is that it follows the redirect chain through to the final URL; you can encounter URLs where someone has put, e.g., a bit.ly link in their profile, which then gets wrapped as t.co. It would be interesting to see what the requests call does in this case...)
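For what it's worth, requests also follows redirects all the way through by default and keeps the intermediate hops in `r.history`, so a chain like t.co → bit.ly → final site can be inspected; a quick sketch (the t.co URL here is a placeholder):

```python
import requests

# Placeholder short URL; substitute a real t.co link to try it.
r = requests.get("https://t.co/example")

# Each intermediate redirect (e.g. t.co, then bit.ly) appears in r.history,
# and r.url is the final destination after the whole chain.
for hop in r.history:
    print(hop.status_code, hop.url)
print("final:", r.url)
```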
¹ No, not much code!
My repo for this using expandurl is here.
I tried this with two users; in both cases, roughly 95% or more of the URLs were t.co links.
This renders the URL column fairly useless; presumably related to frabcus/twitter-search-tool#15.