ianmilligan1 opened this issue 8 years ago
...which in-turn uses https://github.com/edsu/unshrtn
We could incorporate that in. Or, create a method in warcbase that does the same thing, or maybe there is already a Java library that does unshortening that we could just pull in.
Do we have a file which has the mapping from short urls to the full URLs? If so, I can show you how to join in the data...
@lintool can you clarify what you mean by "a file that has the mapping from short urls to the full URLs"?
...or, is this what you're looking for? https://github.com/edsu/unshrtn/blob/master/unshrtn.coffee
A file that has:

```
http://t.co/pbFMYFZpQC http://foo.bar.com/
http://t.co/pg3SFzLc http://foo.bar.com/
...
```
Oh, https://github.com/edsu/twarc/blob/master/utils/unshorten.py#L37-L53 puts it back in the dataset with a new entry.
If I understand correctly what it's doing, that's absolutely terrible. That's the digital equivalent of going through a paper archive with a black magic marker, crossing out historical place names and replacing them with their modern names. Would you do that to a paper archive? No! So don't do it to a digital archive.
The correct way to do this is to have a separate file that has the mapping (per above), and join in the unshortened form during processing.
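To make the join-at-processing-time idea concrete, here is a minimal Python sketch. The function names and the whitespace-delimited `short long` file format are my own assumptions for illustration, not anything from warcbase or twarc:

```python
def load_mapping(path):
    """Read a whitespace-delimited 'short long' mapping file into a dict."""
    mapping = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                short, long_url = parts
                mapping[short] = long_url
    return mapping

def unshorten(url, mapping):
    """Return the unshortened URL if known; otherwise the original URL."""
    return mapping.get(url, url)
```

The archival data itself is never touched; the mapping is a sidecar file joined in only when an analysis needs the long form.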
EDIT: okay, it adds a new field in the JSON, which isn't as bad as I thought. The analogy would be to go through a paper archive, put a post-it note next to every instance of a historical place name, and on the post-it note write its modern name.
You don't do it on the preservation/master version of the dataset; you always `cat` it out to a new file (by default it is stdout). It only reads the preservation/master version of the dataset.
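That read-only pattern (master stays untouched, an augmented copy goes to stdout) might look like the following sketch. The `unshortened_url` field name and the tweet JSON shape are assumptions for illustration, not twarc's exact behavior:

```python
import json
import sys

def annotate(lines, mapping, out=sys.stdout):
    """Read tweet JSON lines, add an 'unshortened_url' field (assumed name)
    wherever the mapping knows the short URL, and write the augmented copy.
    The input is only read, never modified in place."""
    for line in lines:
        tweet = json.loads(line)
        for url in tweet.get("entities", {}).get("urls", []):
            short = url.get("url")
            if short in mapping:
                url["unshortened_url"] = mapping[short]
        out.write(json.dumps(tweet) + "\n")
```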
If that's the case, it's a waste of space. You still just want:

```
short long
short long
...
```
Would the output be:

```
short, count, long, count
http://t.co/pbFMYFZpQC, 12, http://foo.bar.com/, 123
```
You wouldn't even need the count. If you just had `short long`, you can process the original archival JSON and just join in the long form as needed.
Just re-opening this. Did we reach any agreement here?
Do we have a way to generate a file that has the following?

```
short-url full-url
short-url full-url
...
```
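One way to generate such a mapping file is to follow each short URL's HTTP redirects and record where it ends up. The sketch below uses only the standard library; the function names are my own, and real use would need rate limiting, retries, and caching (which is roughly what unshrtn provides):

```python
import urllib.request

def resolve(short_url, timeout=10):
    """Follow redirects with a HEAD request and return the final URL.
    On any failure, fall back to returning the input unchanged."""
    try:
        req = urllib.request.Request(short_url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.geturl()
    except Exception:
        return short_url

def write_mapping(short_urls, out):
    """Write a tab-separated 'short-url full-url' line per input URL."""
    for short in short_urls:
        out.write(f"{short}\t{resolve(short)}\n")
```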
Right now, our script for URL extraction is as follows:

By grabbing tweets from the `text` field we just get results like the above. This is not very useful – so what's the best path? In the past, @ruebot and I have used `unshorten.py` in twarc.