joshua-hull / Reddit-Image-Scraper

Perl script to download images hosted at imgur.com linked from a subreddit at reddit.com

issue #11 - create subdirectories for each imgur album #20

Closed aggrolite closed 10 years ago

aggrolite commented 10 years ago

this change also includes a fix for album extraction as imgur's json structure changed just enough to break our code

this should satisfy issue #11

aggrolite commented 10 years ago

@joshua-hull I should add that the subdirectories I create are named after the imgur album's hash ID. For example, if we scrape imgur.com/a/abcdef from r/pics, the files from that album will be saved in the directory pics/abcdef
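A minimal Python sketch of that naming rule, for illustration only (it is not the project's Perl code). The helper name `album_dir` and the regular expression are assumptions; the only thing taken from the comment above is that an album at imgur.com/a/abcdef scraped from r/pics ends up under pics/abcdef.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern: captures the hash ID of an imgur album link.
ALBUM_RE = re.compile(r"imgur\.com/a/(\w+)")


def album_dir(subreddit, url):
    """Return the directory an image from `url` would be saved in."""
    match = ALBUM_RE.search(url)
    if match:
        # Album links get a subdirectory named after the album hash ID.
        return str(PurePosixPath(subreddit) / match.group(1))
    # Anything else is saved directly in the subreddit directory.
    return subreddit


print(album_dir("pics", "imgur.com/a/abcdef"))              # pics/abcdef
print(album_dir("pics", "http://i.imgur.com/qFHNLvn.jpg"))  # pics
```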

aggrolite commented 10 years ago

@joshua-hull I added another commit to this branch which is unrelated to issue #11. This change can be summed up in the commit message:

improve search query, clean up directory name

reddit supports a "site:" option which will restrict search
results to a specified domain (e.g. site:imgur.com). I have set up
the code so that we can easily add more domains (like i.minus.com),
and multiple domains are joined by " OR ".

example search query after we support more domains:
"hey reddit check out my dog (site:imgur.com OR site:i.minus.com)"

directory names have been cleaned up a bit before saving images.
I ran into the problem of searching for a string like:
"check out my dog/cat" and the script created a directory AND a
subdirectory. this change prevents that.
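
The two changes in that commit message can be sketched roughly as follows. This is a Python illustration, not the committed Perl. The function names, the set of characters treated as safe, and the "untitled" fallback are all assumptions. Only the " OR " join of "site:" filters and the "dog/cat" problem come from the message itself.

```python
import re


def build_query(terms, domains):
    """Append a parenthesised site: filter, one entry per domain."""
    site_filter = " OR ".join(f"site:{domain}" for domain in domains)
    return f"{terms} ({site_filter})"


def clean_dir_name(name):
    """Make a search string usable as a single directory name."""
    # Assumption: keep word characters, dashes, dots and spaces, and
    # replace every other run (including "/" and "\") with "_".
    cleaned = re.sub(r"[^\w\-. ]+", "_", name)
    # Strip surrounding spaces and dots so names like ".." cannot survive.
    return cleaned.strip(" .") or "untitled"


print(build_query("hey reddit check out my dog", ["imgur.com", "i.minus.com"]))
# hey reddit check out my dog (site:imgur.com OR site:i.minus.com)
print(clean_dir_name("check out my dog/cat"))
# check out my dog_cat
```

Keeping the domains in a list means supporting another host is a one-item change, which matches the intent described in the message.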

atommclain commented 10 years ago

I've played a bit with the initial pull request and I noticed that there are images being requested with a trailing "?1", i.e. image.jpg?1
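If a suffix like that does show up, one way to keep it out of the saved file name is to take only the path part of the URL. A small Python sketch, where the helper name `filename_from_url` and the example URL are assumptions and not anything in the script:

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of `url`, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


print(filename_from_url("http://i.imgur.com/qFHNLvn.jpg?1"))  # qFHNLvn.jpg
print(filename_from_url("http://i.imgur.com/qFHNLvn.jpg"))    # qFHNLvn.jpg
```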

Also, a thought; now that albums are supported (thanks :) ) I realized that albums with sequential images do not retain their proper order since the file names are based on the original image URL.

aggrolite commented 10 years ago

@atommclain I'm not seeing the extra "?1" on image urls. how did you run the script?

the order of the images saved from albums is the same if you sort by date-modified, i.e. ls -ltr pics/albumdir/. it's just ordered by name when doing a plain ls or viewing in a file explorer (if that's what is default). did I understand your statement right?

aggrolite commented 10 years ago

though it might be better to save the image with a more identifiable name. since the album directory is already unique, we could maybe just save the image names as 1.jpg, 2.jpg, 3.jpg, etc. that way they are in order when viewing in a file explorer and doing a plain ls
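
A rough Python sketch of that numbering idea, not the Perl that was eventually committed. The helper name `numbered_names` and the ".jpg" fallback for URLs without an extension are assumptions.

```python
import posixpath
from urllib.parse import urlsplit


def numbered_names(image_urls):
    """Name album images 1.ext, 2.ext, ... in the order they are given."""
    names = []
    for index, url in enumerate(image_urls, start=1):
        # Keep the original extension; fall back to .jpg (assumption).
        ext = posixpath.splitext(urlsplit(url).path)[1] or ".jpg"
        names.append(f"{index}{ext}")
    return names


print(numbered_names([
    "http://imgur.com/O3EYmhT.jpg",
    "http://imgur.com/glHO2Cx.jpg",
    "http://imgur.com/nSMBCFY.jpg",
]))
# ['1.jpg', '2.jpg', '3.jpg']
```

One caveat: a plain ls sorts names as text, so 10.jpg would list before 2.jpg. Zero-padding the number (for example 01.jpg) would avoid that for longer albums.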

aggrolite commented 10 years ago

@atommclain see what you think of the new commit I pushed. images downloaded from albums are now saved as a number rather than a hash ID, which keeps the correct order as they show on the site.

sample output:

$ ./Reddit_Image_Scraper pics
Extracted 1007 images from subreddit pics
Downloading http://i.imgur.com/qFHNLvn.jpg to pics/qFHNLvn.jpg
Downloading http://imgur.com/O3EYmhT.jpg to pics/OFnEZ/1.jpg
Downloading http://imgur.com/glHO2Cx.jpg to pics/OFnEZ/2.jpg
Downloading http://imgur.com/nSMBCFY.jpg to pics/OFnEZ/3.jpg
Downloading http://imgur.com/1inK2Mh.jpg to pics/OFnEZ/4.jpg
Downloading http://imgur.com/NXlN1oS.jpg to pics/OFnEZ/5.jpg
Downloading http://imgur.com/1oEfAi5.jpg to pics/OFnEZ/6.jpg
Downloading http://imgur.com/Sm8I2oN.jpg to pics/OFnEZ/7.jpg
Downloading http://imgur.com/NtgMhyU.jpg to pics/OFnEZ/8.jpg
Downloading http://imgur.com/2mQCsOf.jpg to pics/OFnEZ/9.jpg