joshua-hull / Reddit-Image-Scraper

Perl script to download images hosted at imgur.com linked from a subreddit at reddit.com
25 stars 8 forks

scrape videos? #24

Open aggrolite opened 10 years ago

aggrolite commented 10 years ago

not sure if this interests you @joshua-hull, but have you considered scraping videos on reddit? this could be an option the user defines when executing the script (like --videos). I was going to at least make a fork of my own with this change, so let me know what you think.

joshua-hull commented 10 years ago

This sounds like a really good idea. My only question is where they would be scraped from. If it's a simple .mpg link or the like then it should be simple: just regex the common extensions. Let me know what you were thinking.
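A minimal sketch of that extension check (the helper name and the exact extension list are my assumptions, not anything from the script):

```perl
use strict;
use warnings;

# Hypothetical helper: treat a URL as a direct video link if it ends
# in a common video extension, optionally followed by a query string.
sub is_video_url {
    my ($url) = @_;
    return $url =~ m{\.(?:mpg|mpeg|mp4|webm|mov|avi|flv)(?:\?|$)}i ? 1 : 0;
}
```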

aggrolite commented 10 years ago

so the big sites we could support might include youtube, vimeo, and liveleak. just glancing at the source of a liveleak video, I do see a .mp4 link which we could look for and save. IIRC it may be harder to get the video source from youtube, but there might be a CPAN module out there that could help us. But if the video source is in the HTML, we can parse Mechanize's content:

my $content = $mech->content;
my ($video) = $content =~ m!\b(http://.+?\.mp4)\b!i;

though it might be better to use a library like HTML::TreeBuilder::XPath to extract the links:

use HTML::TreeBuilder::XPath;
my $content = $mech->content;
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my ($video) = map { m!\b(http://.+?\.mp4)\b!i } $tree->findvalues('.//script/text()');

then once we have the mp4 (or any other format) link, we could just save the file like we do an image
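That save step could look roughly like this; `video_filename` is a hypothetical helper, and the `$mech->get` call at the end assumes the script's existing WWW::Mechanize object (`:content_file` is LWP's standard way to stream a response straight to disk):

```perl
use strict;
use warnings;

# Hypothetical helper: take the last path segment of the URL as the
# local filename, dropping any query string or fragment.
sub video_filename {
    my ($url) = @_;
    my ($name) = $url =~ m{/([^/?#]+)(?:[?#]|$)};
    return $name;
}

# Saving would then mirror the image path (sketch, not run here):
# $mech->get( $video_url, ':content_file' => video_filename($video_url) );
```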

aggrolite commented 10 years ago

there would be a benefit to using XPath too if we plan to extract anything else from the HTML. it just makes things a lot easier. using regex on HTML can get pretty ugly IMO

aggrolite commented 10 years ago

oh, and XPath might be an easy way to detect things like bad subreddits or when we hit the "Ow!" page too
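That detection might look something like the sketch below. The "Ow!" page markup here is a guess (inlined as a string where the script would use `$mech->content`), and it assumes HTML::TreeBuilder::XPath is installed; the real XPath would need to be confirmed against an actual error response:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# In the script this would be $mech->content; inlined for the sketch.
my $html = '<html><body><h1>Ow!</h1></body></html>';
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Guess: the error page announces itself in an <h1>.
my ($heading) = $tree->findvalues('//h1');
my $is_error  = (defined $heading && $heading =~ /\bow!/i) ? 1 : 0;
```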

joshua-hull commented 10 years ago

I agree with keeping it generic with XPath. This will help with extracting the info from any pages in the future. So in the example above would $video be the URL of the video?

aggrolite commented 10 years ago

yep. maybe $video_url would be a better name. that could probably work for liveleak. vimeo and youtube might be different, though. we would need to check whether they offer multiple links for different resolutions. so we'd have to decide if we want HD or not, or maybe let the user decide with a shell option
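That shell option could be wired up with Getopt::Long, which ships with Perl; the `--videos` and `--quality` option names are hypothetical, and the demo argument list stands in for `\@ARGV`:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse a demo argument list; in the script this would be \@ARGV.
my @args = ('--videos', '--quality', 'hd');
my %opt  = ( videos => 0, quality => 'sd' );   # default: standard def

GetOptionsFromArray(
    \@args,
    'videos'    => \$opt{videos},    # boolean flag
    'quality=s' => \$opt{quality},   # string value, e.g. 'sd' or 'hd'
) or die "bad options\n";
```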

joshua-hull commented 10 years ago

I would check out https://github.com/rg3/youtube-dl for hints as to the structures in various pages.

aggrolite commented 10 years ago

cool, I also found WWW::YouTube::Download on cpan. it lets you pass a user agent too, so we could pass it mech's UA

aggrolite commented 10 years ago

one review says it's better maintained than youtube-dl, but I've never used either: http://cpanratings.perl.org/dist/WWW-YouTube-Download

joshua-hull commented 10 years ago

Ya, I saw that, but I didn't want to use it in order to stay generic. youtube-dl should at least show how to extract the URL for the video file. I use it with YouTube and it stays up to date with that site at least. Plus it covers more sites, so it provides more info.

aggrolite commented 10 years ago

i wouldn't worry too much about writing one generic method to extract videos. one thing to think about is how we're going to decide if a link is a video or not. i think posts are typed as self or link. if that's the case, we'd have to just check the domain, which is site-specific code to begin with. on top of that, these sites do change occasionally, and if one site changes but the others don't, we'd have to fix up the generic code just for one site while making sure the other sites still work.

though if we could come up with one method to scrape the videos, it might be preferable as long as the code doesn't look too hairy. though the only really popular video domains that show up on reddit are youtube, liveleak and vimeo anyway.

another thing to think about might be reorganizing the script into a module or a modulino before adding this feature. that way if we do have to add lots of video-related code, we could put it all in a new class just to keep it organized (including other code added in the future)
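The modulino idea boils down to a small structural change; the package name below is a placeholder, not a proposal:

```perl
package Reddit::Image::Scraper;   # placeholder name
use strict;
use warnings;

sub run {
    my ($class, @argv) = @_;
    # the current top-level scraping code would move in here
    return 0;
}

# Modulino trick: run only when executed as a script, not when
# another program loads this file via require/use (caller is true then).
__PACKAGE__->run(@ARGV) unless caller();

1;
```

Loading the file as a module would then let tests call `Reddit::Image::Scraper->run(@args)` directly.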

aggrolite commented 10 years ago

http://www.drdobbs.com/scripts-as-modules/184416165

check out this book too, it's really good: http://www.amazon.com/Effective-Perl-Programming-Idiomatic-Development/dp/0321496949

aggrolite commented 10 years ago

oh hey, I just discovered HTML::TreeBuilder is a dependency of WWW::Mechanize: http://deps.cpantesters.org/?module=WWW%3A%3AMechanize;perl=latest

So adding HTML::TreeBuilder::XPath as a dependency shouldn't matter much then