postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Simple Command Line Interface (CLI) #68

Open buren opened 6 years ago

buren commented 6 years ago

Side note: First of all, thank you for an awesome gem. Over the past years I've reached for this gem numerous times for various purposes big and small, and it's always a joy to use. Thank you! 🙌


Simple Command Line Interface (CLI)

Rationale: I've found myself wanting to do a "quick and dirty" crawl of different websites quite often, for example to find pages that return 4XX or 5XX responses. So far I've written small Ruby scripts using spidr with the things I need. Many of these use cases could be solved with a fairly simple CLI.
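For context, the kind of throwaway script described above might look like the following. This is only a sketch: `error_status?` and `report_errors` are hypothetical helper names, and it assumes the spidr gem is installed and that `Spidr.site`, `every_page`, `page.code` and `page.url` behave as documented.

```ruby
# True for HTTP client (4XX) and server (5XX) error status codes.
def error_status?(code)
  (400..599).cover?(code.to_i)
end

# Spiders the site that start_url belongs to and prints every failing page.
def report_errors(start_url)
  require 'spidr' # loaded lazily so the helper above works without the gem
  Spidr.site(start_url) do |agent|
    agent.every_page do |page|
      puts "#{page.code} #{page.url}" if error_status?(page.code)
    end
  end
end

# Only crawl when run directly with a URL argument.
report_errors(ARGV.first) if __FILE__ == $PROGRAM_NAME && ARGV.first
```

Saved under a name of your choosing, it would be run as `ruby find_errors.rb https://example.com`, which is roughly the boilerplate a CLI would replace.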

Examples

spidr https://example.com

It supports all Spidr::Agent arguments:

spidr --limit=10 --user-agent=myagent https://example.com

You can output multiple values (CSV-style); the columns argument maps to methods on the page:

spidr --columns=code,url,title,content_type,meta_redirect? https://example.com
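A minimal sketch of how the columns-to-methods mapping could work. `FakePage` is a stand-in for `Spidr::Page` so the example runs without the gem, and `page_row` is a hypothetical helper, not part of spidr or spidr_cli.

```ruby
require 'csv'

# Stand-in for Spidr::Page (hypothetical), with a few of the same method names.
FakePage = Struct.new(:code, :url, :title)

# Turns a "--columns" string into one CSV row by calling each named
# method on the page object.
def page_row(page, columns)
  values = columns.split(',').map do |name|
    method_name = name.strip
    # Unknown columns are left blank instead of raising NoMethodError.
    page.respond_to?(method_name) ? page.public_send(method_name) : nil
  end
  CSV.generate_line(values)
end

page = FakePage.new(200, 'https://example.com/', 'Example Domain')
puts page_row(page, 'code,url,title')
```

Using `public_send` keeps private methods out of reach, though a real CLI would probably also want an allow-list of column names.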

Usage

Usage: spidr [options] <url>
        --columns=[val1,val2]        Columns in output
        --content-types=[val1,val2]  Formats to output (html, javascript, css, json, ..)
        --[no-]header                Include the header
        --open-timeout=val           Optional open timeout
        --read-timeout=val           Optional read timeout
        --ssl-timeout=val            Optional ssl timeout
        --continue-timeout=val       Optional continue timeout
        --keep-alive-timeout=val     Optional keep_alive timeout
        --proxy-host=val             The host the proxy is running on
        --proxy-port=val             The port the proxy is running on
        --proxy-user=val             The user to authenticate as with the proxy
        --proxy-password=val         The password to authenticate with
        --default-headers=[key1=val1,key2=val2]
                                     Default headers to set for every request
        --host-header=val            The HTTP Host header to use with each request
        --host-headers=[key1=val1,key2=val2]
                                     The HTTP Host headers to use for specific hosts
        --user-agent=val             The User-Agent string to send with each request
        --referer=val                The Referer URL to send with each request
        --queue=[val1,val2]          The initial queue of URLs to visit
        --history=[val1,val2]        The initial list of visited URLs
        --limit=val                  The maximum number of pages to visit
        --max-depth=val              The maximum link depth to follow
        --[no-]robots                Respect Robots.txt
    -h, --help                       How to use
        --version                    Show version
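As an illustration, a handful of the flags above could be declared with Ruby's standard OptionParser roughly like this. `parse_spidr_args` and the default values are assumptions made for the sketch, not what spidr_cli actually does.

```ruby
require 'optparse'

# Parses a small subset of the proposed flags and returns [options, url].
def parse_spidr_args(argv)
  options = { columns: ['url'], header: false } # assumed defaults

  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: spidr [options] <url>'
    # The Array type splits "code,url,title" on commas.
    opts.on('--columns=[val1,val2]', Array, 'Columns in output') do |value|
      options[:columns] = value if value
    end
    opts.on('--[no-]header', 'Include the header') { |value| options[:header] = value }
    opts.on('--limit=val', Integer, 'The maximum number of pages to visit') do |value|
      options[:limit] = value
    end
    opts.on('--user-agent=val', String, 'The User-Agent string to send') do |value|
      options[:user_agent] = value
    end
    opts.on('--[no-]robots', 'Respect Robots.txt') { |value| options[:robots] = value }
  end

  # parse (without the bang) leaves argv untouched and returns the non-option arguments.
  remaining = parser.parse(argv)
  [options, remaining.first]
end

options, url = parse_spidr_args(%w[--limit=10 --user-agent=myagent https://example.com])
puts "#{url} #{options.inspect}"
```

The resulting hash could then be passed more or less directly as keyword arguments to Spidr::Agent, assuming the option names are kept in sync with its parameters.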

todo


If you don't want to include this here, it could be a separate gem, something like spidr_cli (unless you object?). However, it would probably be easier for others to find if it's here.

Thanks!

buren commented 6 years ago

I've created a spidr_cli gem which includes the above-mentioned functionality, plus accept/reject arguments for hosts, ports, links and URLs, and the ability to choose which method to use: Spidr.site, Spidr.host or Spidr.start_at.

postmodern commented 3 years ago

Sorry for not noticing this. If I were to add a CLI, it would need to be a class called Spidr::CLI. It would also need to catch Interrupt and Errno::EPIPE exceptions (see how command_kit handles this). It would also need a --format or --output-format option to select plain text, CSV, or JSON output, as well as specs that invoke the command and use RSpec's .to output(...).to_stdout matcher.
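A rough, zero-dependency sketch of a class meeting those requirements is below. It is not the class from spidr, wordlist.rb or command_kit: the `crawl` placeholder, the injected streams, and the exit statuses (130 for Interrupt, 0 for a closed pipe) are all assumptions made for illustration.

```ruby
require 'json'
require 'optparse'

module Spidr
  # Hypothetical sketch only; defined here without loading the spidr gem.
  class CLI
    FORMATS = %w[text csv json].freeze

    def initialize(stdout: $stdout, stderr: $stderr)
      @stdout = stdout
      @stderr = stderr
      @format = 'text'
    end

    # Returns an exit status instead of calling exit, so specs can invoke it.
    def run(argv)
      url = option_parser.parse(argv).first
      unless url
        @stderr.puts option_parser.banner
        return 1
      end

      crawl(url) { |row| @stdout.puts format_row(row) }
      0
    rescue Interrupt
      130 # Ctrl-C: 128 + SIGINT, the usual shell convention
    rescue Errno::EPIPE
      0 # the reader (e.g. `head`) closed the pipe; assumed not to be an error
    rescue OptionParser::ParseError => error
      @stderr.puts "spidr: #{error.message}"
      1
    end

    def format_row(row)
      case @format
      when 'json' then JSON.generate(row)
      when 'csv'  then row.values.join(',') # naive join; no CSV quoting
      else row.values.join(' ')
      end
    end

    private

    def option_parser
      @option_parser ||= OptionParser.new do |opts|
        opts.banner = 'Usage: spidr [options] <url>'
        # Passing an Array restricts the argument to the listed values.
        opts.on('--format=FORMAT', FORMATS, 'text, csv or json') { |value| @format = value }
      end
    end

    # Placeholder: a real implementation would spider the URL and yield
    # one row per page. Here it yields a single canned row.
    def crawl(url)
      yield({ code: 200, url: url })
    end
  end
end
```

An executable would then only need `exit Spidr::CLI.new.run(ARGV)`, and because the streams are injectable, a spec can either pass StringIO objects or wrap the call in RSpec's `expect { ... }.to output(...).to_stdout`.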

Not to plug my own code too much, but you might want to consider using command_kit for your spidr-cli gem?

postmodern commented 2 years ago

If you want to get this merged, check out the CLI class from wordlist.rb. Feel free to copy its zero-dependency boilerplate CLI code.