laserson / ftptree

Crawl an FTP site and visualize file size-weighted directory tree

FTPTREE

Generates a browsable tree map of an FTP site, weighted by the amount of data in each directory. The FTP site is crawled using Scrapy.

Installation

Requires Python >= 2.7.

Required Python modules: Scrapy (crawling) and Bottle (web app).

Modules used for visualization/front end: d3.js and Bootstrap.

Clone the package from GitHub, e.g.,

git clone https://github.com/laserson/ftptree.git

The web app is run directly from the root project directory using servetree.py.

Usage overview

sites.json contains a JSON list of FTP site metadata objects describing which sites to include in the visualization.

crawltree.py crawls an FTP site and generates a JSON object representation of the directory tree, including sizes of the files.

crawlsites.py is a script to crawl each site listed in sites.json.

servetree.py is the Bottle.py app that serves the visualization.

The static/ directory contains the Bootstrap files.

index.html is the main d3.js visualization.
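
The JSON tree that the crawl produces and the visualization consumes can be pictured with a small sketch. This is not the project's actual code: the field names `name`, `size`, and `children`, and the helper names `build_tree` and `to_nested_lists`, are assumptions chosen for illustration (Python 3).

```python
import json

def build_tree(entries, root_name="/"):
    """Fold (path, size) pairs into a nested dict, summing sizes upward."""
    root = {"name": root_name, "size": 0, "children": {}}
    for path, size in entries:
        node = root
        node["size"] += size
        parts = [p for p in path.split("/") if p]
        # Walk the directory components only; the last part is the file name.
        for part in parts[:-1]:
            node = node["children"].setdefault(
                part, {"name": part, "size": 0, "children": {}}
            )
            node["size"] += size
    return root

def to_nested_lists(node):
    """Turn child dicts into lists, a shape hierarchy layouts commonly accept."""
    return {
        "name": node["name"],
        "size": node["size"],
        "children": [to_nested_lists(c) for c in node["children"].values()],
    }

# Hypothetical sample data: (file path, size in bytes).
entries = [("/pub/a.txt", 100), ("/pub/data/b.bin", 900), ("/readme", 24)]
tree = to_nested_lists(build_tree(entries))
print(json.dumps(tree, indent=2))
```

Each directory's `size` is the total of every file beneath it, which is the weight a tree map needs.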

To crawl:

scrapy crawl ftptree -a config_file=sites/ncbi.json -s JOBDIR=tmp_crawl/ncbi -o crawls/ncbi.txt -t jsonlines
scrapy crawl ftptree -a config_file=sites/ucsc.json -s JOBDIR=tmp_crawl/ucsc -o crawls/ucsc.txt -t jsonlines
scrapy crawl ftptree -a config_file=sites/cdc.json -s JOBDIR=tmp_crawl/cdc -o crawls/cdc.txt -t jsonlines
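
The `jsonlines` output holds one JSON object per line, so it can be processed as a stream. The sketch below sums bytes per top-level directory. The record fields `path` and `size` are assumptions for illustration, since the spider's real item schema is not shown here; check a line of the crawl file before relying on them (Python 3).

```python
import io
import json
from collections import Counter

def sizes_by_top_dir(lines):
    """Sum file sizes per top-level directory from JSON Lines records."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        parts = [p for p in record["path"].split("/") if p]
        # Files sitting directly in the root are grouped under ".".
        top = parts[0] if len(parts) > 1 else "."
        totals[top] += record["size"]
    return totals

# Hypothetical sample standing in for a file such as crawls/ncbi.txt.
sample = io.StringIO(
    '{"path": "/sra/a.sra", "size": 10}\n'
    '{"path": "/sra/b.sra", "size": 5}\n'
    '{"path": "/README", "size": 1}\n'
)
totals = sizes_by_top_dir(sample)
print(dict(totals))
```

For a real crawl, pass an open file handle instead of the `io.StringIO` sample.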

How to crawl an FTP tree

The crawltree.py script parses FTP directory listings, whose format depends on the server. The listing method can be "unix", "windows", or "mlsd". MLSD is preferred where the server supports it, because it returns file information in a standardized, machine-readable form instead of the free-form text produced by LIST.
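
To illustrate why the listing format matters, here is a minimal sketch of parsing one LIST line in each text format. It is not the parser used by crawltree.py; the regular expressions and the function name `parse_list_line` are assumptions that cover only common Unix-style and Windows (DOS-style) layouts (Python 3).

```python
import re

# Example: "-rw-r--r--   1 ftp      ftp          1024 Jan 01  2013 file name.txt"
UNIX_RE = re.compile(
    r"^(?P<perms>[-dlbcps][-rwxsStT]{9})\s+\d+\s+\S+\s+\S+\s+"
    r"(?P<size>\d+)\s+\w{3}\s+\d{1,2}\s+[\d:]{4,5}\s+(?P<name>.+)$"
)
# Example: "01-01-13  09:15AM       <DIR>          pub"
WINDOWS_RE = re.compile(
    r"^\d{2}-\d{2}-\d{2,4}\s+\d{2}:\d{2}(?:AM|PM)\s+"
    r"(?P<size><DIR>|\d+)\s+(?P<name>.+)$"
)

def parse_list_line(line, method):
    """Return {'name', 'is_dir', 'size'} for one LIST line, or None if unparsed."""
    if method == "unix":
        m = UNIX_RE.match(line)
        if not m:
            return None
        return {
            "name": m["name"],
            "is_dir": m["perms"].startswith("d"),
            "size": int(m["size"]),
        }
    if method == "windows":
        m = WINDOWS_RE.match(line)
        if not m:
            return None
        is_dir = m["size"] == "<DIR>"
        return {
            "name": m["name"],
            "is_dir": is_dir,
            "size": 0 if is_dir else int(m["size"]),
        }
    raise ValueError("unknown method: %r" % method)

unix_entry = parse_list_line(
    "-rw-r--r--   1 ftp      ftp          1024 Jan 01  2013 file name.txt", "unix"
)
win_entry = parse_list_line("01-01-13  09:15AM       <DIR>          pub", "windows")
print(unix_entry, win_entry)
```

Servers vary in the details (symlinks, extra permission flags, localized dates), so lines that match neither pattern return `None` here rather than a guess.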

To determine the method to use for a given FTP site, run e.g.

./crawltree.py --host ftp.cdc.gov --output data/cdc.json --test-method

This returns a sample listing and reports whether MLSD succeeded or failed. If MLSD failed, the output shows an example listing so you can determine whether the format is Unix-like or Windows-like.
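
The same check can be sketched with Python's standard `ftplib`: try MLSD first and fall back to a raw LIST if the server rejects it. This is an illustration rather than the logic behind `--test-method`; the function name `detect_method` and the stub class `StubFTP` are hypothetical (Python 3.3+ for `FTP.mlsd`).

```python
import ftplib

def detect_method(ftp, path=""):
    """Return ("mlsd", entries) if MLSD works, else ("list", raw_lines)."""
    try:
        # mlsd() is lazy, so list() forces the command to be sent.
        return "mlsd", list(ftp.mlsd(path))
    except ftplib.error_perm:
        lines = []
        ftp.retrlines("LIST " + path if path else "LIST", lines.append)
        return "list", lines

class StubFTP:
    """Stand-in for ftplib.FTP that mimics a server without MLSD support."""
    def mlsd(self, path=""):
        raise ftplib.error_perm("500 MLSD not understood")
    def retrlines(self, cmd, callback):
        callback("01-01-13  09:15AM       <DIR>          pub")

method, sample = detect_method(StubFTP())
print(method, sample)
```

Against a real server, the stub would be replaced by a connected client, for example `ftplib.FTP("ftp.cdc.gov")` followed by `login()`. When the result is `"list"`, inspect the raw lines to decide between the unix and windows methods.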

Once the appropriate listing method is known, a typical crawl command looks like this:

./crawltree.py --host ftp.cdc.gov --output data/cdc.json --method windows

You can specify where to start the crawl by adding a --root path/to/root option, e.g.,

./crawltree.py --host hgdownload.cse.ucsc.edu --root goldenPath --output data/ucscgb.json --method mlsd
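
What starting from a root means can be sketched as a directory walk that begins at the given path and only descends from there. The example below totals file sizes using MLSD entries. It is a simplified illustration, not crawltree.py itself; `crawl_mlsd` and `StubFTP` are hypothetical names, and the stub stands in for a connected `ftplib.FTP` client (Python 3).

```python
import posixpath

def crawl_mlsd(ftp, root=""):
    """Return the total bytes of all files under root, walking via MLSD."""
    total = 0
    stack = [root]
    while stack:
        directory = stack.pop()
        for name, facts in ftp.mlsd(directory):
            kind = facts.get("type")
            if kind == "dir":
                # "cdir" and "pdir" (self and parent) are skipped implicitly.
                stack.append(posixpath.join(directory, name))
            elif kind == "file":
                total += int(facts.get("size", 0))
    return total

class StubFTP:
    """Stand-in for ftplib.FTP serving a tiny fixed directory tree."""
    listings = {
        "goldenPath": [
            (".", {"type": "cdir"}),
            ("hg19", {"type": "dir"}),
            ("README", {"type": "file", "size": "10"}),
        ],
        "goldenPath/hg19": [("chr1.fa.gz", {"type": "file", "size": "90"})],
    }
    def mlsd(self, path=""):
        return iter(self.listings[path])

total_bytes = crawl_mlsd(StubFTP(), root="goldenPath")
print(total_bytes)
```

An explicit stack is used instead of recursion so that very deep directory trees do not hit Python's recursion limit.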

OLD OLD OLD

Crawled FTP sites

ftp://ftp.ncbi.nlm.nih.gov/sra

ftp://ftp.fcc.gov/
ftp://ftp.rma.usda.gov/pub/
ftp://ftp.epa.gov/
ftp://ftp.fsa.usda.gov/
ftp://ftp.ngdc.noaa.gov/
ftp://tgftp.nws.noaa.gov/
ftp://ftp.ncdc.noaa.gov/pub
ftp://ftp.cdc.noaa.gov/
ftp://ftp2.census.gov
ftp://emi.nasdaq.com/
ftp://ftp.nasdaqtrader.com/
ftp://ftp.resource.org/
ftp://ftp.uspto.gov/pub/
ftp://ftp.eia.doe.gov/
ftp://ftp.broadinstitute.org/pub
ftp://ftpext.usgs.gov/pub/