hasadna / open-pension-net-scraper

Scrape net websites related to open pension
BSD 3-Clause "New" or "Revised" License
3 stars 2 forks source link

Open Pension Scraper

Build Status

This repo contains tools and [meta]data for the purpose of extracting publicly-available online information to be used by Open Pension.

Open Pension is a "Hasadna" project, aiming to reveal the secrets behind the Israeli pension market.

Pre Requirements

Installation

Note: if you source venv/bin/activate in a shell, you can skip the ./envrun.sh in commands here (still, it's handy if you open a shell only to run rq worker).

If you want rq-dashboard (for monitoring batch jobs via browser):

Running

Dump portfolio of a single month

[this is something you don't need redis for]

For exmaple: ./envrun.sh python -m web-sources.gemelnet 101 2016 1 would write Jan 2016 portfolio of kupa 101 to data/gemelnet-monthly-portfolios/101-2016-01.csv.

Pensianet allows dumping portfolios of all kranot in a single request: ./envrun.sh python -m web-sources.pensianet 2016 1 would write data/pensianet/2016-01.csv containing portfolios of all kranot for Jan 2016.

Batch dump reports over a period

run these on separate shells:

Dumping portfolios

For example: ./envrun.sh python batch_gemelnet.py 101 1999 8 2002 4 would queue jobs that dump portfolios of kupa 101 for all months between Aug 1999 and April 2002 into data/gemelnet-monthly-portfolios/101-1999-08.csv ... data/gemelnet-monthly-portfolios/101-2002-04.csv

Dumping performance reports

For example: ./envrun.sh python batch_gemelnet.py 101 1999 8 2002 4 --type p (or ./envrun.sh python batch_gemelnet.py 101 1999 8 2002 4 -t p) would queue a single job to fetch a performance report for kupa 101 between Aug 1999 and April 2002 and dump it into data/gemelnet/perf-101-1999-08-2002-04.csv.

There's no batch_pensianet.py [yet?], but you can get performance for a 12 month period by using (for example) ./envrun.sh python -m web_sources.pensianet 2016 10 --type p (or -t p). This would dump 11/2015-10/2016 performance of all kranot to data/pensianet/perf-2015-11-2016-10.csv. [This doesn't require redis].

Monthly incremental dumping

./dump-latest.sh [N] queues dumps of portfolios for all kupot N months ago, and performance reports for all kupot for the year between N+11 and N months ago.

It also dumps portfolios and performance of all pensianet kranot for the same periods, but these appear as single csv files (as opposed to a file per kupa in gemelnet).

Default for N is 2 (data for last month isn't available early in the month, while data for 2 months ago is always available).

Monitoring/controlling job execution

You can suspend job execution with ./envrun.sh rq suspend and resume work with ./envrun.sh rq resume

Utilities

[Don't require redis]

Generate CSV with totals of all Gemelnet portfolios for a given month

For example: ./envrun.sh gemelnet_totals.py 2016 9 would generate data/gemelnet/totals-2016-09.csv with "bottom lines" of all {kupa id}-2016-09.csv portfolios.

Batch lookup stock names at quotenet

For example: ./envrun.sh python batch_quotenet_stock_names.py < sample-data/quotenet-queries.txt > something.csv should generate something similar to sample-data/quotenet-output.csv.

Tests

Only pep8 so far ;)

Contribute

Just fork and do a pull request (;