ru-stat / data-team-ru-stat

Team to tackle dataflow in Russian economic statistics (macro, corporate, banking)

Data pipeline #8

Open epogrebnyak opened 7 years ago

epogrebnyak commented 7 years ago

Our project is about aggregating data from individual parsers under a common namespace and releasing the data through a final API (correct me if something is missing):

  1. Raw data source - a file or a web page at Rosstat or another agency.
  2. Parser reads the source and provides resulting CSVs in its data/processed folder.
  3. Some automated and human-eye validation is done during/after parsing.
  4. 'daemon' collects data/processed CSVs from several parsers into a common database.
  5. The user queries the database through an API to get aggregated data from the parsers (the simplest call returns an aggregate CSV with variables from different parsers; a sketch of such a CSV follows below).
  6. The user reads data from a URL or stores data as a local file (using standard code in R/pandas or a designated package).
  7. The user writes code in a Jupyter notebook with R or pandas using the data.
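
A minimal sketch of what such an aggregate CSV might look like and how it reads into pandas. The column names and values here are placeholders for illustration only, not the actual output format:

```python
import io

import pandas as pd

# hypothetical aggregate CSV: one time index, one column per variable,
# with variables coming from different parsers
csv_text = """time_index,EXPORT_GOODS_TOTAL,FX_USD
2017-04-30,28.0,57.0
2017-05-31,29.0,58.0
"""

df = pd.read_csv(io.StringIO(csv_text),
                 converters={'time_index': pd.to_datetime},
                 index_col='time_index')
print(df)
```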

Comments welcome.

neotheicebird commented 7 years ago

Maybe release a Python and/or R API client for users to access the API easily from Jupyter notebooks?

Rotzke commented 7 years ago

@neotheicebird Yeah, we discussed it at the meeting - using AWS API.

neotheicebird commented 7 years ago

@Rotzke awesome! Just to keep us on the same page, I mean a Python/R client-side library, separate from the web API development. Thanks

Rotzke commented 7 years ago

@neotheicebird Up to you, good sir :) Created a new issue on teams.

epogrebnyak commented 7 years ago

@neotheicebird - at least some standard code to access the data will be very useful. In pandas we have something like:

```python
import pandas as pd  # url_m below should point to the stable URL of the monthly CSV

dfm = pd.read_csv(url_m, converters={'time_index': pd.to_datetime},
                  index_col='time_index')
```

This works for reading monthly data from a stable URL, but it is slow to query the internet every time we run the program, so we may want a class to load/update the data, similar to the one below (from here):

```python
class LocalDataset:
    """Time series read from a local cache, with a fallback to a fresh download.

    Assumes helper functions get_local_data_as_series(), get_data_as_series()
    and save_local_data() are defined elsewhere.
    """

    def __init__(self, _id):
        self._id = _id
        try:
            # try the local cache first
            self.ts = get_local_data_as_series(_id)
        except Exception:
            print("Cannot load from local file for id: " + str(_id))
            self.update()

    def update(self):
        # download a fresh copy and save it to the local cache
        self.ts = get_data_as_series(self._id)
        save_local_data(self._id, self.ts)
        return self
```
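
Hypothetical usage, assuming a series id like 'FX_USD' and the helper functions mentioned above:

```python
# hypothetical usage; 'FX_USD' is an assumed series id
ds = LocalDataset('FX_USD')    # reads the cached series or downloads it
ts = ds.update().ts            # force a refresh and get the series
```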

Maybe this can be a client/small library/PyPI package, but for now we can just offer some preferred code to download and manipulate the data. Updated the pipeline to indicate this.

neotheicebird commented 7 years ago

Awesome, didn't know about pd.read_csv taking a URL argument.

neotheicebird commented 7 years ago

@epogrebnyak the code example and a simple PyPI package to access the API sound good

Rotzke commented 7 years ago

@neotheicebird @epogrebnyak Guys, we have Slack for chatting! :)

epogrebnyak commented 7 years ago

Based on discussion with @Rotzke, updated pipeline:

  1. Scrapers collect source files/web pages from Rosstat or other agencies into a document file store / database (Mongo, a NoSQL database). This is raw, immutable data.
  2. Parsers take raw files from the file store / database and produce processed files (CSV or other).
  3. Automated and human-eye validators check the parsing results.
  4. Aggregator daemon collects data from the processed files of different parsers into a production database (a rough sketch follows after this list).
  5. Web frontend gives end users a database overview and access instructions.
  6. End users apply standard code or packages in R/pandas to access the production database (they may also cache data to a local file):
    • by reading downloadable CSV dumps at a stable URL
    • through an end-user API
  7. The user writes code in a Jupyter notebook with R or pandas using the data.
  8. Visualization and data analysis by end user impress everyone.
  9. ...
  10. Profit! ;-)
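
A rough sketch of what the aggregator daemon in step 4 might do, assuming each parser publishes a processed CSV with a common time index. The URLs, table name and the choice of SQLite here are placeholders, not decisions we have made:

```python
import sqlite3

import pandas as pd

# hypothetical locations of parser outputs
PARSER_CSV_URLS = [
    "https://example.com/rosstat-parser/data/processed/dfm.csv",
    "https://example.com/cbr-parser/data/processed/dfm.csv",
]

def read_processed(url):
    """Read one parser's processed CSV, indexed by time."""
    return pd.read_csv(url, converters={'time_index': pd.to_datetime},
                       index_col='time_index')

def aggregate(urls, db_path="production.sqlite"):
    """Join parser outputs on the time index and store them in one table."""
    df = pd.concat([read_processed(u) for u in urls], axis=1)
    with sqlite3.connect(db_path) as conn:
        df.to_sql("monthly", conn, if_exists="replace")
    return df
```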

epogrebnyak commented 7 years ago

Some more detail on the pipeline, based on mini-kep:

Raw data:

Parsing:

Transformation:

Frontend:

End-user:

This is to discuss the role of an interim database.

epogrebnyak commented 7 years ago

My thoughts are still about a minimum working example (MWE) for several parsers that can produce compatible output, and a pipeline to allow them to work together. Here is an example of this kind.

End user case - MWE

The end user wants to calculate Russian monthly non-oil exports and see this figure in roubles. This is a somewhat simplistic task, but it is not two galaxies away from everyday Excel calculations, just about one. We need something that pulls data from different sources.

The formula will be:

```
EXPORT_EX_OIL = FX_USD * (EXPORT_GOODS_TOTAL - NAT_EXPORT_OIL_t * PRICE_BRENT * CONVERSION_FACTOR)

EXPORT_EX_OIL      - non-oil exports, rub
FX_USD             - exchange rate, rub/usd
EXPORT_GOODS_TOTAL - total goods exports, usd
NAT_EXPORT_OIL_t   - oil export volume, mln t
PRICE_BRENT        - oil price, usd/barrel
CONVERSION_FACTOR  - about 6.3 barrels per tonne
```
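
A minimal sketch of this calculation on a monthly dataframe, assuming it already has columns named as above (the dataframe itself and the helper name are hypothetical):

```python
import pandas as pd

BARRELS_PER_TONNE = 6.3  # CONVERSION_FACTOR from the formula above

def non_oil_exports_rub(df: pd.DataFrame) -> pd.Series:
    """EXPORT_EX_OIL = FX_USD * (EXPORT_GOODS_TOTAL - oil exports valued at Brent)."""
    oil_exports_usd = df['NAT_EXPORT_OIL_t'] * df['PRICE_BRENT'] * BARRELS_PER_TONNE
    return df['FX_USD'] * (df['EXPORT_GOODS_TOTAL'] - oil_exports_usd)
```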

The sources are:

Implementation - MWE and extensions

Multiple data sources. Imagine you have working parsers for Rosstat, EIA and Bank of Russia publications. Each parser produces output CSV files in its data/processed folder. To complete the task, the end user queries the URLs of those data/processed folders with pd.read_csv and merges the dataframes (a sketch follows below). The rest is calculation on the dataframes.
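
A sketch of this parser-to-notebook path, assuming each parser publishes its processed CSV at a stable URL; the URLs here are placeholders:

```python
import pandas as pd

# placeholder URLs for each parser's data/processed output
urls = [
    'https://example.com/rosstat-parser/data/processed/dfm.csv',
    'https://example.com/eia-parser/data/processed/dfm.csv',
    'https://example.com/cbr-parser/data/processed/dfm.csv',
]

# read each parser's CSV and join them on the common time index
frames = [pd.read_csv(u, converters={'time_index': pd.to_datetime},
                      index_col='time_index') for u in urls]
df = pd.concat(frames, axis=1)
# the non-oil export calculation from the earlier sketch can now run on df
```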

For this to work well:

  1. variables in different output CSVs follow the same naming convention (relatively easy)
  2. output CSVs must have the same format (easy)
  3. output CSVs are found at stable URLs (easy)
  4. there is common code that simplifies reading output CSVs from URLs (easy)
  5. someone tells you there are no conflicts between CSVs (remember FX_USD appears in two sources) - easy with few variables
  6. CSVs were updated at the same time (easy the first time, more challenging the next time you come back)
  7. before everything else, each parser works right

This is a parser-to-notebook solution, no database, no API.

Single data source. Imagine someone took the burden to collect the output CSVs into one dataframe for you and told you this is your reference dataset, go ahead with it. In other words, someone took care of problems #1, #5, #6 and, hopefully, #7. You deal with just one URL, but when needed you can check it at the source. This single data source may be a meta-parser and probably can also be a GitHub repo.

There is still no single database and no API, but this:

Still not convinced where exactly an interim database fits (storing parsing inputs?), but so far @Rotzke says we need one, so I take it for granted.

A little roadmap to keep going, I think, is the following:

  1. present some existing parsers to the team
  2. unify the parser output format in data/processed
  3. make a dummy version of a 'meta-parser' that merges data/processed outputs into a single CSV
  4. see what kind of information is needed at the user frontend (e.g. variable descriptions, latest values, graphs, etc.)
  5. provide common end-user access code for a single CSV set at a stable URL (in pandas/R)
  6. make a demo Jupyter notebook with calculations using the dataset

From this skeleton we can quickly build a common database, a database API and much other magnificent stuff (even an interim database), as well as add more parsers.

Hope someone still wants to do this (this way). ;)

epogrebnyak commented 7 years ago

After the 20.06.17 video chat, brief notes. Our pipeline to work with the data is the following:

todo to follow!