ru-stat / data-team-ru-stat

Team to tackle dataflow in Russian economic statistics (macro, corporate, banking)

Data pipeline #8

Open epogrebnyak opened 7 years ago

epogrebnyak commented 7 years ago

Our project is about aggregating data from individual parsers under a common namespace and releasing the data through a final API (correct me if something is missing):

  1. Raw data source - a file or a web page at Rosstat or another agency.
  2. Parser reads the source and provides resulting CSVs in its data/processed folder.
  3. Some automated and human-eye validation is done during/after parsing.
  4. 'daemon' collects data/processed CSVs from several parsers into a common database.
  5. The user queries the database through an API to get aggregated data from the parsers (the simplest call returns an aggregate CSV with variables from different parsers; a sketch of such a CSV follows below).
  6. The user reads data from a URL or stores data as a local file (using standard code in R/pandas or a designated package).
  7. The user writes code in a Jupyter notebook with R or pandas using the data.
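
A minimal sketch of what such an aggregate CSV might look like and how it reads into pandas. The column names and values here are placeholders for illustration only, not the actual output format:

```python
import io

import pandas as pd

# hypothetical aggregate CSV: one time index, one column per variable,
# with variables coming from different parsers
csv_text = """time_index,EXPORT_GOODS_TOTAL,FX_USD
2017-04-30,28.0,57.0
2017-05-31,29.0,58.0
"""

df = pd.read_csv(io.StringIO(csv_text),
                 converters={'time_index': pd.to_datetime},
                 index_col='time_index')
print(df)
```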

Comments welcome.

neotheicebird commented 7 years ago

Maybe release a Python and/or R API client for users to access the API easily from Jupyter notebooks?

Rotzke commented 7 years ago

@neotheicebird Yeah, we discussed it at the meeting - using AWS API.

neotheicebird commented 7 years ago

@Rotzke awesome! Just to keep us on the same page, I mean a Python/R client-side library, separate from the web API development. Thanks

Rotzke commented 7 years ago

@neotheicebird Up to you, good sir :) Created a new issue on teams.

epogrebnyak commented 7 years ago

@neotheicebird - at least some standard code to access the data will be very useful. In pandas we have something like:

```python
import pandas as pd  # url_m below should point to the stable URL of the monthly CSV

dfm = pd.read_csv(url_m, converters={'time_index': pd.to_datetime},
                  index_col='time_index')
```

This works for reading monthly data from a stable URL, but it is slow to query the internet every time we run the program, so we may want a class to load/update the data, similar to the one below (from here):

```python
class LocalDataset:
    """Time series read from a local cache, with a fallback to a fresh download.

    Assumes helper functions get_local_data_as_series(), get_data_as_series()
    and save_local_data() are defined elsewhere.
    """

    def __init__(self, _id):
        self._id = _id
        try:
            # try the local cache first
            self.ts = get_local_data_as_series(_id)
        except Exception:
            print("Cannot load from local file for id: " + str(_id))
            self.update()

    def update(self):
        # download a fresh copy and save it to the local cache
        self.ts = get_data_as_series(self._id)
        save_local_data(self._id, self.ts)
        return self
```
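
Hypothetical usage, assuming a series id like 'FX_USD' and the helper functions mentioned above:

```python
# hypothetical usage; 'FX_USD' is an assumed series id
ds = LocalDataset('FX_USD')    # reads the cached series or downloads it
ts = ds.update().ts            # force a refresh and get the series
```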

Maybe this can be a client/small library/PyPI package, but for now we can just offer some preferred code to download and manipulate the data. Updated the pipeline to indicate this.

neotheicebird commented 7 years ago

Awesome, didn't know about pd.read_csv taking a URL argument.

neotheicebird commented 7 years ago

@epogrebnyak the code example and a simple PyPI package to access the API sound good

Rotzke commented 7 years ago

@neotheicebird @epogrebnyak Guys, we have Slack for chatting! :)

epogrebnyak commented 7 years ago

Based on discussion with @Rotzke, updated pipeline:

  1. Scrapers collect source files/web pages from Rosstat or other agencies into a document file store / database (Mongo, a NoSQL database). This is raw, immutable data.
  2. Parsers take raw files from the file store / database and produce processed files (CSV or other).
  3. Automated and human-eye validators check the parsing results.
  4. Aggregator daemon collects data from the processed files of different parsers into a production database (a rough sketch follows after this list).
  5. Web frontend gives end users a database overview and access instructions.
  6. End users apply standard code or packages in R/pandas to access the production database (they may also cache data to a local file):
    • by reading downloadable CSV dumps at a stable URL
    • through an end-user API
  7. The user writes code in a Jupyter notebook with R or pandas using the data.
  8. Visualization and data analysis by end user impress everyone.
  9. ...
  10. Profit! ;-)
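
A rough sketch of what the aggregator daemon in step 4 might do, assuming each parser publishes a processed CSV with a common time index. The URLs, table name and the choice of SQLite here are placeholders, not decisions we have made:

```python
import sqlite3

import pandas as pd

# hypothetical locations of parser outputs
PARSER_CSV_URLS = [
    "https://example.com/rosstat-parser/data/processed/dfm.csv",
    "https://example.com/cbr-parser/data/processed/dfm.csv",
]

def read_processed(url):
    """Read one parser's processed CSV, indexed by time."""
    return pd.read_csv(url, converters={'time_index': pd.to_datetime},
                       index_col='time_index')

def aggregate(urls, db_path="production.sqlite"):
    """Join parser outputs on the time index and store them in one table."""
    df = pd.concat([read_processed(u) for u in urls], axis=1)
    with sqlite3.connect(db_path) as conn:
        df.to_sql("monthly", conn, if_exists="replace")
    return df
```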

epogrebnyak commented 7 years ago

Some more detail on the pipeline, based on mini-kep:

Raw data:

Parsing:

Transformation:

Frontend:

End-user:

This is to discuss the role of an interim database.

epogrebnyak commented 7 years ago

My thoughts are still about a minimum working example (MWE) for several parsers that can produce compatible output, and a pipeline to allow them to work together. Here is an example of this kind.

End user case - MWE

The end user wants to calculate Russian monthly non-oil exports and see this figure in roubles. This is a somewhat simplistic task, but it is not two galaxies away from everyday Excel calculations, just about one. We need something that pulls data from different sources.

The formula will be:

```
EXPORT_EX_OIL = FX_USD * (EXPORT_GOODS_TOTAL - NAT_EXPORT_OIL_t * PRICE_BRENT * CONVERSION_FACTOR)

EXPORT_EX_OIL      - non-oil exports, rub
FX_USD             - exchange rate, rub/usd
EXPORT_GOODS_TOTAL - total goods exports, usd
NAT_EXPORT_OIL_t   - oil export volume, mln t
PRICE_BRENT        - oil price, usd/barrel
CONVERSION_FACTOR  - about 6.3 barrels per tonne
```
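
A minimal sketch of this calculation on a monthly dataframe, assuming it already has columns named as above (the dataframe itself and the helper name are hypothetical):

```python
import pandas as pd

BARRELS_PER_TONNE = 6.3  # CONVERSION_FACTOR from the formula above

def non_oil_exports_rub(df: pd.DataFrame) -> pd.Series:
    """EXPORT_EX_OIL = FX_USD * (EXPORT_GOODS_TOTAL - oil exports valued at Brent)."""
    oil_exports_usd = df['NAT_EXPORT_OIL_t'] * df['PRICE_BRENT'] * BARRELS_PER_TONNE
    return df['FX_USD'] * (df['EXPORT_GOODS_TOTAL'] - oil_exports_usd)
```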

The sources are:

Implementation - MWE and extensions

Multiple data sources. Imagine you have working parsers for Rosstat, EIA and Bank of Russia publications. Each parser produces output CSV files in its data/processed folder. To complete the task, the end user queries the URLs of those data/processed folders with pd.read_csv and merges the dataframes (a sketch follows below). The rest is calculation on the dataframes.
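
A sketch of this parser-to-notebook path, assuming each parser publishes its processed CSV at a stable URL; the URLs here are placeholders:

```python
import pandas as pd

# placeholder URLs for each parser's data/processed output
urls = [
    'https://example.com/rosstat-parser/data/processed/dfm.csv',
    'https://example.com/eia-parser/data/processed/dfm.csv',
    'https://example.com/cbr-parser/data/processed/dfm.csv',
]

# read each parser's CSV and join them on the common time index
frames = [pd.read_csv(u, converters={'time_index': pd.to_datetime},
                      index_col='time_index') for u in urls]
df = pd.concat(frames, axis=1)
# the non-oil export calculation from the earlier sketch can now run on df
```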

For this to work well:

  1. variables in different output CSVs follow the same naming convention (relatively easy)
  2. output CSVs must have the same format (easy)
  3. output CSVs are found at stable URLs (easy)
  4. there is common code that simplifies reading output CSVs from URLs (easy)
  5. someone tells you there are no conflicts between CSVs (remember FX_USD appears in two sources) - easy with few variables
  6. CSVs were updated at the same time (easy the first time, more challenging the next time you come back)
  7. before everything else, each parser works right

This is a parser-to-notebook solution, no database, no API.

Single data source. Imagine someone took the burden to collect the output CSVs into one dataframe for you and told you this is your reference dataset, go ahead with it. In other words, someone took care of problems #1, #5, #6 and, hopefully, #7. You deal with just one URL, but when needed you can check it at the source. This single data source may be a meta-parser and probably can also be a GitHub repo.

There is still no single database and no API, but this:

Still not convinced where exactly an interim database fits (storing parsing inputs?), but so far @Rotzke says we need one, so I take it for granted.

A little roadmap to keep going, I think, is the following:

  1. present some existing parsers to the team
  2. unify the parser output format in data/processed
  3. make a dummy version of a 'meta-parser' that merges data/processed outputs into a single CSV
  4. see what kind of information is needed at the user frontend (e.g. variable descriptions, latest values, graphs, etc.)
  5. provide common end-user access code for a single CSV set at a stable URL (in pandas/R)
  6. make a demo Jupyter notebook with calculations using the dataset

From this skeleton we can quickly build a common database, a database API and much other magnificent stuff (even an interim database), as well as add more parsers.

Hope someone still wants to do this (this way). ;)

epogrebnyak commented 7 years ago

After the 20.06.17 video chat, brief notes. Our pipeline to work with the data is the following:

todo to follow!