epogrebnyak commented 7 years ago

Agenda:

project requirements
design and technical solutions considerations

Venue: https://hangouts.google.com/hangouts/_/r7fusbkh3neivjcbabvqgxigsme

Discussion materials:

to be published here before the session

Mission

Enable access to continuously updated macroeconomic, corporate and bank data for Russia.

1. Macroeconomic time series

User cases

User 1 ('Sasha') is a data analyst. He is proficient in R/pandas and prefers CSV files to Excel. He knows how to demonstrate his work in Jupyter notebooks, is willing to upload to GitHub and wants his graphs and models to be continuously updated without having to download data manually. He has worked with Quandl and FRED, and at work he has paid access to CEIC and Bloomberg.

User 2 ('Vera') is also a researcher. She is confident with Excel, so she can do all of her work without depending on programming. She maintains her universe of data, graphs and models as a collection of local files. Her boss is an Excel user too, so there is little incentive for Vera to move to R/pandas. She would consider downloading the initial data as an Excel file, e.g. exchange rates or interest rates, but only once she trusts the data source. She tried some Excel add-ons, but did not like them much.

User 3 ('Simon') is a journalist. He covers the Russian economy and once in a while comments on data releases. He calls analysts in banks and research centers to find out their views on the newest inflation or household income figure and can deliver a plain-word economic story. He has a BA in Medieval Literature, so he is not into Excel at all. Simon would rather click through some graphs that allow him to interrogate analysts better. He is also keen to post his story on FB or Twitter before other journalists do.

User 4 is a Robot. It learned we have some great open data and wants to borrow from us.

User 5 is a boss's secretary (give name). She wants to pass a printout to her boss, with "all latest macro", so the boss feels confident at a meeting. Neither is an expert in macroeconomics, but the printout allows asking questions and sustaining some discussion. See a nice example of "all macro" at the NY Fed.

Data access methods by users:

Principal:

work with API, download latest data to local file (Sasha and Robot)
download CSV file from stable URL, save to local file (Sasha)

Additional:

download Excel file (Vera)
repost a graph on Facebook (Simon)
print "all macro" from PDF file (Sec)

Data structure

Dataset is a collection of dataframes at different frequencies ("aqmwd")
Variable labels contain varname and unit of measurement (e.g. EXPORT_GOODS_bln_rub)
A dataset released at a specific date (usually a year and month) is called a vintage

A macroeconomic time series dataset is a pandas/R dataframe with variable names as column names and timestamps as row heads. At monthly frequency it looks something like this:

<insert example here>

Frequencies: The dataframe may be at annual, quarterly, monthly, weekly or daily frequency. Some variables are annual and quarterly (GDP), some are found at all frequencies (exchange rate). Let's denote frequencies by "aqmwd". Each frequency has its own dataframe. Frequencies cannot be mixed in a single dataframe.

Files: A dataframe is dumped to CSV file, found at stable URL or retrieved by API.

Namespace: a variable label consists of two parts: a varname in capital letters (GDP) and a unit of measurement in lowercase (rog), joined by a single underscore (GDP_rog). Each part may contain its own underscores. Throughout the code, label is the full variable name, varname is the uppercase part and unit is the lowercase part.
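A minimal sketch of how the naming rule above could be applied in code. The function names `split_label` and `make_label` are hypothetical (they are not taken from the project repos), and the sketch assumes that every underscore-separated part of the varname is uppercase and that the unit starts at the first part that is not:

```python
def split_label(label):
    """Split a label into (varname, unit), e.g. 'GDP_rog' -> ('GDP', 'rog').

    Hypothetical helper: the varname is assumed to be the leading run of
    uppercase parts, the unit is everything after it.
    """
    parts = label.split("_")
    n = 0
    # Count leading parts that are fully uppercase (the varname).
    while n < len(parts) and parts[n].isupper():
        n += 1
    if n == 0 or n == len(parts):
        raise ValueError("label must have a varname and a unit: %r" % label)
    return "_".join(parts[:n]), "_".join(parts[n:])


def make_label(varname, unit):
    """Join a varname and a unit back into a full label."""
    return varname + "_" + unit


print(split_label("GDP_rog"))
print(split_label("EXPORT_GOODS_bln_usd"))
print(split_label("USD_rub_eop"))
```

Splitting on the first non-uppercase part keeps labels such as EXPORT_GOODS_bln_usd unambiguous even though both the varname and the unit contain underscores.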

Vintage: a dataset released at some date is called a vintage. Some macroeconomic data gets revised (like GDP), so the accuracy of the first estimate (GDP for 2017Q1 released in May or June 2017) compared to the final figure (GDP for 2017Q1 published towards the end of 2017) is subject to professional discussion. Many end users just want the latest values.

User scenarios

The user wants to browse a description of data, like variable names and units of measurement:

GDP_rog                GDP, rate of growth to previous period
CPI_yoy                Consumer price index, change to year earlier  
EXPORT_GOODS_bln_usd   Export of goods, bln USD 
USD_rub_eop            USD-Rouble exchange rate, rub, end of period 
USD_rub_avg            USD-Rouble exchange rate, rub, period average

The user may also appreciate seeing the latest values and quickly browsing smaller or larger data graphs.
The user reads or downloads data via the API or a CSV file at a stable URL: either the latest vintage or a specified release.
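The last scenario could look roughly like the sketch below. Everything about the URL is an assumption: `BASE_URL`, the `latest` folder and the `df<freq>.csv` file names are hypothetical placeholders, since no URL scheme has been agreed yet. The values in the sample CSV are dummy numbers for illustration, not real data.

```python
import csv
import io

BASE_URL = "https://example.com/ru-stat"  # hypothetical, not a real endpoint
FREQUENCIES = "aqmwd"


def csv_url(freq, vintage=None):
    """Build a stable URL for one frequency.

    vintage is a (year, month) tuple; None means the latest vintage.
    The folder layout is an assumption made for this sketch.
    """
    if len(freq) != 1 or freq not in FREQUENCIES:
        raise ValueError("freq must be one of 'aqmwd', got %r" % freq)
    folder = "latest" if vintage is None else "{:04d}/{:02d}".format(*vintage)
    return "{}/{}/df{}.csv".format(BASE_URL, folder, freq)


def read_csv_text(text):
    """Parse CSV text into {label: {timestamp: value}}, skipping empty cells."""
    reader = csv.reader(io.StringIO(text))
    labels = next(reader)[1:]  # first column holds timestamps
    data = {label: {} for label in labels}
    for row in reader:
        timestamp = row[0]
        for label, cell in zip(labels, row[1:]):
            if cell != "":
                data[label][timestamp] = float(cell)
    return data


# Dummy values, for illustration only.
sample = ",AAA_rog,BBB_yoy\n2017-01-31,1.0,2.0\n2017-02-28,,3.0\n"
print(csv_url("m"))
print(csv_url("q", vintage=(2017, 5)))
print(read_csv_text(sample))
```

With pandas, the same file would more likely be read directly by `pd.read_csv(url, index_col=0)`. The standard-library parser here only keeps the sketch free of dependencies.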

List of data sources

Agencies:

Rosstat
- KEP in MS Word - two generations of parsers: KEP1 superseded by KEP2-mini
- SEP in HTML - not currently parsed
- Regional stats in Excel - parsed, but abandoned
Bank of Russia (html, api, Excel)
Minfin/Treasury (Excel, html)
Quandl (for checks, API)
IEA (oil prices, API)
MOEX?

TODO: Add links to sources and repos, ask Marcel to review/comment, make this a table?

Wins

parsing SEP on the fly can make it to the news
add seasonal adjustment
replicate some of the official Ministry of Economy presentations with updated values

Risks and checks

Is this data valid? Haven't you screwed anything up while parsing?
Are there enough variables?
I saw something on another web site and the value is different. I do not understand your data.
Time series are OK, but what about forms? A table has many time series values for one date.
No one wants our API: not enough users to get excited about feedback and get that 'problem solved' feeling

Extensions

Some time series exhibit seasonality, but there are few official estimates of seasonally adjusted data. Seasonal adjustment would make a good extension to the dataset. See detrending in R.
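To give a feel for what such an extension does, here is a deliberately naive additive sketch: it removes the average deviation of each calendar position (e.g. each month) from the overall mean. This is a stand-in for illustration only, not the method used by any statistical agency, and it ignores trend entirely. The function name `deseasonalize` is hypothetical.

```python
def deseasonalize(values, period=12):
    """Naive additive seasonal adjustment.

    For each position within the period (e.g. each calendar month),
    compute its mean deviation from the overall mean and subtract it.
    Assumes no trend; works best when len(values) is a multiple of period.
    """
    if period < 1 or len(values) < period:
        raise ValueError("need at least one full period of data")
    overall = sum(values) / len(values)
    seasonal = []
    for phase in range(period):
        same_phase = values[phase::period]  # e.g. all Januaries
        seasonal.append(sum(same_phase) / len(same_phase) - overall)
    return [v - seasonal[i % period] for i, v in enumerate(values)]


# Dummy series with a pure two-step seasonal pattern.
print(deseasonalize([10, 12, 10, 12], period=2))
```

A production version would more likely rely on an established procedure from a statistics package rather than this mean-based shortcut.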

Delete upon review:

Usercase for the project, based on:

2. Corporate reports

https://github.com/epogrebnyak/data-rosstat-boo-2013 - machine-readable dataset of 2012-2015 Russian enterprises financial reports

3. Bank reports

https://github.com/epogrebnyak/cbr-bankform-reader - Reads bank form data from DBF and text files. Emits clean data by row, storable in a database. Supports form 101 and 102.

Rotzke commented 7 years ago

Well, Vera can use the same CSV as Sasha in Excel :) Would suggest:

API for Robot and Sasha;
CSV for Sasha and Vera;
Google Charts on website with sharing option for Simon;

epogrebnyak commented 7 years ago

You have to know Vera ) she wants three frequencies of variables on different sheets followed by sheet with varnames. Cannot deny what a woman wants.

But to sort things out the priority is Sasha and Robot, and Vera and Simon as add-on features.


Rotzke commented 7 years ago

@epogrebnyak Oh, OK, if it is mandatory we could use something like "xlsxwriter" library to convert CSV into XLS to please this woman ;)

Agree with priorities.

rohlin17 commented 7 years ago

I spent six years in the Financial and Economic Department of one of the Top 50 Russian banks. Excel is absolutely the reporting standard. Yes, we developed a big data warehouse and quite a complex Business Intelligence system, but Vera will use Excel forever - she downloads Excel reports from the BI system and slices and dices them as she likes on her desktop computer. And here Excel is the best.

epogrebnyak commented 7 years ago

Todo after the live session:

write out a sample graph/pipeline of 2-3 parsers to database to API (EP+Marcel)
think of datachecks: automated validation + 'human eye' (EP+Marcel)
required AWS instances (Nikita+Boris+Dmitry)

Is this correct?

Rotzke commented 7 years ago

Actually, write 1 pipeline with requirements :)

Everything else is correct.

ru-stat / data-team-ru-stat

Live session to discuss project requirements and design Sun, June 18 21:30 (GMT+5:30) 19:00 (GMT+3) #6
