epogrebnyak closed 7 years ago
I am in :)
I am able to join yeah!
Also, guys, please join our Slack team (not many people there for some reason): https://join.slack.com/openstat-team/shared_invite/MTk4MTUzODM3NTM4LTE0OTc1MTczODctMThhNTE3Yzc1NA
Am in too :)
I'm in
I'm in
Dear all, here is some material for requirements, to be continued. Please leave some sort of like / dislike once you have read it, but let's discuss it at the session. More parts to follow.
Enable access to continuously updated macroeconomic, corporate and bank data for Russia.
User 1 ('Sasha') is a data analyst. He is proficient in R/pandas and prefers CSV files to Excel. He knows how to present his work in Jupyter notebooks, is willing to upload to GitHub, and wants his graphs and models to be continuously updated without having to download data manually. He has worked with Quandl and FRED, and at work he has paid access to CEIC and Bloomberg.
User 2 ('Vera') is also a researcher. She is confident with Excel and can do all of her work without programming. She maintains her universe of data, graphs and models as a collection of local files. Her boss is an Excel user too, so there is little incentive for Vera to move to R/pandas. She would consider downloading the initial data as an Excel file, e.g. exchange rates or interest rates, but only once she trusts the data source. She has tried some Excel add-ons but did not like them much.
User 3 ('Simon') is a journalist. He covers the Russian economy and once in a while comments on data releases. He calls analysts in banks and research centers to find out their views on the newest inflation or household income figure and can deliver a plain-language economic story. He has a BA in Medieval Literature, so he is not into Excel at all. Simon would rather click through some graphs that let him question analysts better. He is also keen to post his story on FB or Twitter before other journalists do.
User 4 is a Robot. It learned we have some great open data and wants to borrow from us.
User 5 is a boss's secretary (give name). She wants to hand the boss a printout with "all latest macro", so the boss feels confident at a meeting. Neither is an expert in macroeconomics, but the printout allows asking questions and sustaining some discussion. See a nice example of "all macro" at the NY Fed.
Data access methods by users:
Principal:
Additional:
A macroeconomic time series dataset is a pandas/R dataframe with variable names as column names and timestamps as row labels (the index). At monthly frequency it looks something like this:
<insert example here>
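One way the monthly layout described above might look in pandas. This is only a sketch: the numbers are placeholders rather than real statistics, and the index name `time_index` is an assumption, not something we have agreed on.

```python
import pandas as pd

# Placeholder numbers for illustration only, not real statistics.
dfm = pd.DataFrame(
    {
        "CPI_yoy": [100.0, 100.0, 100.0],
        "USD_rub_eop": [1.0, 2.0, 3.0],
    },
    # Month-end timestamps serve as row labels.
    index=pd.to_datetime(["2017-01-31", "2017-02-28", "2017-03-31"]),
)
dfm.index.name = "time_index"  # hypothetical name for the index column

print(dfm)
```

Columns are variable labels and each row is one month, so adding a variable means adding a column.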
Frequencies: The dataframe may be at annual, quarterly, monthly, weekly or daily frequency. Some variables are available annually and quarterly (GDP), some are found at all frequencies (exchange rate). Let's denote frequencies by "aqmwd". Each frequency has its own dataframe. We cannot mix frequencies in a single dataframe.
Files: A dataframe is dumped to a CSV file, which is found at a stable URL or retrieved via the API.
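A minimal sketch of the "one dataframe per frequency, each dumped to CSV" idea. The helper names `to_csv_text` and `from_csv_text` and the `time_index` column name are made up for illustration, and the values are placeholders. It writes to an in-memory string instead of a real file or URL.

```python
import io

import pandas as pd

FREQUENCIES = "aqmwd"  # annual, quarterly, monthly, weekly, daily


def to_csv_text(df):
    """Dump a dataframe to CSV text, with the time index as the first column."""
    return df.to_csv(index_label="time_index")


def from_csv_text(text):
    """Read CSV text back into a dataframe with a datetime index."""
    return pd.read_csv(io.StringIO(text), index_col="time_index", parse_dates=True)


# One dataframe per frequency; values are placeholders, not real statistics.
frames = {
    "a": pd.DataFrame({"GDP_rog": [1.0]}, index=pd.to_datetime(["2016-12-31"])),
    "q": pd.DataFrame({"GDP_rog": [2.0]}, index=pd.to_datetime(["2017-03-31"])),
    "m": pd.DataFrame(
        {"CPI_yoy": [3.0, 4.0]},
        index=pd.to_datetime(["2017-01-31", "2017-02-28"]),
    ),
}
# Every key must be one of the agreed frequency letters.
assert set(frames) <= set(FREQUENCIES)

csv_by_freq = {freq: to_csv_text(df) for freq, df in frames.items()}
restored = {freq: from_csv_text(text) for freq, text in csv_by_freq.items()}

print(csv_by_freq["m"])
```

Keeping a separate CSV per frequency avoids padding annual or quarterly series with empty monthly cells.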
Namespace: a variable label consists of two parts: varname in capital letters (GDP) and unit of measurement in lowercase (rog), joined by a single underscore (GDP_rog). Parts may have their own underscores. Throughout the code I use label for the full variable name, varname for the uppercase part and unit for the lowercase part.
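The naming convention can be sketched as a pair of helpers. The function names `split_label` and `make_label` are hypothetical. The rule assumed here is that the unit starts at the first all-lowercase token.

```python
def split_label(label):
    """Split a label like 'EXPORT_GOODS_bln_usd' into (varname, unit).

    Assumption: the unit begins at the first all-lowercase token.
    """
    tokens = label.split("_")
    for i, token in enumerate(tokens):
        if token.islower():
            varname = "_".join(tokens[:i])
            unit = "_".join(tokens[i:])
            if not varname:
                raise ValueError(f"label has no varname part: {label!r}")
            return varname, unit
    raise ValueError(f"label has no lowercase unit part: {label!r}")


def make_label(varname, unit):
    """Join varname and unit back into a full label."""
    return f"{varname}_{unit}"


print(split_label("GDP_rog"))
print(split_label("EXPORT_GOODS_bln_usd"))
print(make_label("USD", "rub_eop"))
```

Since both parts may contain underscores, the split relies on letter case rather than on the position of an underscore.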
Vintage: a dataset released at some date is called a vintage. Some macroeconomic data gets revised (like GDP), so the accuracy of the first estimate (GDP for 2017Q1 released in May or June 2017) compared to the final figure (GDP for 2017Q1 published towards the end of 2017) is a subject of professional discussion. Many end users just want the latest values.
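A rough sketch of how vintages might be stored and queried, assuming each vintage is keyed by its release date. The release dates and values below are placeholders, and `latest_vintage` / `vintage_as_of` are hypothetical names.

```python
import datetime as dt


def latest_vintage(vintages):
    """Return (release_date, dataset) for the most recent release."""
    release_date = max(vintages)
    return release_date, vintages[release_date]


def vintage_as_of(vintages, date):
    """Return the most recent release published on or before the given date."""
    eligible = [d for d in vintages if d <= date]
    if not eligible:
        raise KeyError(f"no vintage released on or before {date}")
    release_date = max(eligible)
    return release_date, vintages[release_date]


# Placeholder release dates and values, not real statistics.
vintages = {
    dt.date(2017, 6, 1): {"GDP_rog": {"2017Q1": 1.0}},    # first estimate
    dt.date(2017, 12, 29): {"GDP_rog": {"2017Q1": 2.0}},  # revised figure
}

print(latest_vintage(vintages))
print(vintage_as_of(vintages, dt.date(2017, 9, 1)))
```

Most users would only ever call the "latest" path, while a specified release is there for anyone comparing first estimates with revised figures.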
The user wants to browse a description of data, like variable names and units of measurement:
GDP_rog GDP, rate of growth to previous period
CPI_yoy Consumer price index, change to year earlier
EXPORT_GOODS_bln_usd Export of goods, bln USD
USD_rub_eop USD-Rouble exchange rate, rub, end of period
USD_rub_avg USD-Rouble exchange rate, rub, period average
The user may also appreciate seeing the latest values and quickly browsing smaller or larger data graphs.
The user reads or downloads data via the API or a CSV at a stable URL - the latest vintage or a specified release.
Agencies:
TODO: Add links to sources and repos, ask Marcel to review/comment, make this a table?
Use case for the project, based on:
https://github.com/epogrebnyak/data-rosstat-boo-2013 - machine-readable dataset of 2012-2015 Russian enterprises financial reports
https://github.com/epogrebnyak/cbr-bankform-reader - Reads bank form data from DBF and text files. Emits clean data by row, storable in a database. Supports form 101 and 102.
Well, Vera can use the same CSV as Sasha in Excel :) Would suggest:
You have to know Vera :) She wants the three frequencies of variables on different sheets, followed by a sheet with varnames. Cannot deny what a woman wants.
But to sort things out: the priority is Sasha and Robot, with Vera and Simon as add-on features.
On 18 Jun 2017 at 11:44, "Nikita Rotsky" notifications@github.com wrote:
Well, Vera can use the same .CSV as Sasha in Excel :) Would suggest:
- API for Robot and Sasha;
- CSV for Sasha and Vera;
- Google Charts on website with sharing option for Simon;
@epogrebnyak Oh, OK, if it is mandatory we could use something like "xlsxwriter" library to convert CSV into XLS to please this woman ;)
Agree with priorities.
I spent six years in the Financial and Economic Department of one of the Top 50 Russian banks. Excel is the absolute reporting standard. Yes, we developed a big data warehouse and quite a complex Business Intelligence system, but Vera will use Excel forever - she downloads Excel reports from the BI system and slices and dices them as she likes on her desktop computer. And here Excel is the best.
Todo after the live session:
Is this correct?
Actually, write 1 pipeline with requirements :)
Everything else is correct.
Agenda:
Venue: https://hangouts.google.com/hangouts/_/r7fusbkh3neivjcbabvqgxigsme
Discussion materials:
Suggested reading:
Time zones: 21:30 (GMT+5:30) - India, 19:00 (GMT+3) - Moscow, Dnipro
Please drop a note below if you are able to join.