ru-stat / data-team-ru-stat

Team to tackle dataflow in Russian economic statistics (macro, corporate, banking)

List of sources for scraping #10

epogrebnyak opened this issue 7 years ago

epogrebnyak commented 7 years ago

Tentative list and questions: https://gist.github.com/epogrebnyak/ea407b5ddcee6a26a1967d71f248413c

epogrebnyak commented 7 years ago

Scraper template:

Scraper name:
Job:
Data source:
- URL:
- type of source: [Word, Excel, CSV, HTML, XML, API, other]
- frequency:
Scraper:
- code:
- control function:
Result:
- list of variables:
- endpoint URL:
Testing:
epogrebnyak commented 7 years ago
Scraper: mini-kep
Job: parse sections of the Short-term Economic Indicators (KEP) monthly Rosstat publication
Source URL: Rosstat KEP page
Source type: MS Word
Frequency: monthly
When released: start of month, as per schedule
Scraper code: https://github.com/epogrebnyak/mini-kep/tree/master/src/kep
Test health: Build Status (CI badge)
Test coverage: FIXME: so-so
Documentation: FIXME: so-so
Controller: TODO: add here
CSV endpoint: https://github.com/epogrebnyak/mini-kep/tree/master/data/processed/latest
List of variables: TODO
Frontpage data: TODO
Validation: TODO

Is all historic raw data available on the internet?

Is the scraper automated (can it download the required information from the internet without manual operations)?

TODO:

DONE:

Akrishna91 commented 7 years ago

I guess the Scraper code URL should be https://github.com/epogrebnyak/mini-kep/tree/master/src/kep

epogrebnyak commented 7 years ago

Added JSON dumps of data at https://github.com/epogrebnyak/mini-kep/tree/master/data/processed/json. @Rotzke - can you check whether you are comfortable importing this?

pd.DataFrame.to_json offers several orientations; I chose orient='records', which seemed most natural, but I'm not sure. For a relational database you would have needed rows like:

{"year":1999,"month":1,"EXPORT_GOODS_TOTAL_bln_usd":4.5}
{"year":1999,"month":1,"EXPORT_GOODS_TOTAL_rog":63.7}
etc.

at annual, quarterly and monthly frequency.
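
For reference, a minimal sketch of writing and reading such a records-oriented dump with pandas; the file name and the flat one-datapoint-per-row layout are illustrative, not taken from the repository:

```python
import pandas as pd

# Sample monthly data in a flat, one-datapoint-per-row shape (values illustrative).
df = pd.DataFrame([
    {"year": 1999, "month": 1, "varname": "EXPORT_GOODS_TOTAL_bln_usd", "value": 4.5},
    {"year": 1999, "month": 1, "varname": "EXPORT_GOODS_TOTAL_rog", "value": 63.7},
])

# orient="records" serialises the frame as a list with one dict per row.
df.to_json("monthly.json", orient="records")
df2 = pd.read_json("monthly.json", orient="records")
print(df2)
```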

Is this enough to start discussing a common database structure, or would you need more info?

@Akrishna91 - corrected URL, thanks!

Rotzke commented 7 years ago

@epogrebnyak Maybe I should tweak the parser to add JSON data directly to the DB? Why should we dump it to local files? Other than that, everything is fine.

Also, this will be a single collection (relational table) with a separate year in each document (relational row): whole-year statistics plus the monthly ones in an array, indexed by (year, month), indeed.

@MrBorusLee will create a REST API with some endpoints for it after I send him the collection structure.

epogrebnyak commented 7 years ago

> Also, this will be a single collection (relational table) with a separate year in each document (relational row): whole-year statistics plus the monthly ones in an array, indexed by (year, month), indeed.

Please note there will be data at annual, quarterly, monthly and daily frequencies. The final data will be retrieved by frequency, variable name, year and, where appropriate, month, quarter or day. I am not sure your database structure currently accounts for this. You may also align it more closely with the parser output structure or with the end-user query structure, which is not formally described yet.
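
To make the retrieval pattern concrete, an end-user query might reduce to a set of parameters like the following; the field names are hypothetical, not an agreed spec:

```python
# Hypothetical end-user query parameters (field names are illustrative):
query = {
    "freq": "m",                              # a | q | m | d
    "varname": "EXPORT_GOODS_TOTAL_bln_usd",  # variable name
    "year": 1999,
    "month": 1,                               # only meaningful for freq == "m"
}
```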

> a REST API with some endpoints

What part of the data pipeline is this? Is it for collecting from the parser or for the end user?

> Why should we dump it to local files?

The parser's job ends with the data/processed folder, as discussed earlier. There will be different parsers, and I think it is a mess when they all write to the database by themselves. In my view there should be an aggregator script that reads the parser outputs, verifies them, resolves conflicts with existing data in the database, identifies gaps in the data, and then stores the data in the database. This way we delimit responsibility between the parsers, which must not be aware of the database structure, and the database maintenance scripts; a rough sketch follows.
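
A minimal sketch of that aggregator idea, assuming parsers drop flat CSV files into data/processed and the database is a local SQLite file; the paths, column names and "latest wins" conflict rule are all assumptions for illustration:

```python
import sqlite3
from pathlib import Path

import pandas as pd

PROCESSED = Path("data/processed")  # assumed parser output folder
DB_PATH = "kep.sqlite"              # assumed local database file


def aggregate():
    con = sqlite3.connect(DB_PATH)
    con.execute("""CREATE TABLE IF NOT EXISTS monthly
                   (year INTEGER, month INTEGER, varname TEXT, value REAL,
                    PRIMARY KEY (year, month, varname))""")
    for csv_file in PROCESSED.glob("*.csv"):
        df = pd.read_csv(csv_file)
        # Verify parser output before it touches the database.
        assert {"year", "month", "varname", "value"} <= set(df.columns)
        # Resolve conflicts: here the latest parser output simply wins.
        con.executemany(
            "INSERT OR REPLACE INTO monthly VALUES (?, ?, ?, ?)",
            df[["year", "month", "varname", "value"]].itertuples(index=False),
        )
    con.commit()
    con.close()


if __name__ == "__main__":
    aggregate()
```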

Instead of the JSON parser endpoint you can also read the CSV into a pandas DataFrame, following access_data.get_dfs_from_web(). If you can do this, it eliminates the JSON layer, but you have to think about how you store data from a pd.DataFrame into the database.
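
Something along these lines, assuming the processed data is reachable as a raw CSV file on GitHub; the exact file name here is hypothetical:

```python
import pandas as pd

# Hypothetical raw URL; the real file names live under data/processed/latest
# in the mini-kep repository.
CSV_URL = ("https://raw.githubusercontent.com/epogrebnyak/mini-kep/"
           "master/data/processed/latest/df_monthly.csv")

df = pd.read_csv(CSV_URL, index_col=0, parse_dates=True)
print(df.head())
```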

For simple parsers like CBR_USD you can read into the database more easily.

Rotzke commented 7 years ago

@epogrebnyak, answering point by point:

  1. Annual, quarterly, monthly and daily data all relate to a specific year, so the DB schema is fine. Here is a blueprint of it, one possible variant:

    {
      "year": 2017,
      "data": [
        {
          "daily": [
            {"day": 1, "vars": [{"test": 88}]}
          ]
        },
        {
          "monthly": [
            {"month": 1, "vars": [{"test": 123}]},
            {"month": 2, "vars": [{"test": 456}]}
          ]
        },
        {
          "quarterly": [
            {"quarter": 1, "vars": [{"test": 789}]},
            {"quarter": 2, "vars": [{"test": 101}]}
          ]
        },
        {
          "yearly": {"vars": [{"test": 112}]}
        }
      ]
    }
  2. Yes, it is.
  3.1. OK, no problem.
  3.2. It is better to read the CSV into a dictionary with the stock csv module; anyway, I will deal with it, no problem there either. A sketch is below.
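
A minimal sketch of that stock-library CSV read; the file name and column set are assumptions for illustration:

```python
import csv

# Hypothetical processed file with columns year, month, varname, value.
with open("monthly.csv", newline="") as f:
    rows = [dict(row) for row in csv.DictReader(f)]

# Each row is now a plain dict, e.g.
# {"year": "1999", "month": "1", "varname": "EXPORT_GOODS_TOTAL_bln_usd", "value": "4.5"}
# Note that csv.DictReader yields strings, so numeric fields still need conversion.
```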
epogrebnyak commented 7 years ago

Some follow up questions:

  1. "day":1, is inconvenient for daily data, because is it usually referenced by date
  2. With database structure you mentioned we will end up writing proprietary CSV import and CSV export procedures, probably in pure python and csv module. Why not use pandas own export/import functions at least for some part of this, may be it can save some time on a prototype stage:
  3. Just to check - if I were writing a (relational/normalised) database structure for this I would use a table for each frequency (annual, qtr, month, daily) and store each datapoint as a row, possibly wrap with SQL Alchemy.

For monthly that would be:

year  month  varname                     value  date_released  source
1999  1      EXPORT_GOODS_TOTAL_bln_usd  4.5    2017-06-26     mini-kep
1999  1      EXPORT_GOODS_TOTAL_rog      63.7   2017-06-26     mini-kep

date_released and source are optional for now, but may be required in the future to save data vintages; a schema sketch follows.
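
A sketch of what the monthly table from point 3 could look like with SQLAlchemy; the table, class and database names are illustrative:

```python
from sqlalchemy import Column, Date, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class MonthlyDatapoint(Base):
    """One datapoint of a monthly time series (illustrative schema)."""
    __tablename__ = "monthly"

    year = Column(Integer, primary_key=True)
    month = Column(Integer, primary_key=True)
    varname = Column(String, primary_key=True)
    value = Column(Float, nullable=False)
    date_released = Column(Date)  # optional, kept for data vintages
    source = Column(String)       # e.g. "mini-kep"


engine = create_engine("sqlite:///kep.sqlite")  # assumed local database
Base.metadata.create_all(engine)
```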

Rotzke commented 7 years ago
  1. It's just a blueprint; it will be heavily modified.
  2. It's faster without pandas, plus pandas has no native Mongo driver.
  3. You will end up with a heavy mess, and this will require additional normalization to third normal form. NoSQL is the best solution given the data structure.