ru-stat / data-team-ru-stat

Team to tackle dataflow in Russian economic statistics (macro, corporate, banking)

List of sources for scraping #10

epogrebnyak opened this issue 7 years ago

epogrebnyak commented 7 years ago

Tentative list and questions: https://gist.github.com/epogrebnyak/ea407b5ddcee6a26a1967d71f248413c

epogrebnyak commented 7 years ago

Scraper template:

Scraper name:
Job:
Data source:
- URL:
- type of source: [Word, Excel, CSV, HTML, XML, API, other]
- frequency:
Scraper:
- code:
- control function:
Result:
- list of variables:
- endpoint URL:
Testing:
epogrebnyak commented 7 years ago
Scraper: mini-kep
Job: parse sections of the Short-term Economic Indicators (KEP) monthly Rosstat publication
Source URL: Rosstat KEP page
Source type: MS Word
Frequency: monthly
When released: start of month, as per schedule
Scraper code: https://github.com/epogrebnyak/mini-kep/tree/master/src/kep
Test health: Build Status (CI badge)
Test coverage: FIXME: so-so
Documentation: FIXME: so-so
Controller: TODO: add here
CSV endpoint: https://github.com/epogrebnyak/mini-kep/tree/master/data/processed/latest
List of variables: TODO
Frontpage data: TODO
Validation: TODO

Is all historic raw data available on the internet?

Is the scraper automated (can it download the required information from the internet without manual operations)?

TODO:

DONE:

Akrishna91 commented 7 years ago

I guess the Scraper code URL should be https://github.com/epogrebnyak/mini-kep/tree/master/src/kep

epogrebnyak commented 7 years ago

Added JSON dumps of data at https://github.com/epogrebnyak/mini-kep/tree/master/data/processed/json. @Rotzke - can you check whether you are comfortable importing this?

pd.DataFrame.to_json offers several orientations; I chose orient='records', which seemed most natural, but I'm not sure. For a relational database you would have needed rows like:

{"year":1999,"month":1,"EXPORT_GOODS_TOTAL_bln_usd":4.5}
{"year":1999,"month":1,"EXPORT_GOODS_TOTAL_rog":63.7}
etc.

at annual, quarterly and monthly frequency.
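
For reference, a minimal sketch of writing and reading such a records-oriented dump with pandas; the file name and the flat one-datapoint-per-row layout are illustrative, not taken from the repository:

```python
import pandas as pd

# Sample monthly data in a flat, one-datapoint-per-row shape (values illustrative).
df = pd.DataFrame([
    {"year": 1999, "month": 1, "varname": "EXPORT_GOODS_TOTAL_bln_usd", "value": 4.5},
    {"year": 1999, "month": 1, "varname": "EXPORT_GOODS_TOTAL_rog", "value": 63.7},
])

# orient="records" serialises the frame as a list with one dict per row.
df.to_json("monthly.json", orient="records")
df2 = pd.read_json("monthly.json", orient="records")
print(df2)
```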

Is this enough to start discussing a common database structure, or would you need more info?

@Akrishna91 - corrected URL, thanks!

Rotzke commented 7 years ago

@epogrebnyak Maybe I should tweak the parser to add JSON data directly to the DB? Why should we dump it to local files? Other than that, everything is fine.

Also, this will be a single collection (relational table) with a separate year in each document (relational row): whole-year statistics plus the monthly ones in an array, indexed by (year, month), indeed.

@MrBorusLee will create a REST API with some endpoints for it after I send him the collection structure.

epogrebnyak commented 7 years ago

> Also, this will be a single collection (relational table) with a separate year in each document (relational row): whole-year statistics plus the monthly ones in an array, indexed by (year, month), indeed.

Please note there will be data at annual, quarterly, monthly and daily frequencies. The final data will be retrieved by frequency, variable name, year and, where appropriate, month, quarter or day. I am not sure your database structure currently accounts for this. You may also align it more closely with the parser output structure or with the end-user query structure, which is not formally described yet.
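
To make the retrieval pattern concrete, an end-user query might reduce to a set of parameters like the following; the field names are hypothetical, not an agreed spec:

```python
# Hypothetical end-user query parameters (field names are illustrative):
query = {
    "freq": "m",                              # a | q | m | d
    "varname": "EXPORT_GOODS_TOTAL_bln_usd",  # variable name
    "year": 1999,
    "month": 1,                               # only meaningful for freq == "m"
}
```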

> a REST API with some endpoints

What part of the data pipeline is this? Is it for collecting from the parser or for the end user?

> Why should we dump it to local files?

The parser's job ends with the data/processed folder, as discussed earlier. There will be different parsers, and I think it is a mess when they all write to the database by themselves. In my view there should be an aggregator script that reads the parser outputs, verifies them, resolves conflicts with existing data in the database, identifies gaps in the data, and then stores the data in the database. This way we delimit responsibility between the parsers, which must not be aware of the database structure, and the database maintenance scripts; a rough sketch follows.
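
A minimal sketch of that aggregator idea, assuming parsers drop flat CSV files into data/processed and the database is a local SQLite file; the paths, column names and "latest wins" conflict rule are all assumptions for illustration:

```python
import sqlite3
from pathlib import Path

import pandas as pd

PROCESSED = Path("data/processed")  # assumed parser output folder
DB_PATH = "kep.sqlite"              # assumed local database file


def aggregate():
    con = sqlite3.connect(DB_PATH)
    con.execute("""CREATE TABLE IF NOT EXISTS monthly
                   (year INTEGER, month INTEGER, varname TEXT, value REAL,
                    PRIMARY KEY (year, month, varname))""")
    for csv_file in PROCESSED.glob("*.csv"):
        df = pd.read_csv(csv_file)
        # Verify parser output before it touches the database.
        assert {"year", "month", "varname", "value"} <= set(df.columns)
        # Resolve conflicts: here the latest parser output simply wins.
        con.executemany(
            "INSERT OR REPLACE INTO monthly VALUES (?, ?, ?, ?)",
            df[["year", "month", "varname", "value"]].itertuples(index=False),
        )
    con.commit()
    con.close()


if __name__ == "__main__":
    aggregate()
```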

Instead of the JSON parser endpoint you can also read the CSV into a pandas DataFrame, following access_data.get_dfs_from_web(). If you can do this, it eliminates the JSON layer, but you have to think about how you store data from a pd.DataFrame into the database.
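
Something along these lines, assuming the processed data is reachable as a raw CSV file on GitHub; the exact file name here is hypothetical:

```python
import pandas as pd

# Hypothetical raw URL; the real file names live under data/processed/latest
# in the mini-kep repository.
CSV_URL = ("https://raw.githubusercontent.com/epogrebnyak/mini-kep/"
           "master/data/processed/latest/df_monthly.csv")

df = pd.read_csv(CSV_URL, index_col=0, parse_dates=True)
print(df.head())
```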

For simple parsers like CBR_USD you can read into the database more easily.

Rotzke commented 7 years ago

@epogrebnyak, answering point by point:

  1. Annual, quarterly, monthly and daily data all relate to a specific year, so the DB schema is fine. Here is a blueprint of it, one possible variant:

    {
      "year": 2017,
      "data": [
        {
          "daily": [
            {"day": 1, "vars": [{"test": 88}]}
          ]
        },
        {
          "monthly": [
            {"month": 1, "vars": [{"test": 123}]},
            {"month": 2, "vars": [{"test": 456}]}
          ]
        },
        {
          "quarterly": [
            {"quarter": 1, "vars": [{"test": 789}]},
            {"quarter": 2, "vars": [{"test": 101}]}
          ]
        },
        {
          "yearly": {"vars": [{"test": 112}]}
        }
      ]
    }
  2. Yes, it is.
  3.1. OK, no problem.
  3.2. It is better to read the CSV into a dictionary with the stock csv module; anyway, I will deal with it, no problem there either. A sketch is below.
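
A minimal sketch of that stock-library CSV read; the file name and column set are assumptions for illustration:

```python
import csv

# Hypothetical processed file with columns year, month, varname, value.
with open("monthly.csv", newline="") as f:
    rows = [dict(row) for row in csv.DictReader(f)]

# Each row is now a plain dict, e.g.
# {"year": "1999", "month": "1", "varname": "EXPORT_GOODS_TOTAL_bln_usd", "value": "4.5"}
# Note that csv.DictReader yields strings, so numeric fields still need conversion.
```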
epogrebnyak commented 7 years ago

Some follow up questions:

  1. "day":1, is inconvenient for daily data, because is it usually referenced by date
  2. With database structure you mentioned we will end up writing proprietary CSV import and CSV export procedures, probably in pure python and csv module. Why not use pandas own export/import functions at least for some part of this, may be it can save some time on a prototype stage:
  3. Just to check - if I were writing a (relational/normalised) database structure for this I would use a table for each frequency (annual, qtr, month, daily) and store each datapoint as a row, possibly wrap with SQL Alchemy.

For monthly that would be:

year  month  varname                     value  date_released  source
1999  1      EXPORT_GOODS_TOTAL_bln_usd  4.5    2017-06-26     mini-kep
1999  1      EXPORT_GOODS_TOTAL_rog      63.7   2017-06-26     mini-kep

date_released and source are optional for now, but may be required in the future to save data vintages; a schema sketch follows.
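
A sketch of what the monthly table from point 3 could look like with SQLAlchemy; the table, class and database names are illustrative:

```python
from sqlalchemy import Column, Date, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class MonthlyDatapoint(Base):
    """One datapoint of a monthly time series (illustrative schema)."""
    __tablename__ = "monthly"

    year = Column(Integer, primary_key=True)
    month = Column(Integer, primary_key=True)
    varname = Column(String, primary_key=True)
    value = Column(Float, nullable=False)
    date_released = Column(Date)  # optional, kept for data vintages
    source = Column(String)       # e.g. "mini-kep"


engine = create_engine("sqlite:///kep.sqlite")  # assumed local database
Base.metadata.create_all(engine)
```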

Rotzke commented 7 years ago
  1. It's just a blueprint; it will be heavily modified.
  2. It's faster without pandas, plus pandas has no native Mongo driver.
  3. You will end up with a heavy mess, and this will require additional normalization to third normal form. NoSQL is the best solution given the data structure.