A standalone package to scrape financial data from listed Vietnamese companies via Vietstock. If you are looking for raw financial data from listed Vietnamese companies, this may help you.
The core components of this project run on Docker. You will have to build the image from source, because I have not released this project's image on Docker Hub yet.
How to get them:

- Open your browser's developer tools, go to the `Network` tab, and filter only `XHR` (or `Fetch/XHR`) requests.
- Search for the request named `financeinfo`; if you cannot find `financeinfo`, you can try clicking on the company's "Financials"/"Tài chính" tab and looking for it again.
- Go to the `Cookies` tab underneath that request and find:
  - the `vts_usr_lg` cookie, which is the `USER_COOKIE` environment variable in the config
  - the `__RequestVerificationToken` cookie, which is the `REQ_VER_TOKEN_COOKIE` environment variable in the config
- In the `Headers` tab, look for the form data `__RequestVerificationToken` parameter, which is the `REQ_VER_TOKEN_POST` environment variable in the config.

| Report type code | Meaning |
|---|---|
| CTKH | Financial targets/Chỉ Tiêu Kế Hoạch |
| CDKT | Balance sheet/Cân Đối Kế Toán |
| KQKD | Income statement/Kết Quả Kinh Doanh |
| LC | Cash flow statement/Lưu Chuyển (Tiền Tệ) |
| CSTC | Financial ratios/Chỉ Số Tài Chính |

| Report term code | Meaning |
|---|---|
| 1 | Annually |
| 2 | Quarterly |
All core functions are located within the `functions_vietstock` folder, and so are the scraped files; thus, from now on, references to the `functions_vietstock` folder will simply be written as `./`.
It should be in this area of the `docker-compose.yml` file:

```yaml
...
functions-vietstock:
  build: .
  container_name: functions-vietstock
  command: wait-for-it -s scraper-redis:6379 -t 600 -- bash
  stdin_open: true
  tty: true
  environment:
    - REDIS_HOST=scraper-redis
    - REQ_VER_TOKEN_POST=
    - REQ_VER_TOKEN_COOKIE=
    - USER_COOKIE=
...
```
In the project folder, run:

```shell
docker-compose build --no-cache && docker-compose up -d
```

Next, open the scraper container in another terminal:

```shell
docker exec -it functions-vietstock ./userinput.sh
```
Note: to stop the scraping, stop the userinput script terminal, then open another terminal and run:

```shell
docker exec -it functions-vietstock ./celery_stop.sh
```

to clean everything related to the scraping process (local scraped files are left intact).
Some questions require you to answer in a specific syntax, as follows:

- `Do you wish to scrape by a specific business type-industry or by tickers? [y for business type-industry/n for tickers]`
  - If you enter `y`, the next prompt is: `Enter business type ID and industry ID combination in the form of businesstype_id;industry_id:`. You can find these IDs in the `bizType_ind_tickers.csv` file in the scrape result folder (`./localData/overview`). Enter the combination in the form `businesstype_id;industry_id`.
  - If you enter `n`, the next prompts ask for ticker(s), which you can also find in the `bizType_ind_tickers.csv` file. `ticker`: a ticker symbol or a list of ticker symbols of your choice. You can enter either `ticker_1` or `ticker_1,ticker_2`.
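The two answer syntaxes above can be validated with a couple of small helpers; this is a sketch of the expected formats, not code from this project:

```python
import re

def parse_biztype_industry(answer: str) -> tuple[int, int]:
    """Parse an answer in the form 'businesstype_id;industry_id', e.g. '3;1000'."""
    match = re.fullmatch(r"\s*(\d+)\s*;\s*(\d+)\s*", answer)
    if match is None:
        raise ValueError("expected the form businesstype_id;industry_id, e.g. 3;1000")
    return int(match.group(1)), int(match.group(2))

def parse_tickers(answer: str) -> list[str]:
    """Parse 'ticker_1' or 'ticker_1,ticker_2' into a list of symbols."""
    tickers = [t.strip().upper() for t in answer.split(",") if t.strip()]
    if not tickers:
        raise ValueError("expected at least one ticker symbol")
    return tickers
```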
Whether you chose to scrape by business type-industry or by tickers, you will receive a prompt for report type(s), report term(s) and page:

- `report_type` and `report_term`: use the report type codes and report term codes in the tables above. You can enter either `report_type_1` or `report_type_1,report_type_2`. The same goes for report term.
- `page`: the page number from which to start the scrape; this is optional. If omitted, the scraper will start from page 1.
Maybe you do not want to spend time building the image and just want to play around with the code.
In your virtual environment of choice, install all requirements:

```shell
pip install -r requirements.txt
```
Navigate to the `functions_vietstock` folder and create a file named `.env` with the following content (you can use the `.example_env` file as a template):

```
REDIS_HOST=localhost
REQ_VER_TOKEN_POST=YOUR_REQ_VER_TOKEN_POST
REQ_VER_TOKEN_COOKIE=YOUR_REQ_VER_TOKEN_COOKIE
USER_COOKIE=YOUR_USER_COOKIE
```
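However the project actually loads its settings, the scraper cannot work without these four variables; a minimal sketch of collecting them from the environment and failing fast when one is missing (the helper name is hypothetical):

```python
import os

REQUIRED_VARS = ("REDIS_HOST", "REQ_VER_TOKEN_POST",
                 "REQ_VER_TOKEN_COOKIE", "USER_COOKIE")

def load_config(environ=os.environ) -> dict:
    """Collect the scraper's settings from environment variables,
    raising immediately if any of them is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in REQUIRED_VARS}
```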
You still need to run the Redis server inside a container:

```shell
docker run -d -p 6379:6379 --rm --name scraper-redis redis:6.2
```
Go to the `functions_vietstock` folder:

```shell
cd functions_vietstock
```

Run the `celery_stop.sh` script:

```shell
./celery_stop.sh
```
Use the `./userinput.sh` script to scrape as in the previous section.
If you chose to scrape a list of all business types, industries and their tickers, the result is stored in the `./localData/overview` folder, under the file name `bizType_ind_tickers.csv`.
```
ticker,biztype_id,bizType_title,ind_id,ind_name
BID,3,Bank,1000,Finance and Insurance
CTG,3,Bank,1000,Finance and Insurance
VCB,3,Bank,1000,Finance and Insurance
TCB,3,Bank,1000,Finance and Insurance
...
```
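This file is a plain CSV, so looking up the tickers for one `businesstype_id;industry_id` combination is straightforward; a sketch using the standard library:

```python
import csv
import io

def tickers_for(csv_text: str, biztype_id: str, ind_id: str) -> list[str]:
    """Return the ticker symbols for one businesstype_id;industry_id combination."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["ticker"] for row in reader
            if row["biztype_id"] == biztype_id and row["ind_id"] == ind_id]

# Sample rows taken from the CSV excerpt above:
sample = """ticker,biztype_id,bizType_title,ind_id,ind_name
BID,3,Bank,1000,Finance and Insurance
CTG,3,Bank,1000,Finance and Insurance
"""
print(tickers_for(sample, "3", "1000"))  # ['BID', 'CTG']
```

In practice you would read `./localData/overview/bizType_ind_tickers.csv` from disk instead of a string.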
FinanceInfo results are stored in the `./localData/financeInfo` folder, and each file is named in the form `ticker_reportType_reportTermName_page.json`, representing one ticker - report type - report term - page instance.
```json
[
    [
        {
            "ID": 4,
            "Row": 4,
            "CompanyID": 2541,
            "YearPeriod": 2017,
            "TermCode": "N",
            "TermName": "Năm",
            "TermNameEN": "Year",
            "ReportTermID": 1,
            "DisplayOrdering": 1,
            "United": "HN",
            "AuditedStatus": "KT",
            "PeriodBegin": "201701",
            "PeriodEnd": "201712",
            "TotalRow": 14,
            "BusinessType": 1,
            "ReportNote": null,
            "ReportNoteEn": null
        },
        {
            "ID": 3,
            "Row": 3,
            "CompanyID": 2541,
            "YearPeriod": 2018,
            "TermCode": "N",
            "TermName": "Năm",
            "TermNameEN": "Year",
            "ReportTermID": 1,
            "DisplayOrdering": 1,
            "United": "HN",
            "AuditedStatus": "KT",
            "PeriodBegin": "201801",
            "PeriodEnd": "201812",
            "TotalRow": 14,
            "BusinessType": 1,
            "ReportNote": null,
            "ReportNoteEn": null
        },
        {
            "ID": 2,
            "Row": 2,
            "CompanyID": 2541,
            "YearPeriod": 2019,
            "TermCode": "N",
            "TermName": "Năm",
            "TermNameEN": "Year",
            "ReportTermID": 1,
            "DisplayOrdering": 1,
            "United": "HN",
            "AuditedStatus": "KT",
            "PeriodBegin": "201901",
            "PeriodEnd": "201912",
            "TotalRow": 14,
            "BusinessType": 1,
            "ReportNote": null,
            "ReportNoteEn": null
        },
        {
            "ID": 1,
            "Row": 1,
            "CompanyID": 2541,
            "YearPeriod": 2020,
            "TermCode": "N",
            "TermName": "Năm",
            "TermNameEN": "Year",
            "ReportTermID": 1,
            "DisplayOrdering": 1,
            "United": "HN",
            "AuditedStatus": "KT",
            "PeriodBegin": "202001",
            "PeriodEnd": "202112",
            "TotalRow": 14,
            "BusinessType": 1,
            "ReportNote": null,
            "ReportNoteEn": null
        }
    ],
    {
        "Balance Sheet": [
            {
                "ID": 1,
                "ReportNormID": 2995,
                "Name": "TÀI SẢN ",
                "NameEn": "ASSETS",
                "NameMobile": "TÀI SẢN ",
                "NameMobileEn": "ASSETS",
                "CssStyle": "MaxB",
                "Padding": "Padding1",
                "ParentReportNormID": 2995,
                "ReportComponentName": "Cân đối kế toán",
                "ReportComponentNameEn": "Balance Sheet",
                "Unit": null,
                "UnitEn": null,
                "OrderType": null,
                "OrderingComponent": null,
                "RowNumber": null,
                "ReportComponentTypeID": null,
                "ChildTotal": 0,
                "Levels": 0,
                "Value1": null,
                "Value2": null,
                "Value3": null,
                "Value4": null,
                "Vl": null,
                "IsShowData": true
            },
            {
                "ID": 2,
                "ReportNormID": 3000,
                "Name": "A. TÀI SẢN NGẮN HẠN",
                "NameEn": "A. SHORT-TERM ASSETS",
                "NameMobile": "A. TÀI SẢN NGẮN HẠN",
                "NameMobileEn": "A. SHORT-TERM ASSETS",
                "CssStyle": "LargeB",
                "Padding": "Padding1",
                "ParentReportNormID": 2996,
                "ReportComponentName": "Cân đối kế toán",
                "ReportComponentNameEn": "Balance Sheet",
                "Unit": null,
                "UnitEn": null,
                "OrderType": null,
                "OrderingComponent": null,
                "RowNumber": null,
                "ReportComponentTypeID": null,
                "ChildTotal": 25,
                "Levels": 1,
                "Value1": 4496051.0,
                "Value2": 4971364.0,
                "Value3": 3989369.0,
                "Value4": 2142717.0,
                "Vl": null,
                "IsShowData": true
            },
            ...
```
Please note that you have to determine whether the order of the financial values (`Value1` through `Value4`) matches the order of the periods listed in the first element of the file.
Scrape logs are stored in the `./logs` folder, in the form of `scrapySpiderName_log_verbose.log`.

Error logs are stored in the `./logs` folder, in the form of `scrapySpiderName_reportType_spidererrors_short.log`. For now, error logs are used only for the financeInfo Spider.
"Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker." See: https://redis.io/. In this project, Redis serves as a message broker and an in-memory queue for Scrapy. No non-standard Redis configurations were made for this project.
To open an interactive shell with Redis, you have to enter the container first:

```shell
docker exec -it functions-vietstock bash
```

Then:

```shell
redis-cli -h scraper-redis
```
Alternatively, to open an interactive shell with Redis directly in the Redis container:

```shell
docker exec -it scraper-redis redis-cli
```
Look inside each log file.
This scraper utilizes scrapy-redis and Redis to crawl and scrape tickers' information from a top-down approach (going from business types, then industries, then tickers in each business type-industry combination) by passing necessary information into Redis queues for different Spiders to consume.
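As an illustration of that queue-passing mechanism, a scrapy-redis Spider consumes items pushed onto its Redis key. The sketch below builds one such queue item; the key name and payload shape are assumptions for illustration, not this project's actual ones:

```python
import json

def build_queue_item(ticker: str, report_type: str,
                     report_term: int, page: int = 1) -> str:
    """Serialize one scrape task; the payload shape is hypothetical."""
    return json.dumps({
        "ticker": ticker,
        "report_type": report_type,  # e.g. "CDKT"
        "report_term": report_term,  # 1 = annually, 2 = quarterly
        "page": page,
    })

# Pushing to the queue would look like this (requires a running Redis
# server and the `redis` package; the key name is an assumption):
#
# import redis
# r = redis.Redis(host="scraper-redis", port=6379)
# r.lpush("financeInfo:start_urls", build_queue_item("BID", "CDKT", 1))
```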