vincetran96 / Scrape-Finance-Data-v2

A standalone package to scrape financial data from listed Vietnamese companies via Vietstock
MIT License

Endless loop of downloading #5

Closed: truongphanduykhanh closed this issue 3 years ago

truongphanduykhanh commented 3 years ago

When I select the option to scrape by business ID and industry ID, the execution goes into an endless loop: the downloaded JSON files are continuously replaced by new ones with exactly the same names.

I haven't checked whether the issue also happens for mass scraping or other options.

vincetran96 commented 3 years ago

Did the program not terminate?

Also, have you inspected the log files to see if it scraped the same tickers and/or the same pages of such tickers?

truongphanduykhanh commented 3 years ago

The program did not terminate by itself. I checked the log, and indeed it keeps overwriting the downloaded files with new ones.

For example, there are 50 log lines for page 8 of CDKT for ticker NSC:

2021-08-26 07:27:23 [financeInfo] INFO: On page 8 of CDKT for NSC
2021-08-26 07:32:14 [financeInfo] INFO: On page 8 of CDKT for NSC
2021-08-26 07:40:17 [financeInfo] INFO: On page 8 of CDKT for NSC
...
2021-08-26 09:34:06 [financeInfo] INFO: On page 8 of CDKT for NSC
...
2021-08-26 10:26:31 [financeInfo] INFO: On page 8 of CDKT for NSC

vincetran96 commented 3 years ago

Interesting, thanks for pointing it out. I will have a look when I get the time, as I'm currently working full time and have other projects to catch up on 😄

vincetran96 commented 3 years ago

@truongphanduykhanh I just re-built and ran it - the program successfully terminated after about 254 seconds. For the record, I used business-industry combination 1;100, report types CDKT,KQKD, and report term 1. Please let me know which options you selected when you hit the bug.

truongphanduykhanh commented 3 years ago

Thanks for your effort to figure it out. I have discovered two issues:

  1. Mass Scraping: There are more than 3,000 tickers in the file bizType_ind_tickers.csv, but only ~600 tickers were downloaded when I mass scraped (it took 5 hours to scrape those 600 tickers with my internet connection). It missed many blue chips whose information is certainly available on Vietstock, such as VIC and GAS. Below is a count summary of total vs. downloaded tickers for the mass scraping option (a sketch of how such a summary can be reproduced follows after this list):
|    | biztype_id | ind_id | ticker | ticker_download |
|----|-----------|--------|--------|-----------------|
| TOTAL |        |        | 3129   | 657             |
| 0  | 1         | 100    | 136    | 110             |
| 1  | 1         | 200    | 81     | 30              |
| 2  | 1         | 300    | 171    | 49              |
| 3  | 1         | 400    | 598    | 49              |
| 4  | 1         | 500    | 903    | 50              |
| 5  | 1         | 600    | 192    | 51              |
| 6  | 1         | 700    | 67     | 11              |
| 7  | 1         | 800    | 221    | 50              |
| 8  | 1         | 900    | 84     | 24              |
| 9  | 1         | 1000   | 22     | 0               |
| 10 | 1         | 1100   | 3      | 0               |
| 11 | 1         | 1200   | 66     | 12              |
| 12 | 1         | 1300   | 5      | 0               |
| 13 | 1         | 1400   | 66     | 7               |
| 14 | 1         | 1500   | 4      | 0               |
| 15 | 1         | 1600   | 5      | 0               |
| 16 | 1         | 1700   | 7      | 0               |
| 17 | 1         | 1800   | 43     | 0               |
| 18 | 1         | 1900   | 6      | 0               |
| 19 | 1         | 2000   | 4      | 0               |
| 20 | 2         | 1000   | 105    | 103             |
| 21 | 3         | 1000   | 75     | 60              |
| 22 | 4         | 900    | 1      | 0               |
| 23 | 4         | 1000   | 39     | 24              |
| 24 | 4         | 1600   | 1      | 0               |
| 25 | 5         | 1000   | 32     | 19              |
| 26 | 6         | 1000   | 2      | 1               |
| 27 | 6         | 2000   | 2      | 1               |
| 28 | 7         | 1000   | 7      | 6               |
| 29 | 8         | 1200   | 181    | 0               |

Below is the terminal record:

Do you wish to mass scrape? [y/n] y
Do you wish clear ALL scraped files and kill ALL running Celery workers? [y/n] y
Clearing scraped files and all running workers, please wait...

OK
rm: cannot remove './run/celery/*': No such file or directory
rm: cannot remove './run/scrapy/*': No such file or directory
rm: cannot remove './logs/*': No such file or directory

Do you wish to start mass scraping now? Process will automatically exit when finished. [y] y
Creating Celery workers...
Error: No nodes replied within time constraint
Waiting for Celery workers to be online...
Error: No nodes replied within time constraint
Waiting for Celery workers to be online...
Error: No nodes replied within time constraint
Waiting for Celery workers to be online...
Waiting for Celery workers to be online...
Running Celery tasks for mass scrape...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
...
Exit
  2. Biz-Industry Scraping: It runs into an endless loop of overwriting the downloaded files with new ones. My options were: 1;100, CTKH,CDKT,KQKD,LC,CSTC, 1,2

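Regarding the count summary in the first issue above: a minimal sketch of how it could be reproduced, assuming bizType_ind_tickers.csv has biztype_id, ind_id and ticker columns and that each downloaded JSON file name starts with its ticker symbol (the localData/ path is a guess and may differ in your setup):

```python
# Sketch only: compare tickers listed in bizType_ind_tickers.csv against
# the JSON files actually downloaded. Column names and the localData/ path
# are assumptions, not confirmed project internals.
from pathlib import Path

import pandas as pd

tickers = pd.read_csv("bizType_ind_tickers.csv")

# Tickers that have at least one downloaded JSON file (assumes "<TICKER>_..." names).
downloaded = {p.name.split("_")[0] for p in Path("localData").glob("*.json")}
tickers["downloaded"] = tickers["ticker"].isin(downloaded)

summary = (
    tickers.groupby(["biztype_id", "ind_id"], as_index=False)
    .agg(ticker=("ticker", "count"), ticker_download=("downloaded", "sum"))
)
print(summary)
print("TOTAL", len(tickers), int(tickers["downloaded"].sum()))
```
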
vincetran96 commented 3 years ago

I can confirm the endless loop issue is happening for me too. I will see what I can do.

Regarding the mass scrape, can you please raise a separate issue? My intuition tells me it could be due to a network problem.

truongphanduykhanh commented 3 years ago

Appreciate your support Vince!

My intention was to scrape data for a machine learning project, so mass scraping is the best option for me.

If the mass option doesn't scrape all 3,000 tickers, I would manually select all the business-industry combinations instead.

vincetran96 commented 3 years ago

I see. I forgot to mention in the README that I have not tried/tested the mass scraping functionality, because there are too many tickers and financial reports to scrape. You can absolutely go that route if your research covers the whole market; otherwise I would suggest selecting and scraping only the industries you're interested in, just to avoid bombarding Vietstock with HTTP requests 😄.
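
For anyone scraping only a subset, a few standard Scrapy settings can keep the request rate polite. The values below are purely illustrative and are not taken from this project's settings.py:

```python
# Illustrative Scrapy throttling settings (standard Scrapy options; the
# values are examples, not this project's actual configuration).
DOWNLOAD_DELAY = 1.0                  # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True           # adjust delay based on server latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```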

vincetran96 commented 3 years ago

@truongphanduykhanh In the meantime, if you don't mind, please scrape the report types (e.g., CDKT, KQKD, etc.) separately. There seems to be a problem with either the LC or the CSTC report that makes the scraper get stuck in an endless loop; I have not yet found out which.
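
The looping report type can also be spotted from the financeInfo log shown earlier (lines of the form `On page <n> of <REPORT> for <TICKER>`). A small sketch, with the log file path being a guess:

```python
# Count repeated "On page X of REPORT for TICKER" lines to see which report
# type is stuck in a loop. The log file path is a guess; adjust as needed.
import re
from collections import Counter

PATTERN = re.compile(r"On page (\d+) of (\w+) for (\w+)")
counts = Counter()

with open("logs/financeInfo.log", encoding="utf-8") as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            page, report, ticker = match.groups()
            counts[(ticker, report, page)] += 1

# The combinations crawled far more than once are the ones looping.
for (ticker, report, page), n in counts.most_common(10):
    print(f"{ticker} {report} page {page}: {n} times")
```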

truongphanduykhanh commented 3 years ago

Got it, thanks. Even before the fix, your response speed is already phenomenal.

vincetran96 commented 3 years ago

@truongphanduykhanh I've attempted to work around this issue by requiring the scraper to check for pages that have already been crawled. Please try pulling the revamp branch and running it locally (you may want to run without docker-compose to save time; see the README for details) to see if the issue still persists. I am letting my computer run overnight too.

Edit: it terminated successfully using the scraping options you mentioned (1;100 CTKH,CDKT,KQKD,LC,CSTC 1,2)
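
The idea behind the workaround is simply to remember which (ticker, report type, page) combinations have already been crawled and skip re-queueing them. A simplified sketch of that check (not the actual code on the revamp branch):

```python
# Simplified illustration of the workaround: remember pages that were already
# crawled and skip them on subsequent requests. Not the revamp branch code.
class CrawledPages:
    def __init__(self):
        self._seen = set()  # holds (ticker, report_type, page) tuples

    def should_crawl(self, ticker, report_type, page):
        """Return True the first time a page is seen, False afterwards."""
        key = (ticker, report_type, page)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


pages = CrawledPages()
assert pages.should_crawl("NSC", "CDKT", 8) is True
assert pages.should_crawl("NSC", "CDKT", 8) is False  # already crawled, skip
```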

truongphanduykhanh commented 3 years ago

I've tried with options (1;100 CTKH,CDKT,KQKD,LC,CSTC 1,2). It worked perfectly!

Indeed, there are 136 tickers for {biz: 1, ind: 100} in bizType_ind_tickers.csv, and the run scraped 111 of them. I manually checked the 25 remaining tickers, and they don't have data on Vietstock.

I'll try all other combinations of biz;ind. Thank you so much.

vincetran96 commented 3 years ago

Awesome. I will merge the other branch into master later. Thanks!

vincetran96 commented 3 years ago

@truongphanduykhanh If you use this program in any of your research, please credit it where applicable. I'd appreciate it!

truongphanduykhanh commented 3 years ago

Definitely Vince.