truongphanduykhanh closed this issue 3 years ago
Did the program not terminate?
Also, have you inspected the log files to see if it scraped the same tickers and/or the same pages of those tickers?
The program did not terminate by itself. I checked the log, and indeed it keeps overwriting files with new ones.

For example, there are 50 log lines for scraping page 8 of CDKT for ticker NSC:

```
2021-08-26 07:27:23 [financeInfo] INFO: On page 8 of CDKT for NSC
2021-08-26 07:32:14 [financeInfo] INFO: On page 8 of CDKT for NSC
2021-08-26 07:40:17 [financeInfo] INFO: On page 8 of CDKT for NSC
...
2021-08-26 09:34:06 [financeInfo] INFO: On page 8 of CDKT for NSC
...
2021-08-26 10:26:31 [financeInfo] INFO: On page 8 of CDKT for NSC
```
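To quantify the repetition, duplicate log messages can be counted with a short script. A minimal sketch, assuming the log is at logs/financeInfo.log (my guess at the path, not necessarily the project's actual layout) and every line starts with a 19-character timestamp:

```python
from collections import Counter

# Count repeated log messages after stripping the leading timestamp.
# logs/financeInfo.log is an assumed path; adjust to the scraper's layout.
with open("logs/financeInfo.log", encoding="utf-8") as f:
    messages = Counter(line[20:].strip() for line in f if line.strip())

# A healthy run should show counts close to 1 for "On page X of Y for Z" lines.
for message, count in messages.most_common(10):
    print(f"{count:5d}  {message}")
```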
Interesting, thanks for pointing it out. I will have a look when I get the time, as I'm currently working full time and have other projects to catch up on 😄
@truongphanduykhanh I just re-built and ran it: the program successfully terminated after about 254 seconds. For the record, I used business-industry combination 1;100, report types CDKT,KQKD, and report term 1. Please let me know which options you selected that caused the bug.
Thanks for your effort to figure it out. I have discovered 2 issues:

1. The mass scrape misses many tickers. There are 3,129 tickers in bizType_ind_tickers.csv, but only ~600 of them were downloaded when I mass scraped (it took 5 hours to scrape those 600 tickers with my internet connection). It missed many blue chips whose information is certainly available on Vietstock, such as VIC and GAS. Below is a count summary of total tickers versus downloaded tickers for the mass scraping option (a sketch for reproducing this count follows the table):

 | biztype_id | ind_id | ticker | ticker_download |
---|---|---|---|---|
TOTAL | | | 3129 | 657 |
0 | 1 | 100 | 136 | 110 |
1 | 1 | 200 | 81 | 30 |
2 | 1 | 300 | 171 | 49 |
3 | 1 | 400 | 598 | 49 |
4 | 1 | 500 | 903 | 50 |
5 | 1 | 600 | 192 | 51 |
6 | 1 | 700 | 67 | 11 |
7 | 1 | 800 | 221 | 50 |
8 | 1 | 900 | 84 | 24 |
9 | 1 | 1000 | 22 | 0 |
10 | 1 | 1100 | 3 | 0 |
11 | 1 | 1200 | 66 | 12 |
12 | 1 | 1300 | 5 | 0 |
13 | 1 | 1400 | 66 | 7 |
14 | 1 | 1500 | 4 | 0 |
15 | 1 | 1600 | 5 | 0 |
16 | 1 | 1700 | 7 | 0 |
17 | 1 | 1800 | 43 | 0 |
18 | 1 | 1900 | 6 | 0 |
19 | 1 | 2000 | 4 | 0 |
20 | 2 | 1000 | 105 | 103 |
21 | 3 | 1000 | 75 | 60 |
22 | 4 | 900 | 1 | 0 |
23 | 4 | 1000 | 39 | 24 |
24 | 4 | 1600 | 1 | 0 |
25 | 5 | 1000 | 32 | 19 |
26 | 6 | 1000 | 2 | 1 |
27 | 6 | 2000 | 2 | 1 |
28 | 7 | 1000 | 7 | 6 |
29 | 8 | 1200 | 181 | 0 |
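The summary above can be reproduced by joining the ticker list against the downloaded files. A minimal sketch, assuming bizType_ind_tickers.csv has columns biztype_id, ind_id, ticker, and that downloaded JSON filenames begin with the ticker symbol under ./data (both assumptions, not verified against the repo):

```python
from pathlib import Path

import pandas as pd

# Assumed layout: bizType_ind_tickers.csv has columns biztype_id, ind_id, ticker,
# and scraped JSON files land in ./data with names like "NSC_CDKT_1.json".
tickers = pd.read_csv("bizType_ind_tickers.csv")
downloaded = {p.name.split("_")[0] for p in Path("data").glob("*.json")}

tickers["downloaded"] = tickers["ticker"].isin(downloaded)
summary = (
    tickers.groupby(["biztype_id", "ind_id"])
    .agg(ticker=("ticker", "count"), ticker_download=("downloaded", "sum"))
    .reset_index()
)
print(summary.to_string(index=False))
print("TOTAL:", summary["ticker"].sum(), "tickers,",
      summary["ticker_download"].sum(), "downloaded")
```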
Following is the terminal record:

```
Do you wish to mass scrape? [y/n] y
Do you wish clear ALL scraped files and kill ALL running Celery workers? [y/n] y
Clearing scraped files and all running workers, please wait...
OK
rm: cannot remove './run/celery/*': No such file or directory
rm: cannot remove './run/scrapy/*': No such file or directory
rm: cannot remove './logs/*': No such file or directory
Do you wish to start mass scraping now? Process will automatically exit when finished. [y] y
Creating Celery workers...
Error: No nodes replied within time constraint
Waiting for Celery workers to be online...
Error: No nodes replied within time constraint
Waiting for Celery workers to be online...
Error: No nodes replied within time constraint
Waiting for Celery workers to be online...
Waiting for Celery workers to be online...
Running Celery tasks for mass scrape...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
Scrapy is still running...
...
Exit
```
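Side note: the "Error: No nodes replied within time constraint" lines appear to be what celery inspect ping prints when no worker answers yet, and the wrapper seems to retry until one does. A rough sketch of that kind of poll loop (my guess at the logic, with a placeholder app name, not the project's actual code):

```python
import subprocess
import time

# "proj" is a placeholder Celery app module, not this project's actual name.
while True:
    result = subprocess.run(
        ["celery", "-A", "proj", "inspect", "ping", "--timeout", "5"],
        capture_output=True, text=True,
    )
    if result.returncode == 0 and "pong" in result.stdout:
        break  # at least one worker replied to the ping
    print("Waiting for Celery workers to be online...")
    time.sleep(5)
```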
2. The endless loop: the options that caused it were business-industry combination 1;100, report types CTKH,CDKT,KQKD,LC,CSTC, and report terms 1,2.
I can confirm the endless loop issue is happening for me too. I will see what I can do.
Regarding the mass scrape, can you please raise a separate issue? My intuition tells me it may be due to a network problem.
Appreciate your support Vince!
My intention was to scrape data for a machine learning project, so mass scraping is the best option for me. If the mass option does not scrape all ~3,000 tickers, I will manually select all the business-industry combinations instead.
I see. I forgot to mention in the README that I have not tried/tested the mass scraping functionality, because there are too many tickers and financial reports to scrape. You can absolutely go that route if your research is about the whole market; otherwise I would suggest you select and scrape only the industries you're interested in, just to avoid bombarding Vietstock with HTTP requests 😄.
@truongphanduykhanh In the meantime, if you don't mind, please scrape the report types (e.g., CDKT, KQKD, etc.) separately. There seems to be a problem with either the LC or CSTC report that makes the scraper get stuck in an endless loop. I have not yet found out why.
Got it. Thanks. Even without a fix yet, your response speed is already phenomenal.
@truongphanduykhanh I've attempted to work around this issue by requiring the scraper to check for pages that have already been crawled. Please try pulling the revamp branch and running it locally (you may want to run without docker-compose to save time; see the README for details) to see if the issue still persists. I myself am letting my computer run overnight too.

Edit: it terminated successfully using your mentioned scraping options (1;100, CTKH,CDKT,KQKD,LC,CSTC, 1,2).
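For context, the idea of the workaround is just to remember which pages have been requested and skip duplicates. A minimal sketch of that idea (illustrative names, not the actual revamp-branch code):

```python
# Simplified sketch of the "skip already-crawled pages" idea: keep a set of
# (ticker, report_type, page) keys and refuse to re-request a page already seen.
crawled: set[tuple[str, str, int]] = set()

def should_crawl(ticker: str, report_type: str, page: int) -> bool:
    key = (ticker, report_type, page)
    if key in crawled:
        return False  # already fetched; don't loop back to the same page
    crawled.add(key)
    return True

# In the spider's parse callback, only yield a request for the next page
# when should_crawl(...) returns True; otherwise the cycle is broken.
```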
I've tried with options (1;100, CTKH,CDKT,KQKD,LC,CSTC, 1,2). It worked perfectly!

Indeed, there are 136 tickers for {biz: 1; ind: 100} in bizType_ind_tickers.csv. The options scraped 111 of them. I've manually checked the remaining 25 tickers, and they don't have data on Vietstock.

I'll try all the other biz;ind combinations. Thank you so much.
Awesome. I will merge the other branch into master later. Thanks!
@truongphanduykhanh If you use this program in any of your research, please credit it where applicable. I'd appreciate it!
Definitely Vince.
When I select the option to scrape by business ID and industry ID, the execution goes into an endless loop: the downloaded JSON files are continuously replaced by new ones with exactly the same names.

I haven't checked whether the issue happens for mass scraping or other options.
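To observe the symptom directly, one can watch the output directory for files whose modification time keeps changing. A minimal sketch, assuming the scraper writes JSON files to ./data (the path is a guess):

```python
import time
from pathlib import Path

# Watch for JSON files being re-written in place: same name, newer mtime.
# "data" is an assumed output directory; point this at the scraper's real one.
seen: dict[str, float] = {}
while True:
    for path in Path("data").glob("*.json"):
        mtime = path.stat().st_mtime
        if path.name in seen and mtime > seen[path.name]:
            print(f"{path.name} was overwritten at {time.ctime(mtime)}")
        seen[path.name] = mtime
    time.sleep(10)
```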