Open · rajivharlalka opened this issue 2 months ago
This was implemented and then removed because it led to the library website dropping requests.
Did the implementation have an upper bound on the number of parallel requests being made? AFAIR no. IMO using WaitGroups to limit the number of concurrent workers to 2-3 should improve performance significantly.
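For illustration, a minimal sketch of bounding concurrency with a buffered channel as a semaphore plus a `sync.WaitGroup`. The `downloadPaper` function, the URL list, and the worker cap of 3 are placeholders, not the project's actual code:

```go
package main

import "sync"

// downloadPaper is a hypothetical stand-in for the crawler's
// fetch-parse-download step for a single paper.
func downloadPaper(url string) { /* fetch, parse, download */ }

func main() {
	urls := []string{ /* paper URLs */ }

	var wg sync.WaitGroup
	sem := make(chan struct{}, 3) // allow at most 3 concurrent workers

	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it on return
			downloadPaper(u)
		}(url)
	}
	wg.Wait()
}
```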
I don't remember exactly. BTW we won't need to implement goroutines ourselves, as colly has an option to enable async requests and also to limit them. Can test that.
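Roughly, that would look like the sketch below, using colly's `Async` collector option together with a `LimitRule` to cap parallelism. The domain glob, parallelism value, and URL are placeholders:

```go
package main

import "github.com/gocolly/colly/v2"

func main() {
	// Enable asynchronous requests on the collector.
	c := colly.NewCollector(colly.Async(true))

	// Cap concurrency so we don't overwhelm the library website.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// parse paper details / queue downloads here
	})

	c.Visit("https://example.com/papers") // placeholder URL
	c.Wait()                              // block until all async requests finish
}
```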
@rajivharlalka or @harshkhandeparkar please update the state of this issue to be reflected on the kanban.
@shikharish what should be the status of this?
It is not needed as of now. We only need to run the crawler once or twice a semester so it's very low priority.
Is it hard to do?
Not at all
Then just finish it off maybe?
No point in keeping hanging issues if they can be solved in a few minutes.
@shikharish updates?
Did some testing, and it turns out even using 2 goroutines leads to 1-2 requests being dropped. Increasing it to 6 goroutines raises that to 3-4 dropped requests.
Should we skip this one for now?
Try to implement a retry function. Also, how many requests are you able to make concurrently? Even if it's more than one, that's a win.
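One way to do a bounded retry is to hook colly's `OnError` callback and re-issue the failed request via `Request.Retry()`. This is just a sketch; the retry cap, the `"retries"` context key, and `addRetry` are assumptions, not project code:

```go
package main

import "github.com/gocolly/colly/v2"

const maxRetries = 3 // hypothetical cap

func addRetry(c *colly.Collector) {
	c.OnError(func(r *colly.Response, err error) {
		// Track attempts in the request context under a made-up key.
		retries, _ := r.Ctx.GetAny("retries").(int)
		if retries < maxRetries {
			r.Ctx.Put("retries", retries+1)
			r.Request.Retry() // re-issue the failed request
		}
	})
}

func main() {
	c := colly.NewCollector(colly.Async(true))
	addRetry(c)
	c.Visit("https://example.com/papers") // placeholder URL
	c.Wait()
}
```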
Halting this till we have time to look at it properly.
Is your feature request related to a problem? Please describe. Currently the crawler sequentially fetches each paper's details, parses them, and downloads the paper. This can be made a lot faster using goroutines.