Open · rajivharlalka opened this issue 2 months ago
This was implemented and then removed because it led to the library website dropping requests.
Did the implementation have an upper bound on the number of parallel requests being made? AFAIR no. IMO using WaitGroups to limit the number of concurrent workers to 2-3 should improve performance significantly.
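For illustration, a minimal sketch of bounding concurrency with a buffered channel as a semaphore plus a `sync.WaitGroup`. The `downloadPaper` function, the URL list, and the worker cap of 3 are placeholders, not the project's actual code:

```go
package main

import "sync"

// downloadPaper is a hypothetical stand-in for the crawler's
// fetch-parse-download step for a single paper.
func downloadPaper(url string) { /* fetch, parse, download */ }

func main() {
	urls := []string{ /* paper URLs */ }

	var wg sync.WaitGroup
	sem := make(chan struct{}, 3) // allow at most 3 concurrent workers

	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it on return
			downloadPaper(u)
		}(url)
	}
	wg.Wait()
}
```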
I don't remember exactly. BTW we won't need to implement goroutines ourselves, as colly has an option to enable async requests and also to limit them. Can test that.
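Roughly, that would look like the sketch below, using colly's `Async` collector option together with a `LimitRule` to cap parallelism. The domain glob, parallelism value, and URL are placeholders:

```go
package main

import "github.com/gocolly/colly/v2"

func main() {
	// Enable asynchronous requests on the collector.
	c := colly.NewCollector(colly.Async(true))

	// Cap concurrency so we don't overwhelm the library website.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// parse paper details / queue downloads here
	})

	c.Visit("https://example.com/papers") // placeholder URL
	c.Wait()                              // block until all async requests finish
}
```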
@rajivharlalka or @harshkhandeparkar please update the state of this issue to be reflected on the kanban.
@shikharish what should be the status of this?
It is not needed as of now. We only need to run the crawler once or twice a semester so it's very low priority.
Is it hard to do?
Not at all
Then just finish it off maybe?
No point in keeping hanging issues if they can be solved in a few minutes.
@shikharish updates?
Did some testing, and it turns out even using 2 goroutines leads to 1-2 requests being dropped. Increasing it to 6 goroutines raises that to 3-4 dropped requests.
Should we skip this one for now?
Try to implement a retry function. Also, how many requests are you able to make concurrently? Even if it's more than one, that's a win.
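One way to do a bounded retry is to hook colly's `OnError` callback and re-issue the failed request via `Request.Retry()`. This is just a sketch; the retry cap, the `"retries"` context key, and `addRetry` are assumptions, not project code:

```go
package main

import "github.com/gocolly/colly/v2"

const maxRetries = 3 // hypothetical cap

func addRetry(c *colly.Collector) {
	c.OnError(func(r *colly.Response, err error) {
		// Track attempts in the request context under a made-up key.
		retries, _ := r.Ctx.GetAny("retries").(int)
		if retries < maxRetries {
			r.Ctx.Put("retries", retries+1)
			r.Request.Retry() // re-issue the failed request
		}
	})
}

func main() {
	c := colly.NewCollector(colly.Async(true))
	addRetry(c)
	c.Visit("https://example.com/papers") // placeholder URL
	c.Wait()
}
```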
Halting this till we have time to look at it properly.
Is your feature request related to a problem? Please describe. Currently the crawler sequentially fetches each paper's details, parses them, and downloads the paper. This can be made a lot faster using goroutines.