waiyan93 / web-scraping


Separate each keyword to its own background job #9

Open · olivierobert opened this issue 2 years ago

olivierobert commented 2 years ago

Issue

`ScrapeGoolgeSearchResult` loops through a list of keywords in batches of 10. As a result, any unhandled exception breaks the scraping for several keywords instead of failing for only one. Since the UI layer is tied to the `is_scraped` attribute, this can lead to an endless loop.

In addition, in the current implementation, updating the `is_scraped` attribute is the responsibility of the CSV controller, which is not ideal: if the controller action fails with any unhandled exception, the attribute is never updated 😭

Overall, managing the scraping status flag at the CSV-file level makes things more complicated than they need to be.
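
For context, the current flow presumably has roughly this shape (a hypothetical reconstruction; only the class name and the batch size come from this issue, the other names are assumptions):

```ruby
class ScrapeGoolgeSearchResult
  include Sidekiq::Job

  # One job receives a whole batch of keyword ids, so a single unhandled
  # exception aborts scraping for every remaining keyword in the batch.
  def perform(keyword_ids)
    Keyword.where(id: keyword_ids).find_each do |keyword|
      # `scrape` stands in for the actual scraping logic.
      keyword.update!(search_results: scrape(keyword.value))
    end
  end
end
```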

Expected

The result of the scraping is managed at the keyword level, and each keyword is processed separately in its own background job.
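
A minimal sketch of what a per-keyword job could look like, assuming Sidekiq and a `Keyword` model with `is_scraped` and `last_error` columns (the model, columns, and `GoogleScraper` names are assumptions):

```ruby
class ScrapeKeywordJob
  include Sidekiq::Job

  def perform(keyword_id)
    keyword = Keyword.find(keyword_id)

    # GoogleScraper.search is a stand-in for the actual scraping call.
    results = GoogleScraper.search(keyword.value)
    keyword.update!(search_results: results, is_scraped: true)
  rescue StandardError => e
    # A failure is recorded on this keyword only; jobs for the other
    # keywords are unaffected and keep running.
    keyword.update!(is_scraped: false, last_error: e.message)
    raise # let Sidekiq's retry policy take over from here
  end
end
```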

waiyan93 commented 2 years ago

The `is_scraped` flag on the CSV checks whether all keywords have been scraped by counting the records created for that CSV. Running a background job per batch of 10 keywords while counting created records can produce duplicates and make the program run forever. We should move each keyword into its own job so that errors can be handled per keyword, and use batch job processing so that all jobs run in a single batch whose id the CSV can reference to check whether every job has completed.
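
One way to get such a batch id is Sidekiq Pro's batch API (the open-source `sidekiq-batch` gem exposes a compatible interface). A sketch, reusing the hypothetical `ScrapeKeywordJob` from above and assuming a `batch_id` column on the CSV model:

```ruby
# Fires once every job in the batch has run, whether it succeeded or not.
class ScrapeBatchCallback
  def on_complete(status, options)
    csv = Csv.find(options["csv_id"])
    csv.update!(is_scraped: status.failures.zero?)
  end
end

batch = Sidekiq::Batch.new
batch.description = "Scrape keywords for CSV ##{csv.id}"
batch.on(:complete, ScrapeBatchCallback, "csv_id" => csv.id)

# Enqueue one independent job per keyword inside the batch.
batch.jobs do
  csv.keywords.find_each do |keyword|
    ScrapeKeywordJob.perform_async(keyword.id)
  end
end

# Persist the batch id so the CSV can look up completion later.
csv.update!(batch_id: batch.bid)
```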

waiyan93 commented 2 years ago

I would like to know: will we abort the scraping and remove all results related to that CSV, or just let the remaining background jobs continue?

olivierobert commented 2 years ago

Since the scraping would be separated by keyword, each job would be independent. If one of the jobs fails, then the `is_scraped` status for the CSV would be false, and we could show which keyword it failed for. However, we would still be able to show the scraped results for all the keywords whose scraping job succeeded.
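
Under that design, the CSV-level status can even be derived from the keywords instead of being stored and flipped by a controller; a sketch, again assuming per-keyword `is_scraped` and `last_error` columns:

```ruby
class Csv < ApplicationRecord
  has_many :keywords

  # The CSV counts as scraped only once every keyword job succeeded.
  def fully_scraped?
    keywords.count == keywords.where(is_scraped: true).count
  end

  # Keywords whose job failed, so the UI can point at them.
  def failed_keywords
    keywords.where.not(last_error: nil)
  end

  # Results for these keywords stay viewable even if others failed.
  def scraped_keywords
    keywords.where(is_scraped: true)
  end
end
```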