olivierobert opened 2 years ago
The `is_scraped` attribute of a CSV is responsible for checking whether all of its keywords have been scraped, by counting the records created for that CSV. Because the background process handles 10 keywords per job and completion is inferred by counting created records, duplicates can occur and the program can run forever. We must move each keyword into its own job so that errors can be handled per keyword, and we must use batch job processing so that all jobs run in a single batch whose id the CSV can reference to check whether all jobs have completed.
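For illustration, here is a minimal sketch of that approach, assuming Sidekiq Pro's batch API (the thread does not name a job backend); `ScrapeKeywordWorker`, `ScrapeBatchCallback`, and the `batch_id` column are hypothetical names, not existing code:

```ruby
# Sketch only: assumes Sidekiq Pro batches; class and column names are illustrative.
class ScrapeBatchCallback
  # Fired once every job in the batch has run; `status` reports batch-wide
  # totals and failure counts.
  def on_complete(status, options)
    csv = Csv.find(options["csv_id"])
    csv.update!(is_scraped: status.failures.zero?)
  end
end

batch = Sidekiq::Batch.new
batch.description = "Scrape keywords for CSV ##{csv.id}"
batch.on(:complete, ScrapeBatchCallback, "csv_id" => csv.id)
batch.jobs do
  # One job per keyword instead of 10 keywords per job.
  csv.keywords.find_each { |keyword| ScrapeKeywordWorker.perform_async(keyword.id) }
end
# The CSV references the batch id to check completion, instead of counting records.
csv.update!(batch_id: batch.bid)
```

With this shape, completion is asked of the batch via the stored `bid` rather than inferred from record counts, which removes the duplication and endless-loop problem described above.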
I would like to know: will we abort the scraping and remove all results related to that CSV, or just continue the background process?
Since the scraping would be separated by keyword, each job would be independent. If one of the jobs fails, the `is_scraped` status for the CSV would be false, and we could show which keyword it failed for. However, we would still be able to show the scraped results for all the keywords whose scraping job succeeded.
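To make that concrete, a sketch of the model side, assuming each keyword row carries a `status` value such as `pending`/`scraped`/`failed` (an assumption, not existing code):

```ruby
class Csv < ApplicationRecord
  has_many :keywords

  # Keywords whose scraping job failed, so the UI can name them explicitly.
  def failed_keywords
    keywords.where(status: :failed)
  end

  # Results remain available for every keyword that was scraped successfully.
  def scraped_keywords
    keywords.where(status: :scraped)
  end
end
```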
Issue
`ScrapeGoolgeSearchResult` loops through a list of keywords (in batches of 10). As a result, any unhandled exception breaks the scraping for several keywords instead of failing for only one. Since the UI layer is tied to the attribute `is_scraped`, this could lead to an endless loop.

In addition, in the current implementation, updating the attribute `is_scraped` is the responsibility of the CSV controller, which is not ideal. Again, if the controller action fails with any unhandled exception, the attribute is never updated. Overall, managing the scraping status flag at the CSV file level makes things more complicated than they need to be.
Expected
The result of the scraping is managed at the keyword level and each keyword is processed in the background separately.
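A minimal sketch of such a per-keyword worker, again assuming Sidekiq and a keyword-level `status` column (`GoogleScraper` is a hypothetical service object):

```ruby
class ScrapeKeywordWorker
  include Sidekiq::Worker
  sidekiq_options retry: 3

  # Once retries are exhausted, only this keyword is marked as failed; the
  # other keywords in the batch are unaffected.
  sidekiq_retries_exhausted do |msg, _exception|
    Keyword.find(msg["args"].first).update!(status: :failed)
  end

  def perform(keyword_id)
    keyword = Keyword.find(keyword_id)
    result = GoogleScraper.call(keyword.text) # hypothetical scraping service
    keyword.update!(result: result, status: :scraped)
  end
end
```

Because success and failure are recorded on the keyword itself, no controller action is responsible for flipping a CSV-level flag, and one bad keyword can no longer block or corrupt the status of the others.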