manojpanchal90 / nimbletest


[Chore] Improve tracking of scraping status #4

Open olivierobert opened 11 months ago

olivierobert commented 11 months ago

Issue

Upon uploading keywords, the scraping will be performed via:

  1. A worker:

https://github.com/manojpanchal90/nimbletest/blob/87d0b55477d930c13d3619acd3c92bb85780a3f4/app/workers/process_keyword_worker.rb#L4-L8

  2. A service object: https://github.com/manojpanchal90/nimbletest/blob/main/app/services/html_fetcher_service.rb

However, the keyword data will be lost if an error occurs during scraping, since the keyword has not yet been saved to the database at that point.
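One way to avoid this loss is to persist each keyword with a pending status before any scraping is attempted. Below is a minimal plain-Ruby sketch of that idea (no Rails or Sidekiq is loaded; the `Keyword` class and `process` method are hypothetical stand-ins, not the repo's actual code):

```ruby
# Hypothetical in-memory stand-in for a persisted Keyword record.
class Keyword
  attr_reader :name
  attr_accessor :status

  def initialize(name)
    @name = name
    @status = :pending # saved first, before scraping is attempted
  end
end

# Save-first flow: a scraping failure updates the status
# instead of losing the keyword entirely.
def process(keyword, scraper)
  scraper.call(keyword.name)
  keyword.status = :completed
rescue StandardError
  keyword.status = :failed # record survives with an error status
end
```

Because the record exists before scraping starts, a failed fetch leaves behind a `failed` row that can be inspected or retried, instead of the keyword silently disappearing.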

[!NOTE] Your issues in running the Sidekiq queue have been taken into account. However, https://github.com/manojpanchal90/nimbletest/blob/87d0b55477d930c13d3619acd3c92bb85780a3f4/app/services/csv_processor.rb#L41 should use perform_async so that keywords are processed asynchronously instead of immediately.
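The difference between the two calls can be sketched in plain Ruby (Sidekiq itself is not loaded here; `FakeWorker` is a hypothetical stand-in for the repo's `ProcessKeywordWorker`): `perform` runs the job inline inside the request, while `perform_async` only pushes a job onto a queue for a background process to pick up later.

```ruby
# Hypothetical stand-in for a Sidekiq-style worker class.
class FakeWorker
  QUEUE = []

  # Inline execution: runs the scrape inside the current request.
  def self.perform(keyword_id)
    "scraped #{keyword_id}"
  end

  # Asynchronous enqueue: pushes a job and returns immediately;
  # a separate worker process would pick it up later.
  def self.perform_async(keyword_id)
    QUEUE << keyword_id
    nil
  end
end
```

With `perform_async`, the CSV upload request only pays the cost of enqueueing, not of scraping every keyword.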

Expected

The benefits are:

manojpanchal90 commented 11 months ago

@olivierobert I have implemented the changes as per your suggestion. From my understanding, the keyword should be saved to the database before it is sent for job processing. However, doing this before queuing would require multiple database queries within a single request-response cycle. Consequently, I opted to create the keyword within the job itself, and then proceeded with the scraping part.
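The approach described above (creating the keyword inside the job rather than before enqueueing) can be sketched in plain Ruby; `KeywordStore` and `ProcessKeywordJob` below are hypothetical stand-ins, not the actual implementation in the PR:

```ruby
# Hypothetical sketch: the worker itself creates the record, then scrapes.
# Trade-off: until the job runs, the keyword exists only in the
# queue payload, not in the database.
class KeywordStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  def create(name)
    @rows[name] = :created
  end
end

class ProcessKeywordJob
  def initialize(store, scraper)
    @store = store
    @scraper = scraper
  end

  def perform(name)
    @store.create(name) # record is saved inside the job
    @scraper.call(name) # then scraping proceeds
  end
end
```

This keeps the upload request cheap, at the cost that a lost or failed job before the `create` call still drops the keyword, which is the risk the original issue pointed out.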

In reference to the specific implementation, you can find the changes in commit d53e27790afc2238c9cebb0cbbd7229f88fd98fc (https://github.com/manojpanchal90/nimbletest/pull/8/commits/d53e27790afc2238c9cebb0cbbd7229f88fd98fc#diff-b651381f4cba5b34d4e0ed84effbb75dcc3ffd9a71034620889d5eff9e5a3278) at lines 15 to 16.

Additionally, I have implemented caching so that a keyword is not processed again within 48 hours of its last update.
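The 48-hour guard described above can be sketched as a simple freshness check in plain Ruby (the `fresh?` helper and `FRESHNESS_WINDOW` constant are hypothetical names, not the PR's actual code):

```ruby
FRESHNESS_WINDOW = 48 * 60 * 60 # 48 hours, in seconds

# True when the keyword was updated within the last 48 hours,
# in which case re-scraping should be skipped.
def fresh?(updated_at, now = Time.now)
  now - updated_at < FRESHNESS_WINDOW
end
```

In a Rails app this check would typically compare against the record's `updated_at` timestamp before enqueueing another scrape.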