Open OlegPhenomenon opened 12 months ago
This pull request is split into 5 parts for easier review.
Changed files are located in these folders:
/
app/interactions/actions
app/jobs
app/mailers
app/models
app/views/mailers
db
lib/gem_monkey_patches
test
bundle exec rake company_status:check_all -- --open_data_file_path=lib/tasks/data/ettevotja_rekvisiidid__lihtandmed.csv --missing_companies_output_path=lib/tasks/data/missing_companies_in_business_registry.csv --deleted_companies_output_path=lib/tasks/data/deleted_companies_from_business_registry.csv --download_path=https://avaandmed.ariregister.rik.ee/sites/default/files/avaandmed/ettevotja_rekvisiidid__lihtandmed.csv.zip
This rake task performs the following actions:
Therefore, the attributes look like this:
open_data_file_path
- specifies where the data is saved and retrieved from. Default value: lib/tasks/data/ettevotja_rekvisiidid__lihtandmed.csv
missing_companies_output_path
- specifies the path where companies not found in the business registry will be saved. Default value: lib/tasks/data/missing_companies_in_business_registry.csv
deleted_companies_output_path
- specifies the path where companies that have been removed from the registry will be saved. Default value: deleted_companies_from_business_registry.csv
download_path
- specifies where the data will be downloaded from. Default value: https://avaandmed.ariregister.rik.ee/sites/default/files/avaandmed/ettevotja_rekvisiidid__lihtandmed.csv.zip
Since this command already includes default values, it is not necessary to enter any parameters; they were simply added for greater flexibility. Therefore, you can run the following command:
bundle exec rake company_status:check_all
and the data will be available in the directory
lib/tasks/data
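As a rough illustration of how flags passed after `--` could be merged over the defaults, here is a minimal sketch using Ruby's standard OptionParser. The helper name parse_company_status_options and the DEFAULT_OPTIONS constant are hypothetical and not taken from this PR; the default values are copied as written in the attribute list above. The helper expects only the flags that follow the `--` separator.

```ruby
require 'optparse'

# Defaults copied from the attribute list in this description.
DEFAULT_OPTIONS = {
  open_data_file_path: 'lib/tasks/data/ettevotja_rekvisiidid__lihtandmed.csv',
  missing_companies_output_path: 'lib/tasks/data/missing_companies_in_business_registry.csv',
  deleted_companies_output_path: 'deleted_companies_from_business_registry.csv',
  download_path: 'https://avaandmed.ariregister.rik.ee/sites/default/files/avaandmed/ettevotja_rekvisiidid__lihtandmed.csv.zip'
}.freeze

# Hypothetical helper: merges any --key=value flags over the defaults.
# argv holds only the flags that follow the `--` separator.
def parse_company_status_options(argv)
  options = DEFAULT_OPTIONS.dup
  parser = OptionParser.new do |opts|
    DEFAULT_OPTIONS.each_key do |key|
      opts.on("--#{key}=VALUE", String) { |value| options[key] = value }
    end
  end
  parser.parse(argv) # non-destructive: argv itself is left untouched
  options
end
```

Calling parse_company_status_options([]) would return the defaults unchanged, which matches the behaviour of running the task without parameters.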
The job:
CompanyRegisterStatusJob.perform_later(days_interval = 14, spam_time_delay = 0.2, batch_size = 100, download_open_data_file_url='https://avaandmed.ariregister.rik.ee/sites/default/files/avaandmed/ettevotja_rekvisiidid__lihtandmed.csv.zip')
This job accepts the following parameters:
days_interval
- selects domains that were last checked more than {days_interval} days ago.
spam_time_delay
- the time delay applied when querying the business registry.
batch_size
- the size of the batch for processing. This is needed for optimization.
download_open_data_file_url
- the URL from which to download the business registry data.
As indicated above, all these values have default settings, so they can be modified if necessary.
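To make the interplay of days_interval, batch_size and spam_time_delay more concrete, here is a minimal plain-Ruby sketch. It is not the job's actual code: the CheckedCompany struct, the last_checked_at attribute and both method names are hypothetical, and treating a record that has never been checked as due is an assumption.

```ruby
# Hypothetical record type; the real application presumably uses its own models.
CheckedCompany = Struct.new(:name, :last_checked_at)

# Returns the records last checked more than days_interval days ago.
# Assumption: a record with no last_checked_at counts as due.
def due_for_check(records, days_interval:, now: Time.now)
  cutoff = now - days_interval * 24 * 60 * 60
  records.select { |r| r.last_checked_at.nil? || r.last_checked_at < cutoff }
end

# Yields each due record, batch by batch, sleeping after every registry query.
def each_due_in_batches(records, days_interval: 14, spam_time_delay: 0.2,
                        batch_size: 100, now: Time.now)
  due = due_for_check(records, days_interval: days_interval, now: now)
  due.each_slice(batch_size) do |batch|
    batch.each do |record|
      yield record            # the registry lookup would happen here
      sleep(spam_time_delay)  # throttle requests to the business registry
    end
  end
end
```

With the default values, a record checked 20 days ago would be selected and one checked yesterday would be skipped.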
What the job does:
POTENTIAL PROBLEM: If a large set of companies is checked in a single day and the next check runs much later, say in a year, all of those companies become due at the same time, so the job may have to process a large list of companies in one run. This should be kept in mind.
This PR is related to https://github.com/internetee/company_register/pull/6
Related tickets: https://github.com/internetee/company_register/issues/4 https://github.com/internetee/company_register/issues/5