mccgr / abn_lookup

Code for creating tables containing the ABNs for companies registered with the Australian Business Register on the ABN lookup website (https://abr.business.gov.au/)

Add process for updating the data, including CRONJOB #15

Closed bdcallen closed 4 years ago

bdcallen commented 5 years ago

@iangow Most of what's needed to do this is already in place. We mainly need to add lines to the appropriate files that delete the old tables and write new ones to hold the updated data.

bdcallen commented 5 years ago

@iangow Just changed the title of this, as I think making a cronjob is a natural part of this issue.

bdcallen commented 5 years ago

@iangow I have just made an initial bash script to use as part of a cronjob. So, at least for this, we need a $CODE_DIR, and we need to choose a time for the cronjob.

bdcallen commented 5 years ago

@iangow Looking at page 7 of the ABN Bulk Extract readme, it says that the bulk extract is updated weekly. The bulk extract was last updated on 02/10/2019, which was Wednesday. So perhaps a weekly cronjob on the day after, on Thursday, would be best.

bdcallen commented 4 years ago

@iangow I have just been successful in getting the cronjob to run a bash script including a command to use get_abn_lookup_data.py. I modified my local abn_lookup bash script so that it included my variables (using export) for PGHOST, PGDATABASE, PGUSER, and PGPASSWORD, along with the lines

export ABN_LOOKUP_DIR=/home/bdcallen/abn_lookup
python3 $ABN_LOOKUP_DIR/get_abn_lookup_data.py

I ran into two main issues in getting cron to run the program:

(1) - My program was initially written to be run from inside the abn_lookup directory. So when the program called other programs in that folder, cron couldn't find them, because I had only given relative paths rather than the full paths to those programs. This was an easy fix, addressed in the above commit.

(2) - After fixing (1), cron was still failing with an error that xsltproc couldn't be found. This was because the directory containing xsltproc wasn't in cron's PATH, so I added a line setting the PATH variable to the crontab below, which is the one I eventually ran successfully today

# m h  dom mon dow   command

PATH=/home/bdcallen/anaconda3/bin:/home/bdcallen/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

20 16 * * * /home/bdcallen/abn_lookup/./abn_lookup_cronjob.sh

After including the PATH, cron was able to run xsltproc and thus the whole program. Everything worked as expected. Before the cronjob ran, the row count of the abns table was

crsp=> SELECT COUNT(*) FROM abn_lookup.abns_old;
  count   
----------
 14498162
(1 row)

and after it, it is now

crsp=> SELECT COUNT(*) FROM abn_lookup.abns;
  count   
----------
 14608789
(1 row)

I didn't realise how fussy cron is about needing the full paths for all programs used. As an aside, I suspect (1) or (2) could also be issues with the asxlisting cronjob.
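To illustrate the two fixes, here is a minimal sketch of how a Python script can avoid depending on cron's working directory and PATH. This is not the code in get_abn_lookup_data.py; it assumes the ABN_LOOKUP_DIR variable from the bash script, and the helper names `script_path` and `missing_tools` are made up for the example.

```python
import os
import shutil
from pathlib import Path


def script_path(name, base_dir=None):
    """Return an absolute path to a helper file in the code directory.

    base_dir defaults to $ABN_LOOKUP_DIR and falls back to the directory
    containing this file, so the result never depends on the directory
    cron happens to start the job in.
    """
    if base_dir is None:
        base_dir = os.environ.get("ABN_LOOKUP_DIR") or Path(__file__).resolve().parent
    return Path(base_dir).expanduser().resolve() / name


def missing_tools(tools, path=None):
    """Return the tools that cannot be found on PATH (or on `path` if given)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]
```

Calling `missing_tools(["xsltproc"])` at the start of the job and exiting with a message when the list is non-empty would report a problem like (2) directly, rather than failing part-way through the run.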

bdcallen commented 4 years ago

@iangow I have left the cronjob as is for now, but with the timing amended to

00 6 * * 5 /home/bdcallen/abn_lookup/./abn_lookup_cronjob.sh

as the bulk extract is updated weekly on the website. So this program will run at 6am every Friday.
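As a quick sanity check on the schedule, the sketch below works out when `00 6 * * 5` next fires after a given time. It handles only this one weekly schedule, not general cron syntax, and the function name `next_friday_6am` is made up for the example.

```python
from datetime import datetime, timedelta

FRIDAY = 4  # datetime.weekday() counts Monday as 0, so Friday is 4 (cron uses 5)


def next_friday_6am(now):
    """Return the first Friday 06:00 strictly after `now` (naive local time)."""
    candidate = now.replace(hour=6, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(FRIDAY - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# The bulk extract mentioned earlier was updated on Wednesday 2 October 2019.
updated = datetime(2019, 10, 2)
print(updated.strftime("%A"))    # Wednesday
print(next_friday_6am(updated))  # 2019-10-04 06:00:00
```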

bdcallen commented 4 years ago

@iangow I've changed my crontab to this, and amended the bash script so that the PG variables aren't in it (I've also done the same for asic), and added a line setting ABN_LOOKUP_DIR in the crontab. Note that the bash script for the abn_lookup part uses ABN_LOOKUP_DIR instead of a variable CODE_DIR

# m h  dom mon dow   command

PATH=/home/bdcallen/anaconda3/bin:/home/bdcallen/perl5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
PGHOST=*************
PGDATABASE=***************
PGUSER=************
PGPASSWORD=*************
ABN_LOOKUP_DIR=***********
ASIC_DIR=**************

00 6 * * 5 /home/bdcallen/abn_lookup/./abn_lookup_cronjob.sh
00 3 * * 4 $ASIC_DIR/./asic_bulk_extract_cronjob.sh

I know this will work, as the asic part of the cronjob used the environment variables correctly and executed successfully, so I will close this (as well as the analogous issue for abn_lookup) for now. If I see an issue in the output of the cronjob on Friday (I've been sending its output to a file called dead.letter), perhaps we can reopen then.
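Since the job now relies entirely on variables set in the crontab, a check like the following at the top of the Python script could catch a missing one early. It is only a sketch: the variable list mirrors the crontab in this comment, and the name `missing_env_vars` is made up for the example. The PG* variables are the standard ones that libpq-based clients read for connection settings.

```python
import os

# Variables the crontab is expected to provide (assumed from the crontab above).
REQUIRED_VARS = ("PGHOST", "PGDATABASE", "PGUSER", "PGPASSWORD", "ABN_LOOKUP_DIR")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment."""
    if environ is None:
        environ = os.environ
    return [name for name in names if not environ.get(name)]
```

If `missing_env_vars()` returns a non-empty list, the script can exit with a message naming those variables, which would then show up in the cron output file.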