mccgr / abn_lookup

Code for creating tables containing the ABNs of companies registered with the Australian Business Register, as published on the ABN Lookup website (https://abr.business.gov.au/)

Write function to process ABR nodes contained in the XML files #5

Closed bdcallen closed 5 years ago

bdcallen commented 5 years ago

@iangow As I mentioned in #4, I have been doing some work analyzing the xml files: determining what the fundamental nodes are in the xml data, and analyzing their child nodes, the multiplicities of those children, and the attributes of each type of node. After doing this for each of the xml files, I've been able to determine that the underlying tree for the data in each file is exactly the same (I will put the structure up and make some comments in further posts). This issue is for writing a function which scrapes the data accordingly.
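As a rough illustration of the structural survey described above, the sketch below counts child-tag multiplicities and attribute names for one node. This is Python rather than the repo's R, and the toy record's tags and attributes are invented for the example, not taken from the actual ABR schema:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy record; tag/attribute names are illustrative, not the real ABR schema.
sample = """
<ABR recordLastUpdatedDate="20190101">
  <ABN status="ACT">12345678901</ABN>
  <EntityType>PRV</EntityType>
  <OtherEntity>Name One</OtherEntity>
  <OtherEntity>Name Two</OtherEntity>
</ABR>
"""

def survey_node(node):
    """Multiplicity of each child tag, and the node's own attribute names."""
    return Counter(child.tag for child in node), sorted(node.attrib)

counts, attrs = survey_node(ET.fromstring(sample))
print(dict(counts))  # {'ABN': 1, 'EntityType': 1, 'OtherEntity': 2}
print(attrs)         # ['recordLastUpdatedDate']
```

Running this over every record in every file, and comparing the resulting counts and attribute sets, is one way to confirm that the underlying tree is identical across files.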

bdcallen commented 5 years ago

[Photo of hand-drawn tree diagram: 20190723_181229]

bdcallen commented 5 years ago

@iangow Excuse my bad handwriting, but I will say a few things about the graph in the photo:

bdcallen commented 5 years ago

@iangow Furthermore, the following can be said about the remaining child nodes at Level 3 and lower (i.e. descendants of the child nodes of ABR described above, except ABN and EntityType):

bdcallen commented 5 years ago

@iangow A few issues arise out of the comments I've just posted, particularly with regard to nodes whose multiplicity can be more than one.
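One way to handle the multiplicity issue (a sketch under assumptions, not necessarily the approach taken in the repo): keep single-multiplicity children as scalar fields, and collect repeatable children into lists, so a later step can write the lists out to a separate long table. The tag names here are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical record in which OtherEntity occurs more than once.
sample = """
<ABR>
  <ABN>12345678901</ABN>
  <OtherEntity>Trading Name A</OtherEntity>
  <OtherEntity>Trading Name B</OtherEntity>
</ABR>
"""

# Tags known (from the structural survey) to be repeatable.
REPEATABLE = {"OtherEntity"}

def collect_children(node):
    """Scalars for single-multiplicity tags; lists for repeatable ones.
    A naive {tag: text} dict would silently keep only the last occurrence."""
    out = {}
    for child in node:
        if child.tag in REPEATABLE:
            out.setdefault(child.tag, []).append(child.text)
        else:
            out[child.tag] = child.text
    return out

rec = collect_children(ET.fromstring(sample))
print(rec)  # {'ABN': '12345678901', 'OtherEntity': ['Trading Name A', 'Trading Name B']}
```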

bdcallen commented 5 years ago

@iangow I've just made a simple function in the latest commit, process_ABR_node, to generate a dataframe corresponding to the information going into what will be the main table. It seems there will be a little cleaning to do, such as picking the formats for the variables we want and handling issues with the names (particularly the given names from the IndividualName node).
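The actual process_ABR_node is written in R; purely to illustrate the flattening idea (the field names and the space-joining of given names are assumptions for this sketch, not the repo's actual choices), a one-record flattener might look like:

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the real files follow the ABR bulk-extract schema.
sample = """
<ABR>
  <ABN status="ACT">12345678901</ABN>
  <LegalEntity>
    <IndividualName>
      <GivenName>JOHN</GivenName>
      <GivenName>PAUL</GivenName>
      <FamilyName>SMITH</FamilyName>
    </IndividualName>
  </LegalEntity>
</ABR>
"""

def process_abr_node(abr):
    """Flatten one ABR node into a dict of main-table fields.
    Multiple GivenName nodes are joined with a space -- one possible way
    of handling the given-names issue mentioned above."""
    given = [g.text for g in abr.findall(".//IndividualName/GivenName")]
    return {
        "abn": abr.findtext("ABN"),
        "abn_status": abr.find("ABN").get("status"),
        "given_names": " ".join(given) or None,
        "family_name": abr.findtext(".//IndividualName/FamilyName"),
    }

row = process_abr_node(ET.fromstring(sample))
print(row)  # {'abn': '12345678901', 'abn_status': 'ACT', 'given_names': 'JOHN PAUL', 'family_name': 'SMITH'}
```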

bdcallen commented 5 years ago

@iangow I wrote a program in Python which downloads the xml files from the bulk download site (the same site from which you originally downloaded the zip files containing this information) and handles the file management. I had hoped to do everything in R, but it turned out that you can't get the correct html links using static scraping, since the site uses Javascript (as ASIC's does), hence dynamic scraping with Selenium was needed. Given my recent experience doing this sort of thing in Python, and its relative ease of use, I have chosen to do this part in that language. I have just committed that program. The next step is to finish an R program that will iterate over the resulting xml files in the directory into which they are extracted from the zips, and write the data into the postgres database.
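The Selenium-driven download itself is hard to reproduce in a snippet, but the file-management half can be sketched with the standard library (file and directory names below are made up for illustration): extract every .xml member of every downloaded zip into a working directory for the R stage to iterate over.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_xml_files(zip_dir, out_dir):
    """Extract every .xml member of every zip in zip_dir into out_dir.
    Returns the extracted paths, which a later R stage can iterate over."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for zpath in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zpath) as zf:
            for member in zf.namelist():
                if member.lower().endswith(".xml"):
                    zf.extract(member, out_dir)
                    extracted.append(out_dir / member)
    return extracted

# Tiny demo with a throwaway zip (illustrative file names only).
tmp = Path(tempfile.mkdtemp())
with zipfile.ZipFile(tmp / "part1.zip", "w") as zf:
    zf.writestr("20190101_Public01.xml", "<Transfer/>")
files = extract_xml_files(tmp, tmp / "xml")
print(files)
```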

bdcallen commented 5 years ago

@iangow Given that I have now developed functions in process_abn_lookup.R that can read the flattened xml files, I think we can close this issue.