Write function to process ABR nodes contained in the XML files

bdcallen commented 5 years ago

@iangow As I mentioned in #4, I was doing some work analyzing the xml files. That work involved determining what the fundamental nodes were in the xml data, and analyzing what the child nodes were, their multiplicities, and the attributes of each type of node. After doing this for each of the xml files, I've been able to determine that the underlying tree for the data in each file is exactly the same (I will put the structure up and make some comments in further posts). This issue is for writing a function which scrapes the data accordingly.

bdcallen commented 5 years ago

[Photo 20190723_181229: hand-drawn graph of the XML node structure]

bdcallen commented 5 years ago

@iangow Excuse my bad handwriting, but I will say a few things about the graph in the photo:

Each data point is contained in an ABR node (the head node of the graph). It has two attributes recordLastUpdatedDate and replaced.
Each ABR node has precisely one each of the following child nodes:

-- ABN: whose value contains the ABN number as a string. Has two attributes status and ABNStatusFromDate.

-- EntityType: contains two child nodes, each with a single value and no attributes, EntityTypeInd (Ind for "Index") and EntityTypeText. EntityType has no attributes itself.
Each ABR node has one or none of the following child nodes:

-- GST: contains information on whether an ABN has registered for the GST. This node's information is in fact contained entirely in two attributes status and GSTStatusFromDate, with no value or child nodes.

-- ASICNumber: contains information from ASIC. Has a single value which is usually the ACN or ARBN as a string. Has one attribute ASICNumberType

-- LegalEntity: no attributes, has two child nodes IndividualName and BusinessAddress

-- MainEntity: no attributes, has two child nodes NonIndividualName and BusinessAddress
Each ABR node has 0, 1 or more of the following child nodes:

-- OtherEntity: no attributes, has single child node NonIndividualName. This node seems to correspond with the "Trading Names" section of the corresponding page on the ABR ABN lookup website.

-- DGR: here, DGR stands for Deductible Gift Recipient. This node has no attributes, and has a single child node NonIndividualName.
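
A minimal sketch of how the top-level part of this structure could be read in Python with the standard library. The sample record and all of its values are made up from the description above, and the helper names (`child_text`, `child_attr`, `parse_top_level`) and output field names are hypothetical, not taken from the repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical record assembled from the structure described above.
# Every value here is invented for illustration.
SAMPLE = """
<ABR recordLastUpdatedDate="20190101" replaced="N">
  <ABN status="ACT" ABNStatusFromDate="20000101">12345678901</ABN>
  <EntityType>
    <EntityTypeInd>XYZ</EntityTypeInd>
    <EntityTypeText>Example Entity Type</EntityTypeText>
  </EntityType>
  <GST status="ACT" GSTStatusFromDate="20000701"/>
</ABR>
"""


def child_text(node, path):
    """Return the stripped text at `path` under `node`, or None if absent."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def child_attr(node, path, attr):
    """Return attribute `attr` of the node at `path`, or None if absent."""
    found = node.find(path)
    return None if found is None else found.get(attr)


def parse_top_level(abr):
    """Flatten the ABR attributes and its single-valued children into a dict."""
    return {
        "record_last_updated_date": abr.get("recordLastUpdatedDate"),
        "replaced": abr.get("replaced"),
        "abn": child_text(abr, "ABN"),
        "abn_status": child_attr(abr, "ABN", "status"),
        "abn_status_from_date": child_attr(abr, "ABN", "ABNStatusFromDate"),
        "entity_type_ind": child_text(abr, "EntityType/EntityTypeInd"),
        "entity_type_text": child_text(abr, "EntityType/EntityTypeText"),
        # GST and ASICNumber are optional, so these may come back as None.
        "gst_status": child_attr(abr, "GST", "status"),
        "gst_status_from_date": child_attr(abr, "GST", "GSTStatusFromDate"),
        "asic_number": child_text(abr, "ASICNumber"),
        "asic_number_type": child_attr(abr, "ASICNumber", "ASICNumberType"),
    }


record = parse_top_level(ET.fromstring(SAMPLE))
```

Returning None for the optional nodes keeps every record the same shape, which should make it straightforward to turn a list of these dicts into rows of a single table.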

bdcallen commented 5 years ago

@iangow Furthermore, the following can be said about the remaining child nodes at Level 3 and lower (i.e. descendants of the child nodes of ABR described above, except ABN and EntityType):

BusinessAddress: this node is a child of LegalEntity and MainEntity, and it has exactly the same structure everywhere. It has no attributes and one child node, AddressDetails. In turn, AddressDetails has no attributes and two child nodes, Postcode and State, each of which has a single value and no attributes.
NonIndividualName: this node appears as a child of MainEntity, OtherEntity and DGR. It has a single attribute, type, and a single child node, NonIndividualNameText. In turn, NonIndividualNameText has a single value stored as a string and no attributes.
IndividualName: this node appears as a child of LegalEntity. It has a single attribute, type. Each IndividualName node has precisely one FamilyName child node, zero or one NameTitle child nodes, and zero, one or more GivenName child nodes. In turn, each of FamilyName, GivenName and NameTitle has no attributes and a single value stored as a string.
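
As a rough illustration of the nested nodes just described, here is a Python sketch that reads BusinessAddress, NonIndividualName and IndividualName from one record. The sample XML, the `type` values and the function names are all hypothetical and only follow the structure laid out above.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the names, type codes and address are invented.
SAMPLE = """
<ABR>
  <ABN>12345678901</ABN>
  <LegalEntity>
    <IndividualName type="LGL">
      <NameTitle>MS</NameTitle>
      <GivenName>JANE</GivenName>
      <GivenName>MARY</GivenName>
      <FamilyName>CITIZEN</FamilyName>
    </IndividualName>
    <BusinessAddress>
      <AddressDetails>
        <State>VIC</State>
        <Postcode>3000</Postcode>
      </AddressDetails>
    </BusinessAddress>
  </LegalEntity>
  <OtherEntity>
    <NonIndividualName type="TRD">
      <NonIndividualNameText>EXAMPLE BAKERY</NonIndividualNameText>
    </NonIndividualName>
  </OtherEntity>
</ABR>
"""


def parse_business_address(entity):
    """Read State and Postcode; `entity` is a LegalEntity or MainEntity node."""
    return {
        "state": entity.findtext("BusinessAddress/AddressDetails/State"),
        "postcode": entity.findtext("BusinessAddress/AddressDetails/Postcode"),
    }


def parse_non_individual_name(node):
    """Read the type attribute and the text of a NonIndividualName node."""
    return {
        "type": node.get("type"),
        "name": node.findtext("NonIndividualNameText"),
    }


def parse_individual_name(node):
    """Read an IndividualName node; GivenName may occur any number of times."""
    return {
        "type": node.get("type"),
        "name_title": node.findtext("NameTitle"),  # None when absent
        "given_names": [g.text for g in node.findall("GivenName")],
        "family_name": node.findtext("FamilyName"),
    }


abr = ET.fromstring(SAMPLE)
legal_entity = abr.find("LegalEntity")
address = parse_business_address(legal_entity)
person = parse_individual_name(legal_entity.find("IndividualName"))
trading_names = [
    parse_non_individual_name(n)
    for n in abr.findall("OtherEntity/NonIndividualName")
]
```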

bdcallen commented 5 years ago

@iangow A few issues arise out of the comments I've just posted, particularly with regard to nodes with multiplicities that can be more than one.

OtherEntity and DGR should probably have their own tables, as each can potentially appear a few hundred times under a single ABR node. We would then need a way of linking these tables to a main table which contains the rest of the information for each ABR node. I'm guessing the ABN would be a suitable field for this, unless I'm mistaken in believing that ABNs are never reused.
GivenName can appear more than once under each IndividualName node. However, the multiplicity is rarely more than two, so I think these nodes should be combined into a single field. I guess we could put these values into a list for given_names, but I think the better thing to do is to combine the GivenName nodes' values with those of NameTitle and FamilyName into a single string field named IndividualName.
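
To make the two proposals concrete, here is a small Python sketch, assuming the ABN is used as the linking key. The wrapping `Records` element, the sample values, and the function and table names are hypothetical. It builds one row per ABR node for a main table, one row per OtherEntity or DGR name for the side tables, and joins NameTitle, GivenName and FamilyName into a single string.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the root element name and all values are invented.
SAMPLE = """
<Records>
  <ABR>
    <ABN>11111111111</ABN>
    <LegalEntity>
      <IndividualName type="LGL">
        <NameTitle>MR</NameTitle>
        <GivenName>JOHN</GivenName>
        <GivenName>PAUL</GivenName>
        <FamilyName>CITIZEN</FamilyName>
      </IndividualName>
    </LegalEntity>
    <OtherEntity>
      <NonIndividualName type="TRD">
        <NonIndividualNameText>FIRST TRADING NAME</NonIndividualNameText>
      </NonIndividualName>
    </OtherEntity>
    <OtherEntity>
      <NonIndividualName type="TRD">
        <NonIndividualNameText>SECOND TRADING NAME</NonIndividualNameText>
      </NonIndividualName>
    </OtherEntity>
  </ABR>
</Records>
"""


def individual_name(abr):
    """Join NameTitle, all GivenName values and FamilyName into one string."""
    node = abr.find("LegalEntity/IndividualName")
    if node is None:
        return None
    parts = [node.findtext("NameTitle")]
    parts += [g.text for g in node.findall("GivenName")]
    parts.append(node.findtext("FamilyName"))
    # Skip missing or empty pieces so no stray spaces are produced.
    return " ".join(p.strip() for p in parts if p and p.strip())


def name_rows(abr, tag):
    """One row per `tag` child (OtherEntity or DGR), keyed by the ABN."""
    abn = abr.findtext("ABN")
    return [
        {
            "abn": abn,
            "type": name.get("type"),
            "name": name.findtext("NonIndividualNameText"),
        }
        for name in abr.findall(f"{tag}/NonIndividualName")
    ]


main_table, other_entity_table, dgr_table = [], [], []
for abr in ET.fromstring(SAMPLE).iter("ABR"):
    main_table.append(
        {"abn": abr.findtext("ABN"), "individual_name": individual_name(abr)}
    )
    other_entity_table.extend(name_rows(abr, "OtherEntity"))
    dgr_table.extend(name_rows(abr, "DGR"))
```

Keeping the ABN on every side-table row means the tables can be joined back to the main table later, provided ABNs really are never reused.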

bdcallen commented 5 years ago

@iangow I've just made a simple function in the latest commit, process_ABR_node, to generate a dataframe that corresponds to the information going into what will be the main table. It seems that there will be a little cleaning to do, with picking the formats for the variables we want, and handling issues with the names (particularly the given names from the IndividualName node).

bdcallen commented 5 years ago

@iangow I wrote a program in Python which downloads the xml files and handles their file management. It pulls them from the bulk download site from which you originally downloaded the zip files containing the same information. I had hoped to do everything in R, but it turned out that you can't get the correct html links and code using static scraping, since the site uses JavaScript (like ASIC does); hence dynamic scraping and Selenium were needed. Given my recent experience doing this kind of thing in Python, and what seems like its relative ease of use, I have chosen to do this bit in that language. I have just committed that program. The next step is to finish off a program in R that will iterate over the xml files in the directory into which they are extracted from the zips, and write the data into the postgres database.
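
The iteration step is planned in R, so the following is only a Python sketch of the same idea and not the committed code: extract each zip into a directory, then stream the ABR nodes out of each xml file one at a time. The function names and directory layout are hypothetical, and streaming with `iterparse` is an assumption on the basis that the bulk files may be large.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path


def extract_zips(zip_dir, xml_dir):
    """Extract every .zip in `zip_dir` into `xml_dir`; return the xml paths."""
    xml_dir = Path(xml_dir)
    xml_dir.mkdir(parents=True, exist_ok=True)
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(xml_dir)
    return sorted(xml_dir.glob("*.xml"))


def iter_abr_nodes(xml_path):
    """Yield ABR elements one by one without loading the whole file.

    Each element is cleared once the caller asks for the next one, so it
    must be fully processed before the iteration advances.
    """
    for _event, elem in ET.iterparse(str(xml_path), events=("end",)):
        if elem.tag == "ABR":
            yield elem
            elem.clear()  # release the children of the processed record
```

A caller would loop over `extract_zips(...)`, then over `iter_abr_nodes(path)` for each file, turning each node into a row and writing the rows to the database in batches.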

bdcallen commented 5 years ago

@iangow Given I have developed functions that can read the flattened xml files in process_abn_lookup.R, I think we can close this issue.

mccgr / abn_lookup

Write function to process ABR nodes contained in the XML files #5