Closed bdcallen closed 5 years ago
@iangow Excuse my bad handwriting, but I will say a few things about the graph in the photo:
Each data point is contained in an ABR node (the head node of the graph). It has two attributes, recordLastUpdatedDate and replaced.
Each ABR node has precisely one each of the following child nodes:
-- ABN: whose value contains the ABN number as a string. Has two attributes, status and ABNStatusFromDate.
-- EntityType: contains two child nodes, each with a single value and no attributes, EntityTypeInd (Ind for "Index") and EntityTypeText. EntityType has no attributes itself.
Each ABR node has one or none of the following child nodes:
-- GST: contains information on whether an ABN has registered for the GST. This node's information is in fact contained entirely in two attributes, status and GSTStatusFromDate, with no value or child nodes.
-- ASICNumber: contains information from ASIC. Has a single value which is usually the ACN or ARBN as a string. Has one attribute, ASICNumberType.
-- LegalEntity: no attributes, has two child nodes, IndividualName and BusinessAddress.
-- MainEntity: no attributes, has two child nodes, NonIndividualName and BusinessAddress.
Each ABR node has 0, 1 or more of the following child nodes:
-- OtherEntity: no attributes, has a single child node NonIndividualName. This node seems to correspond to the "Trading Names" section of the corresponding page on the ABR ABN lookup website.
-- DGR: here, DGR stands for Deductible Gift Recipient. This node has no attributes, and has a single child node NonIndividualName.
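As a rough illustration of the structure described above, here is a minimal Python sketch that flattens the single-valued children of one ABR node into a dict (one row of a main table). The sample record and all of its values are invented, and the function name `abr_main_fields` and the output column names are hypothetical, not taken from the repository.

```python
import xml.etree.ElementTree as ET

# Invented record shaped like the description above; every value is made up.
SAMPLE = """
<ABR recordLastUpdatedDate="20180101" replaced="N">
  <ABN status="ACT" ABNStatusFromDate="20000101">12345678901</ABN>
  <EntityType>
    <EntityTypeInd>PRV</EntityTypeInd>
    <EntityTypeText>Australian Private Company</EntityTypeText>
  </EntityType>
  <ASICNumber ASICNumberType="undetermined">123456789</ASICNumber>
</ABR>
"""


def _attr(node, name):
    """Return an attribute of an optional node, or None if the node is absent."""
    return node.get(name) if node is not None else None


def abr_main_fields(abr):
    """Flatten the single-valued children of one ABR element into a dict."""
    abn = abr.find("ABN")      # always present, per the description above
    gst = abr.find("GST")      # zero or one; carries attributes only
    asic = abr.find("ASICNumber")  # zero or one
    return {
        "abn": abn.text,
        "abn_status": abn.get("status"),
        "abn_status_from_date": abn.get("ABNStatusFromDate"),
        "record_last_updated_date": abr.get("recordLastUpdatedDate"),
        "replaced": abr.get("replaced"),
        "entity_type_ind": abr.findtext("EntityType/EntityTypeInd"),
        "entity_type_text": abr.findtext("EntityType/EntityTypeText"),
        "gst_status": _attr(gst, "status"),
        "gst_status_from_date": _attr(gst, "GSTStatusFromDate"),
        "asic_number": asic.text if asic is not None else None,
        "asic_number_type": _attr(asic, "ASICNumberType"),
    }


row = abr_main_fields(ET.fromstring(SAMPLE))
```

Optional nodes are tested with `is not None` rather than truthiness, because an element without children (such as GST) is falsy in ElementTree.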
@iangow Furthermore, the following can be said about the remaining child nodes at Level 3 and lower (i.e. descendants of the child nodes of ABR described above, except ABN and EntityType):
BusinessAddress: this node is a child of LegalEntity and MainEntity. It has the exact same structure everywhere: it has no attributes and one child node, AddressDetails. In turn, AddressDetails has no attributes and two child nodes, Postcode and State, which in turn have a single value each and no attributes.
NonIndividualName: this appears as a child node of MainEntity, OtherEntity and DGR. Has a single attribute, type, and a single child node, NonIndividualNameText. In turn, NonIndividualNameText has a single value stored as a string and no attributes.
IndividualName: this node appears as a child of LegalEntity. Has a single attribute, type. Each IndividualName node has precisely one FamilyName child node, zero or one NameTitle child node, and 0, 1 or more GivenName child nodes. In turn, each of FamilyName, GivenName and NameTitle has no attributes and a single value as a string.
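The lower-level nodes described above could be read with small helpers along these lines. This is only a sketch in Python: the helper names, the returned dict keys and the sample values (including the `type` value) are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented LegalEntity node following the structure described above.
SAMPLE = """
<LegalEntity>
  <IndividualName type="LGL">
    <NameTitle>MR</NameTitle>
    <GivenName>JOHN</GivenName>
    <GivenName>ANDREW</GivenName>
    <FamilyName>CITIZEN</FamilyName>
  </IndividualName>
  <BusinessAddress>
    <AddressDetails>
      <State>NSW</State>
      <Postcode>2000</Postcode>
    </AddressDetails>
  </BusinessAddress>
</LegalEntity>
"""


def parse_address(entity):
    """Return (state, postcode) for a LegalEntity or MainEntity node."""
    details = entity.find("BusinessAddress/AddressDetails")
    if details is None:
        return None, None
    return details.findtext("State"), details.findtext("Postcode")


def parse_non_individual_name(node):
    """Return the type attribute and text of a NonIndividualName node."""
    return {"type": node.get("type"),
            "text": node.findtext("NonIndividualNameText")}


def parse_individual_name(node):
    """Return the parts of an IndividualName node; GivenName may repeat."""
    return {
        "type": node.get("type"),
        "title": node.findtext("NameTitle"),  # None when absent
        "given_names": [g.text for g in node.findall("GivenName")],
        "family_name": node.findtext("FamilyName"),
    }


entity = ET.fromstring(SAMPLE)
address = parse_address(entity)
person = parse_individual_name(entity.find("IndividualName"))
```

Because BusinessAddress has the same structure under both LegalEntity and MainEntity, one address helper can serve both.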
@iangow A few issues arise out of the comments I've just posted, particularly with regard to nodes with multiplicities that can be more than one.
OtherEntity and DGR should probably have their own tables, as each can potentially appear a few hundred times under a single ABR node. We would then need a way of linking these to a main table which contains the other information. I'm guessing the ABN would be a suitable field for this, unless I'm mistaken in believing that ABNs are never reused.
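To make the separate-tables idea concrete, here is a minimal Python sketch that emits one row per OtherEntity or DGR node, each carrying the parent's ABN as the linking key. It assumes ABNs are in fact unique and never reused, which has not been verified here; the function name, column names and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented record with repeated child nodes; all values are made up.
SAMPLE = """
<ABR recordLastUpdatedDate="20180101" replaced="N">
  <ABN status="ACT" ABNStatusFromDate="20000101">12345678901</ABN>
  <OtherEntity>
    <NonIndividualName type="TRD">
      <NonIndividualNameText>EXAMPLE TRADING</NonIndividualNameText>
    </NonIndividualName>
  </OtherEntity>
  <OtherEntity>
    <NonIndividualName type="TRD">
      <NonIndividualNameText>EXAMPLE SHOP</NonIndividualNameText>
    </NonIndividualName>
  </OtherEntity>
  <DGR>
    <NonIndividualName type="DGR">
      <NonIndividualNameText>EXAMPLE FUND</NonIndividualNameText>
    </NonIndividualName>
  </DGR>
</ABR>
"""


def name_rows_by_table(abr):
    """Return rows for the two child tables, each row keyed by the parent ABN."""
    abn = abr.findtext("ABN")
    tables = {"OtherEntity": [], "DGR": []}
    for tag, rows in tables.items():
        for entity in abr.findall(tag):
            name = entity.find("NonIndividualName")
            if name is None:
                continue  # defensive: skip a malformed node
            rows.append({
                "abn": abn,  # link back to the main table
                "name_type": name.get("type"),
                "name": name.findtext("NonIndividualNameText"),
            })
    return tables


tables = name_rows_by_table(ET.fromstring(SAMPLE))
```

Each list can then be written to its own table and joined to the main table on `abn`.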
GivenName can appear more than once under each IndividualName node. However, the multiplicity is rarely more than two, so I think these nodes should be combined into a single field. I guess we could put these values into a list for given_names, but I think the better thing to do is to combine the GivenName nodes' values with those of NameTitle and FamilyName into a string field with the name IndividualName.
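A minimal sketch of that combination in Python, assuming the parts should be joined with single spaces in the order title, given names, family name (the ordering and the function name are my assumptions, not something fixed in this thread):

```python
import xml.etree.ElementTree as ET


def individual_name_string(node):
    """Join NameTitle, all GivenName values and FamilyName into one string."""
    parts = [node.findtext("NameTitle")]               # zero or one
    parts += [g.text for g in node.findall("GivenName")]  # zero or more
    parts.append(node.findtext("FamilyName"))          # exactly one
    # Drop missing or blank parts so no stray spaces appear.
    return " ".join(p.strip() for p in parts if p and p.strip())


# Invented example values.
full = individual_name_string(ET.fromstring(
    "<IndividualName type='LGL'>"
    "<NameTitle>MR</NameTitle>"
    "<GivenName>JOHN</GivenName><GivenName>ANDREW</GivenName>"
    "<FamilyName>CITIZEN</FamilyName>"
    "</IndividualName>"
))
```

A single string is simple to store, but it cannot be reliably split back into its parts, so keeping the separate fields as well may be worth considering.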
@iangow I've just made a simple function in the latest commit, process_ABR_node, to generate a dataframe that corresponds to the information going into what will be the main table. It seems that there will be a little cleaning to do, with picking the formats for the variables we want, and handling issues with the names (particularly the given names from the IndividualName node).
@iangow I wrote a program in Python which downloads the xml files from the bulk download site (the site from which you originally downloaded the zip files containing the same information) and handles the file management. I had hoped to do everything in R, but it turned out that you can't get the correct HTML links and code using static scraping, since the site uses JavaScript (like ASIC does), hence dynamic scraping and Selenium were needed. Given my recent experience of doing this sort of thing in Python, and what seems like its relative ease of use, I have chosen to do this bit in that language. I have just committed that program. The next step is to finish off a program in R that will iterate over the xml files in the directory into which they are extracted from the zips, and write the data into the postgres database.
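For illustration only, the extract-then-iterate step might look like the Python sketch below (the plan here is to do this step in R, so this is not the committed code). It assumes the zips sit in one directory and that each extracted file contains ABR elements, either as the root or nested under some wrapper element; the function names and directory arguments are hypothetical. It uses only the standard library and streams each file so that large extracts need not be held in memory.

```python
import glob
import os
import zipfile
import xml.etree.ElementTree as ET


def extract_zips(zip_dir, out_dir):
    """Extract every .zip in zip_dir into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for path in sorted(glob.glob(os.path.join(zip_dir, "*.zip"))):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(out_dir)


def iter_abr_nodes(xml_path):
    """Yield each ABR element in a file, freeing it after the caller is done."""
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "ABR":
            yield elem
            elem.clear()  # runs when the caller asks for the next node


def iter_all_abr_nodes(out_dir):
    """Yield ABR elements from every .xml file in out_dir, in filename order."""
    for path in sorted(glob.glob(os.path.join(out_dir, "*.xml"))):
        yield from iter_abr_nodes(path)
```

Each yielded node should be processed before the next one is requested, since it is cleared as soon as the generator resumes.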
@iangow Given I have developed functions that can read the flattened xml files in process_abn_lookup.R, I think we can close this issue.
@iangow As I mentioned in #4, I was doing some work analyzing the xml files. That work was determining what the fundamental nodes were in the xml data, and analyzing what the child nodes were, their multiplicities, and the attributes of each type of node. After doing this for each of the xml files, I've been able to determine that the underlying tree for the data in each file is exactly the same (I will put the structure up and make some comments in further posts). This issue is for writing a function which scrapes the data accordingly.