openownership / register

A demonstration transnational register of beneficial ownership data from the UK, Denmark, Slovakia and Armenia
https://register.openownership.org
GNU Affero General Public License v3.0
18 stars 3 forks source link

Refactor internal dependencies #215

Closed tiredpixel closed 5 months ago

tiredpixel commented 11 months ago

Register is dependent on every internal Register-related library, not just Register Common. Perhaps this is necessary, perhaps not. Having it dependent on everything means that despite there being different Transformer and Ingestor projects, Register frequently needs updating as part of that process. Not only that, but it results in more dependencies for Register itself, because of the need to satisfy internal Git libraries in the project itself (in contrast to libraries published to RubyGems).

Investigate whether all these internal dependencies are necessary, or whether anything can be removed (e.g. by moving Register-only structs into that project directly, rather than storing them in a library that part of which doesn't appear to be used anywhere else). However, it might be that these libraries are in fact necessary, because of raw data or similar.

It would be best for this ticket to wait for a broader discussion on the future of cross-repo structure for Register (e.g. Sources SK, Transformer SK, Ingestor SK).

This is a spin-off ticket from https://github.com/openownership/register/issues/188 .

tiredpixel commented 7 months ago

Register Common

The only parts of Register Common used are adapters/s3_adapter and structs/aws_credentials. These are used within config/initializers/adapters.rb for setting up the S3 adapter. I'm not fully convinced of the merits of this approach—however, that's irrelevant, since there are custom methods written for S3 use, and even if not, there is almost certainly merit in having at least one shared library across both Register Web and data pipelines. This is Register Common, so I'm satisfied that this is a valuable internal dependency.

Register Sources DK, Register Sources PSC, Register Sources SK

The only parts of these used are are repositories/x, where x varies between DK, PSC, and SK datasources because of inconsistent naming. These are used within app/repositories/raw_data_record_repository.rb to provide functionality for viewing raw data via the original source data link for an entity. Whilst this functionality is obviously important for Register, I'm a little unsure about the way in which it's been implemented: DK, PSC, and SK repositories are all registered within initialize method, and then searches are made within each repository, and those results mapped back to each repository according to different naming for each repository. This means that adding a repository to Register requires multiple things to be changed:

I think this can probably be simplified. However, doing so would require more consistency between Register Sources repositories, in particular in the repositories directory. This is exactly what is noted in https://github.com/openownership/register/issues/216 , so I'm parking this part until that's done.

Register Sources OC

It's not immediately clear where this is used. Nothing appears to be using it directly, so either it is unused or a hidden dependency. Main functionality of Register Web, including raw data, appears to work without this dependency.

Register Sources BODS

There appear to be two main uses of this repository:

  1. For structs related to BODS, and use within raw data repository. These are within structs, enums, and repositories.
  2. For register-specific files. These are within register.

(1) seems fair enough, since Register needs some data definitions such as are shared with data pipelines. I'm not convinced this needs to be in a whole separate repository, rather than, say, in Register Common, but that's a separate topic, related to the overall multi-repository structure in general. So, I think this can be left, within this ticket, at least.

(2) merits more analysis. Specifically, Register Sources BODS contains the following within register:

Thus, it appears that files within Sources BODS register are used only amongst files in the same directory, or within Register itself. Considering that, I think there's limited merit in having them live in a separate library, rather than simply within Register itself.

To-do

Concluding, to-do for this task:

tiredpixel commented 7 months ago

Correction re PaginatedArray

Since PaginatedArray is also used in Sources BODS repositories/bods_statement_repository.rb, it cannot be moved to Register. However, PaginatedArray subclasses Array and does not depend on anything else in Register & friends, meaning it is a standalone class. This is a good reason to extract it to Register Common, after which, both Sources BODS and Register can use it.

To-do

tiredpixel commented 7 months ago

Update re Sources BODS Register

After having spent some time digging through Sources BODS for register extraction, I've changed my mind. Although some things, such as search, could likely be better placed within Register itself, there is a tight coupling to Sources BODS, not only in usage, but also seemingly in many data definitions being placed there—even if not appararently used within Ingestors or Transformers. Thus, I'm leaving them as-is, but might well extract specific search functionality in the future, now I'm more acquainted with what's going on there.

tiredpixel commented 7 months ago

The remainder of this ticket is blocked by https://github.com/openownership/register/issues/216 .

tiredpixel commented 7 months ago

Given that I wasn't able to deduplicate as much in #216 as I'd hoped, because of the plethora of small differences between the code for each datasource, I don't think it would be straightforward to remove any further dependencies from Register, at present. This is very disappointing, but I'm not sure how much value there is in working on those sections of code further.