openownership / bodsdata

Data analysis tools to help analysts, journalists and anyone wanting to examine and dive into beneficial ownership data published in line with the Beneficial Ownership Data Standard
https://bods-data.openownership.org/

New output: convert BODS JSON to RDF using GraphDB #4

Open StephenAbbott opened 2 years ago

StephenAbbott commented 2 years ago

We received an offer from Cos at Blue Anvil to extend the BODS data analysis tools by reusing their code - https://github.com/blueanvil/bods-rdf - in order to convert BODS data from the Register or any other source and ingest it into an RDF repository.

cosmin-marginean commented 2 years ago

Some of my initial thoughts on BODS-to-RDF integration and some challenges to consider.

  1. I'm assuming that OpenOwnership will provide and host for download the RDF format "atomically", correct? (i.e. an RDF-format register dataset will be available for each published BODS JSON register dataset)
  2. The conversion is a long-running process (hours), and its run time is expected to grow with the register size.
  3. When integrating this, we should consider also providing the RDF format for individual registers, not just the combined register (https://github.com/openownership/bodsdata/issues/11)
  4. The conversion code at BODS-RDF (https://github.com/blueanvil/bods-rdf) is written in Kotlin (JVM) so there are several ways to proceed with integrating this, each with various implications:
    • 4.1 Integrate the code as a library in a processing pipeline running on the JVM. This will require JVM coding and JVM processes in the OpenOwnership pipeline.
    • 4.2 Run the Gradle build to produce .ttl files for BODS data from JSONL format. This will only require JVM 11+ to be available in the stack.
    • 4.3 Rewrite this in any of the Flatterer languages and integrate it there. As this seems to be Python/Rust, it means we won't be able to assist with it, so we'd need someone with experience in these languages for implementation (we'll obviously assist with the conceptual elements). However, I'd assume this would be the preferred/sane approach?
  5. The RDF vocabularies should probably be generated and provided as deliverables together with the RDF data set. This is a one-off that can be simply achieved with Gradle/JVM for each BODS schema release (Blue Anvil can do that periodically). Alternatively, it can be integrated with one of the options above.
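
To illustrate what option 4.3 might involve, here is a minimal Python sketch that streams BODS JSONL line by line and writes N-Triples (a line-based subset of Turtle). It is not the BODS-RDF mapping: the two namespaces and the predicate names (`statementType`, `name`, `hasSubject`, `hasInterestedParty`) are hypothetical placeholders, and only a handful of BODS 0.2-style fields are handled.

```python
import io
import json
from urllib.parse import quote

# Hypothetical namespaces for illustration only; NOT the BODS-RDF vocabulary.
RES = "https://example.org/bods/statement/"
VOC = "https://example.org/bods/vocab#"


def iri(base, local):
    """Build an IRI term, percent-encoding the local part."""
    return "<" + base + quote(local, safe="") + ">"


def literal(value):
    """Build a plain string literal with N-Triples escaping."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def statement_to_triples(stmt):
    """Yield N-Triples lines for one BODS 0.2-style statement (partial mapping)."""
    subject = iri(RES, stmt["statementID"])
    yield f"{subject} {iri(VOC, 'statementType')} {literal(stmt['statementType'])} ."

    # Entity statements carry `name`; person statements carry `names[].fullName`.
    name = stmt.get("name")
    names = stmt.get("names") or []
    if not name and names:
        name = names[0].get("fullName")
    if name:
        yield f"{subject} {iri(VOC, 'name')} {literal(name)} ."

    # Ownership-or-control statements link a subject entity to an interested party.
    target = (stmt.get("subject") or {}).get("describedByEntityStatement")
    if target:
        yield f"{subject} {iri(VOC, 'hasSubject')} {iri(RES, target)} ."
    party = stmt.get("interestedParty") or {}
    party_id = party.get("describedByEntityStatement") or party.get("describedByPersonStatement")
    if party_id:
        yield f"{subject} {iri(VOC, 'hasInterestedParty')} {iri(RES, party_id)} ."


def convert(lines, out):
    """Stream JSONL statements from `lines` to `out`; return the statement count."""
    count = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the input
        for triple in statement_to_triples(json.loads(line)):
            out.write(triple + "\n")
        count += 1
    return count


# Tiny in-memory demo with made-up statements.
demo = io.StringIO(
    json.dumps({"statementID": "e1", "statementType": "entityStatement", "name": "Acme Ltd"})
    + "\n"
    + json.dumps({
        "statementID": "o1",
        "statementType": "ownershipOrControlStatement",
        "subject": {"describedByEntityStatement": "e1"},
        "interestedParty": {"describedByPersonStatement": "p1"},
    })
    + "\n"
)
result = io.StringIO()
convert(demo, result)
print(result.getvalue(), end="")
```

Because statements are processed one line at a time, memory use stays flat regardless of register size, though total run time would still scale with the number of statements.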

StephenAbbott commented 2 years ago

Thanks @cosmin-marginean for the comprehensive feedback. Just back from holidays and catching up with updates. I'm due to work with our team on updates to the data analysis tools in August. Will be in touch as soon as possible.

StephenAbbott commented 1 year ago

@StephenAbbott to speak to @ScatteredInk about this work - https://github.com/cosmin-marginean/kbods - by @cosmin-marginean

StephenAbbott commented 1 year ago

Bear in mind the related discussion: https://github.com/openownership/data-standard/issues/121

StephenAbbott commented 1 year ago

From @cosmin-marginean:

There is a Downloads section here which contains info on all BODS RDF datasets: https://github.com/cosmin-marginean/kbods/tree/main/kbods-rdf

I'm exporting these when I get a chance (once a month or so) and am happy to host them in my S3 for now, so if you want to link to these, feel free to do so.

I also have a short bash script to produce them if you ever want to include these in the registry pipeline on your side (it takes a couple of hours to run, though, and needs about 50 GB of disk space).