openownership / bodspipelines

Shared library intended to support building pipelines to produce beneficial ownership statements (BODS) data.
GNU Affero General Public License v3.0

General Performance/Efficiency Improvements #14

Closed: radix0000 closed 1 month ago

radix0000 commented 1 year ago

There are a number of performance/efficiency improvements to the current pipeline that we may want to consider. They would have positive effects across the board (faster pipeline execution will help not only in production but also in future development), and would be especially beneficial to either of the two main options (2 or 3) for improving the handling of updates to input data (see https://github.com/openownership/bodspipelines/issues/9), since both of these options on their own would likely result in significantly longer processing times. Improvements to consider would be:

  1. Improve XML parser performance
  2. Optimise Elasticsearch usage
  3. Decouple various sub-stages of pipeline (with concurrency or separate processes)
  4. Possibly improve Kinesis usage (though that has seen some work already)
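
As a rough illustration of item 1, one common way to speed up XML handling is to stream the input rather than load the whole document. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse`; it is not the parser the pipeline actually uses, and the element names (`record`, `id`, `name`) and the helper `iter_records` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield one dict per <tag> element without building the full tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy out the data we need before discarding the element.
            yield {child.tag: child.text for child in elem}
            root.clear()  # drop processed children so memory stays flat


# Hypothetical sample input; real source files would be far larger.
SAMPLE = (
    b"<records>"
    b"<record><id>1</id><name>A</name></record>"
    b"<record><id>2</id><name>B</name></record>"
    b"</records>"
)

records = list(iter_records(io.BytesIO(SAMPLE), "record"))
print(records)
```

Whether this helps depends on where the parsing time actually goes, so it would need profiling against real input first.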

Depending on exactly where the bottlenecks are, significant performance gains could likely be achieved with a small amount of effort, which would provide a good foundation to move forward from.
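
For item 3, a minimal sketch of what decoupling two sub-stages with concurrency might look like, assuming an `asyncio`-based design with a bounded queue between a parsing stage and a storage stage. The stage names, `BATCH_SIZE`, and the list used as a sink are all hypothetical; in practice the sink would be a bulk write to the data store, which also relates to item 2.

```python
import asyncio

BATCH_SIZE = 2  # hypothetical; real bulk sizes would be much larger
_DONE = object()  # sentinel marking the end of the stream


async def parse_stage(items, queue):
    """Producer: push parsed items, then signal completion."""
    for item in items:
        await queue.put(item)  # waits when the queue is full (backpressure)
    await queue.put(_DONE)


async def store_stage(queue, sink):
    """Consumer: group items into batches before writing them out."""
    batch = []
    while True:
        item = await queue.get()
        if item is _DONE:
            break
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            sink.append(batch)  # stand-in for one bulk write
            batch = []
    if batch:
        sink.append(batch)  # flush the final partial batch


async def run_pipeline(items):
    """Run both stages concurrently, linked by a bounded queue."""
    queue = asyncio.Queue(maxsize=4)
    sink = []
    await asyncio.gather(parse_stage(items, queue), store_stage(queue, sink))
    return sink


batches = asyncio.run(run_pipeline([1, 2, 3, 4, 5]))
print(batches)
```

The bounded queue keeps a fast parser from running far ahead of a slower storage stage, which limits memory growth while still letting the two overlap.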

kathryn-ods commented 2 months ago

@radix0000, has this been done? I know you made a lot of performance improvements recently.

radix0000 commented 1 month ago

Closing as done for now, but I have created a new issue to stick a pin in a potential future memory issue: https://github.com/openownership/bodspipelines/issues/19