moov-io / ach

ACH implements a reader, writer, and validator for Automated Clearing House (ACH) files. The HTTP server is available in a Docker image and the Go package is available.
https://moov-io.github.io/ach/
Apache License 2.0
452 stars 150 forks source link

MergeFiles improvements #1309

Closed adamdecaf closed 1 month ago

adamdecaf commented 10 months ago

The Moov ACH library has a few ways of merging ACH files together, which can be useful to minimize the number of files sent over the network and cost from vendors, Fed, or other providers.

The performance and utility of MergeFiles can be improved and this issue sets out to discuss a few options.

Previous issues: https://github.com/moov-io/ach/pull/1041, https://github.com/moov-io/ach/pull/1276, https://github.com/moov-io/ach/pull/1282

midepeter commented 7 months ago

Hello @adamdecaf,

Just got to know about moov and its OSS community. I must say this is great and beautiful. I am golang engineer and recently looking into perfomance engineering. I will love to help take a look at issue and investigate and give a report on how it can be carried out if given the opportunity to.

Looking forward to hearing from you.

Thank you Peter.

adamdecaf commented 7 months ago

Sounds good! Let me know if you need anything or have questions. Do you have a place to start?

midepeter commented 7 months ago

@adamdecaf

Yes, I do. Will try to familiarize myself with the codebase and investigate properly. If i do have questions, i will relate them to you.

Thank you!

midepeter commented 7 months ago

@adamdecaf

  • Is it faster to progressively merge files (from disk/io.Reader, *ach.File instances, or another input?)

I have some questions regarding this statement. Do you mean I should verify which is more efficient between merging files directly from the disk as they are being read and merging the files after they have been read and converted to *ach.File instance before merging.

adamdecaf commented 7 months ago

There are a few options available. We currently need the files in memory (as an []*File) which is inefficient because each file has to be read from disk (or created in memory) before calling MergeFiles. There's a use-case in achgateway where we are reading files from disk and merging them, so a function to merge them faster in that path would be desired.

There may be faster ways to merge []*File as well, which would be an improvement. We need to keep the existing MergeFiles function for backwards compatibility.

I've linked a few PR's in the description which add benchmarks or previous performance improvements.

midepeter commented 7 months ago

Yea! Thanks for that

midepeter commented 7 months ago

Hello @adamdecaf

Not sure I was able to pull this off. But i will love to learn more about this codebase in general and probably pick up smaller tasks so i can get used to it because I could not really understand some parts of the code itself.

The only recommedation i have got is why we are using []*file and not []file because the latter has more memory perks than the former.

Thank you. If there are any smaller tasks i could take on that could get me used to the codebase, I will be really happy to pick it up.

adamdecaf commented 7 months ago

Why would []File have better memory perks than []*File? Passing around an array of pointers is lighter-weight than passing around the structs.

midepeter commented 7 months ago

I think that it is a general rule of thumb to try as much as possible to reduce the use of pointers in a program because it thereby reduces the amount of heap allocations done. It is not always so actually but it is so common in golang programs

-- compiler generates a lot of checks when using pointer which makes programs a bit slower -- pointers often have poor locality of reference -- it gives the garbage collector more work

Just some opinions regarding that

adamdecaf commented 7 months ago

We'd need to compare benchmarks between MergeFiles with and without pointers. Typically pointers to larger structs are more efficient and faster. See https://medium.com/@meeusdylan/when-to-use-pointers-in-go-44c15fe04eac

midepeter commented 7 months ago

Yes

Hm, Looking at it now, trying generate becnchmark reports for both cases. Quite a deep change in to the codebase. Not sure if we are ready to risk that at this point or i can go ahead.

adamdecaf commented 7 months ago

You can make whatever changes on a branch to compare pointers vs no pointers.

midepeter commented 7 months ago

Oh okay perfect

midepeter commented 6 months ago
  • Is it faster to progressively merge files (from disk/io.Reader, *ach.File instances, or another input?)

@adamdecaf

By the way just to note here that it has this very approach proven that this is will improve the memory performance but reduce the cpu time significantly which i believe is not what we desire.

here some resources I found regarding that

  1. https://gophers.slack.com/archives/C0VP8EF3R/p1708362344757979
  2. https://sourcegraph.com/blog/slow-to-simd [ In this article emphasizing on One possible mitigation would be to move the vectors to disk, but loading them from disk at search time can add [significant latency](https://colin-scott.github.io/personal_website/research/interactive_latency.html), especially on slow disks. Instead, we chose to compress our vectors with int8 quantization. ]
adamdecaf commented 6 months ago

We have a big usecase in achgateway to merge files on disk. We are currently reading files in groups to merge which has much better memory pressure, but takes more cpu time. If you can create a better method of merging files on disk that would be very helpful!

adamdecaf commented 1 month ago

Lots of fixes were released in https://github.com/moov-io/ach/releases/tag/v1.37.0