[BigQuery] High Number of Memory Allocations when migrating from legacy insertAll to Storage Write API.

indy-singh commented 1 year ago

Environment details

OS: Microsoft Windows Server 2019 Standard 10.0.17763 N/A Build 17763
.NET version: <TargetFramework>net6.0</TargetFramework>

Package name and version:-

<PackageReference Include="Google.Cloud.BigQuery.Storage.V1" Version="3.6.0" />
<PackageReference Include="Google.Cloud.BigQuery.V2" Version="3.2.0" />
<PackageReference Include="Google.Protobuf" Version="3.22.1" />
<PackageReference Include="Google.Protobuf.Tools" Version="3.22.1" />

Steps to reproduce

https://github.com/indy-singh/bigquery_storage_write_api_high_memory_allocation - note that I can't get this to leak, but it isn't 1:1 with production code. Just a test harness in which to try different scenarios.

Additional context: https://github.com/protocolbuffers/protobuf/issues/12217

We have recently migrated from the legacy insertAll to the Storage Write API and we've found that a marked increased in memory (see screenshot in link above). The migration is 1:1 as close as can be.

I've diff'ed the methods ExpensiveStreamingInsert and StorageWriteApi from this file https://github.com/indy-singh/bigquery_storage_write_api_high_memory_allocation/blob/c3b50881b61ccaa8fddfea01945d7f67563cab15/bigquery_storage_write_api_high_memory_allocation/WatchTowerBigQuerySaver.cs (this is the actual production code) and nothing looks super suspicious. Online diff: https://www.diffchecker.com/gAJLZUKz/

Which leads to me believe we are using the Storage Write APIs incorrectly (at construction).

We basically keep one AppendRowsStream open/alive for the duration of the application lifetime (which is typically 12+ hours)

var bigQueryWriteClientBuilder = new BigQueryWriteClientBuilder
{
    Credential = googleCredential,
}.Build();

_appendRowsStream = bigQueryWriteClientBuilder.AppendRows();

Is this incorrect? https://cloud.google.com/bigquery/docs/write-api-best-practices seems to imply what we are doing is correct.

Thanks for any help.

Cheers, Indy

jskeet commented 1 year ago

That looks fine to me - I can't see anywhere that we'd end up holding on to memory there, modulo my thought below.

I'd be tempted to create a new stream somewhat more often than once every 12 hours, but that's a different matter.

One aspect I hadn't considered before is that your code doesn't appear to be reading from the responses stream from the gRPC stream (see here in the original sample code). I wonder whether that means that data is getting buffered... I don't know enough about the BigQuery Storage Write API, but that might explain things.

Please could you try including some code to read from the response stream in the background, and see if that makes any difference? (If this is the problem, we'll definitely want to update the sample to emphasize it more.)

indy-singh commented 1 year ago

Hi @jskeet ,

Happy to do that but I've just trying a slightly different tack; deployed a minor change in which we fetch the default stream every time we want to do a write. The saver pulses every 20 seconds or so, so every 20 seconds it would fetch the default stream, write, and then mark it as complete.

So we basically do:-

var appendRowsStream = _bigQueryWriteClientBuilder.AppendRows();

await appendRowsStream.WriteAsync(new AppendRowsRequest
{
    ProtoRows = protoData,
    WriteStream = _writeStreamName,
});

await appendRowsStream.WriteCompleteAsync();

Inside the while loop.

I'm wondering like you if the missing call to WriteCompleteAsync means that all the data I've ever written stays in a buffer somewhere and then eventually causes a "leak".

Cheers, Indy

jskeet commented 1 year ago

We try to write the data as you go - we definitely don't wait for a WriteComplete method. But gRPC only allows a single request in a stream to be in flight at a time. We use BufferedClientStreamWriter to buffer requests. But actually, as you're awaiting the task returned by WriteAsync, that shouldn't be the problem (unless there's a bug in BufferedClientStreamWriter).

It's a shame that your test code doesn't reproduce the problem at the moment... but I'll give it a try anyway and see if I can see anything odd going on in the BufferedClientStreamWriter in terms of it retaining objects when it shouldn't, for example.

indy-singh commented 1 year ago

I think there is an element of time at play here, the migration to the Storage Write API clearly leaks (need to find a better word) from our monitoring tooling POV. But it's not immediate for example, I left the code overnight on DEMO and one of the boxes jumped at 21:00 for no apparent reason I'm the only one in this app company-wide so it has to be the recent changes. I'll let you know what my different tack reveals if anything.

Cheers, Indy

jskeet commented 1 year ago

One interesting idea: in your real code, do you have multiple threads calling WriteAsync? Could the memory jump be explained by all of them effectively blocking, with the underlying linked list of requests just getting longer and longer?

We wouldn't see that in the sample code, as you're waiting for one write to complete before creating another one... but if your real code is more event-based, so you could have parallel threads writing and waiting, that could explain it.

(When running your test code in a debugger, _appendRowsStream._writeBuffer.BufferedWriteCount is always 0 for me.)

indy-singh commented 1 year ago

We don't have multiple threads calling WriteAsync but it is a very hot code path on our LIVE environment and the while loop will do the write to BQ under two OR conditions:

If it has over the max batch size (10,000)
Or if it has been x amount of time since the last write (e.g. 20 seconds) and we have at least one thing to write

I can try and reflect over that (_appendRowsStream._writeBuffer.BufferedWriteCount) and output it in the logging tomorrow.

Cheers, Indy

jskeet commented 1 year ago

We don't have multiple threads calling WriteAsync but it is a very hot code path on our LIVE environment

Right. In your real code, do you await the task immediately, or queue it somewhere? (Even if you await it, could another job use the same thread to write again?) In the commented-out code, it does look like you're awaiting it directly, and that only that loop will write to it...

I can try and reflect over that (_appendRowsStream._writeBuffer.BufferedWriteCount) and output it in the logging tomorrow.

That would be really useful, thanks.

indy-singh commented 1 year ago

In your real code, do you await the task immediately, or queue it somewhere?

It is awaited directly, and only that loop invokes the write to BQ.

Cheers, Indy

indy-singh commented 1 year ago

Sorry, missed this bit.

Even if you await it, could another job use the same thread to write again?

No. Anything and everything is allowed to call WatchTowerBigQuerySaver.Instance.Enqueue inside this app https://github.com/indy-singh/bigquery_storage_write_api_high_memory_allocation/blob/c3b50881b61ccaa8fddfea01945d7f67563cab15/bigquery_storage_write_api_high_memory_allocation/WatchTowerBigQuerySaver.cs#L486 but in reality only one place does call it.

But all that does is add it to the concurrent queue inside of WatchTowerBigQuerySaver which is then read from inside of the while loop invocation:

Cheers, Indy

jskeet commented 1 year ago

Right - thanks for all the extra detail. I don't see how we could end up buffering anything unless there's a bug in BufferedClientStreamWriter - but that's always a possibility.

It might be worth something logging each time you call WriteAsync, with the number of items and how long it takes for the returned task to complete. Not because I expect anything in particular - just because it looks like "something" happens, and it'd be interesting to see whether the stats for the call change at the same time.

indy-singh commented 1 year ago

So I managed to get _appendRowsStream._writeBuffer.BufferedWriteCount to be something other than 0 (it is 1 after the call to WriteComplete) in this commit. I'm not entirely what that means yet. But I know the row(s) are already in BQ before I call WriteComplete.

So I guess that is just a signal to that instance of the stream that "I'm done with you please clean yourself up".

Cheers, Indy

jskeet commented 1 year ago

Yes, the "it's complete" part is added as an extra entry in the queue here, which is spotted in the processing here.

The fact that it's never removed from the queue (because there's no continuation) may or may not be intentional - it's been 6 years since I wrote the first version of this code... but it shouldn't do any harm, and definitely wouldn't explain the high memory usage you're seeing.

It could be that this queuing is a red herring, of course... but it would be nice to rule it out.

Just to check my own thought process here:

You're only converting from your internal format into the API message using ByteStrings just before sending
You've used a profiler to show that a lot of the memory is taken up by ByteStrings

For the second part, of that, was that profiler running against a "real" system (sending messages) or just the early benchmark which did nothing but create the API messages? If it's the latter, it's understandable - but if you're getting a lot of ByteStrings building up in the real system (whether live or demo) then that would suggest it has to be something building up late on, rather than in your own internal queue.

indy-singh commented 1 year ago

You're only converting from your internal format into the API message using ByteStrings just before sending

Correct. See here for legacy insert and here for modern insert. I did note that this:-

new ProtoSchema
{
    ProtoDescriptor = WatchtowerRecord.Descriptor.ToProto(),
};

Also caused allocations on every invocation hence why I hoisted it up into the constructor.

You've used a profiler to show that a lot of the memory is taken up by ByteStrings For the second part, of that, was that profiler running against a "real" system (sending messages) or just the early benchmark which did nothing but create the API messages?

Correct and it was just an early benchmark which did nothing but create the API messages. The reason behind this is legacy insert is well battle tested it has been pushing ~50k total_requests per day / ~150 million total_rows per day / ~100GB total_input_bytes per day since 05 July 2021 10:44:15 I'm using this table to gather that information.

It was a simple diff of the legacy insert vs modern insert, see what actually had changed after which a console app to reproduce and put through dotMemory.

That being said, it's clear now that ToByteString was a red herring (it still does a lot of allocations in Gen0 - ArrayPool?). I now think it's slightly more complicated that as this change has been overnight on our DEMO environment and so far is stable.

My guess is that having a long lived default stream and writing to it over a period of time (lets say hours) without ever calling WriteCompleteAsync can eventually cause a memory pressure/leak situation. I've been trying to get a faster feedback loop so we can remove the time element of this, but it's proving difficult to pin down.

Please could you try including some code to read from the response stream in the background, and see if that makes any difference?

I still want to do this, but I don't want to change too much too quickly when faced with a new problem. Perhaps one more night.

Cheers, Indy

jskeet commented 1 year ago

I did note that this: (ProtoSchema initialization) Also caused allocations on every invocation hence why I hoisted it up into the constructor.

Yes, that was something I was going to suggest - although it wouldn't be doing a lot.

My guess is that having a long lived default stream and writing to it over a period of time (lets say hours) without ever calling WriteCompleteAsync can eventually cause a memory pressure/leak situation

It might cause issues in the BigQuery Storage API, but it shouldn't cause any local memory issues - because we don't hold on to the messages locally.

I don't know much about the BigQuery Storage API, but it sounds like it should be fine to fetch and write to the default stream frequently. If you wanted to move to use application streams, I suspect your "create stream, write, complete request" every 20 seconds might cause issues, and instead you'd want to keep it open for some time - but I'd probably guess at an hour being a reasonable time rather than "the whole lifetime of the process".

I still really don't understand why you were getting local memory issues though... next time you're making a change, if you could add the logging as per this earlier comment, that would be good to know:

It might be worth something logging each time you call WriteAsync, with the number of items and how long it takes for the returned task to complete. Not because I expect anything in particular - just because it looks like "something" happens, and it'd be interesting to see whether the stats for the call change at the same time.

(If you're not already logging the length of your own queue, that would be good to know too.)

indy-singh commented 1 year ago

Hi @jskeet,

Apologies, I've been AFK. I now have the additional logging (see attached zip). The logs are from our DEMO environment which is low traffic, so where beforeAggr ramps up to 10000 that is me load testing with good old fashioned curl.

This is the code that I logged out:-

var sw = Stopwatch.StartNew();

try
{
    sw.Restart();
    var protoData = new AppendRowsRequest.Types.ProtoData
    {
        WriterSchema = _writerSchema,
        Rows = new ProtoRows
        {
            SerializedRows = { list.Select(x => x.ToByteString()) },
        },
    };

    var timeToCreateRows = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    var appendRowsStream = _bigQueryWriteClientBuilder.AppendRows();
    var timeToFetchDefaultStream = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    await appendRowsStream.WriteAsync(new AppendRowsRequest
    {
        ProtoRows = protoData,
        WriteStream = _writeStreamName,
    });
    var timeToWriteAsync = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    await appendRowsStream.WriteCompleteAsync();
    var timeToWriteCompleteAsync = sw.Elapsed.TotalMilliseconds;

    if (Settings.Config.IsEnabled("OtherBigQueryStatsSaverLogsOn"))
    {
        Log.Warn(new LogDetails($"Aggregated {length} individual saves to be {localConcurrentDictionary.Count} saves")
        {
            ObjectDump = new
            {
                count = list.Count,
                duration = sw.Elapsed.TotalMilliseconds,
                remaining = _concurrentQueue.Count,
                beforeAggr = length,
                afterAggr = localConcurrentDictionary.Count,
                sizeOfRequest = list.Sum(x => x.CalculateSize()),
                timeToCreateRows,
                timeToFetchDefaultStream,
                timeToWriteAsync,
                timeToWriteCompleteAsync
            }
        });
    }
}

logs.zip

Cheers, Indy

jskeet commented 1 year ago

Thanks, I'll have a look at the logs tomorrow.

indy-singh commented 1 year ago

There's not many, I'll need to rinse and repeat on production either later on tonight or first thing in the morning. Unfortunately, because it logs a lot of production it's not something I can leave on unattended.

jskeet commented 1 year ago

Thanks for those logs. The log entry at 2023-03-15 17:19:25.312785 which seems to take 20 seconds is slightly worrying - but I wonder whether that was while the server was busy with other initialization tasks. I'm glad you mentioned the load testing for the period where "remaining" didn't stay at 0.

But otherwise, this doesn't seem to show anything particularly concerning. Given the short timeframe, I assume you didn't spot any memory issues during this run either? Will look forward to further logs - sorry that it's taking a while to get to the bottom of this.

indy-singh commented 1 year ago

Hi @jskeet ,

I'm not sure I understand where you are getting 20 seconds from? All those time are in milliseconds (e.g. sw.Elapsed.TotalMilliseconds).

I've switched on the logs for prod since 2023-03-16T09:16:52 please find them attached I'll keep them switched on throughout the day.

prod_logs.zip

The good news is I don't see any memory issues right now, so we might have fixed by fetching the default stream and completing it around a write.

Cheers, Indy

indy-singh commented 1 year ago

Will look forward to further logs - sorry that it's taking a while to get to the bottom of this.

Ah it's no big deal fella, it's being investigated and it's not uncommon for me to spend a few days on platform issues like this. It's actually pretty fun! 👍🏽

Cheers, Indy

jskeet commented 1 year ago

Ah, whoops - for some reason I was assuming seconds. 20ms really shouldn't be a problem :)

Glad to hear completing the write looks like it may have fixed things. As I mentioned before, I don't know whether it's really appropriate to stream just a single request each time - but it does sound like you're well within the limits. You might want to consider streaming for somewhat longer (e.g. 10 minutes or an hour) but this isn't an area of expertise for me.

(I still don't understand why it would change the memory usage though....)

indy-singh commented 1 year ago

I still think there is a very gentle incline in memory consumption. It's hard to see over a day, but if we graph it over two weeks you can see the upward slopes. You can ignore the spikes entirely, those are post-deploy spikes which I think are to do with NET6 itself they recover after a period of time.

The real test will be the weekend on our DEMO environment. You can see that 2023-03-10 through to 2023-03-13 is flat. And after 2023-03-13 is when I started the migration to Storage Write API.

Cheers, Indy

indy-singh commented 1 year ago

I've commandeered a DEMO box and I'm going to let:-

dotnet-counters collect --process-id my_pid --refresh-interval 30 --format csv

Run for an hour or so, to see if that will tell us anything useful.

https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-counters#examples

indy-singh commented 1 year ago

Not sure how useful this is going to be, but at least it gives us a baseline.

demo_counters.zip

indy-singh commented 1 year ago

I rolled out the reflection code to log out bufferedWriteCount at 2023-03-16 15:26:07.47691+00 and across 495 (that's all of them so far) samples it is always 1.

Cheers, Indy

indy-singh commented 1 year ago

So I can get this to happen:-

0893/5000       35,939 kb
0894/5000       35,969 kb
0895/5000       35,998 kb
0896/5000       36,014 kb
0897/5000       36,058 kb
0898/5000       36,092 kb
0899/5000       36,098 kb
0900/5000       36,149 kb
0901/5000       36,169 kb
0902/5000       36,211 kb
0903/5000       36,237 kb
0904/5000       36,271 kb
0905/5000       36,287 kb
0906/5000       36,339 kb
0907/5000       36,373 kb
0908/5000       36,399 kb
0909/5000       36,425 kb
0910/5000       36,458 kb
0911/5000       36,466 kb
0912/5000       36,518 kb
0913/5000       36,555 kb
0914/5000       36,586 kb
0915/5000       36,620 kb
0916/5000       36,646 kb
0917/5000       36,671 kb
0918/5000       36,696 kb
0919/5000       36,768 kb
0920/5000       36,786 kb
0921/5000       36,837 kb
0922/5000       36,871 kb
0923/5000       36,905 kb
0924/5000       36,931 kb
0925/5000       36,963 kb
0926/5000       36,970 kb
0927/5000       37,012 kb
0928/5000       37,056 kb
0929/5000       37,064 kb

With this commit. But I think that is sort of expected as there is a lot of byte[] allocation churn when adding to SerializedRows.

Cheers, Indy

jskeet commented 1 year ago

Right. In particular, this is very different from the "upward cliffs" you showed on graphs earlier on. I would personally expect that over time, that gradual upward trend will settle down to be flat. (I expect the large object heap to end up fragmented, but eventually you'll have had "the largest request" and everything else will be fit.)

indy-singh commented 1 year ago

Yeah absolutely no upwards cliffs since the change to fetch the default stream each time and call write complete afterwards.

There is a very slow rise in the last screenshot I posted that might warrant further investigation. But at this point I'm kinda out of roads to investigate. I think it needs more time on DEMO and LIVE environments, until then we can pause this issue.

Cheers, Indy

jskeet commented 1 year ago

Yup - will wait for you to provide more information after it's had more time to soak. I do wish I understood the difference... I really, really wouldn't expect it to make this difference. (If you end up with enough time to try going back to the "broken" code but with the extra logging, that's great - but I completely understand that's extra work with little benefit, if the problem is basically fixed for you.)

indy-singh commented 1 year ago

I'm 100% going to go back to the old code on a spare machine and leave it inserting over night to see what crops up. Hopefully tomorrow or early next.

The plan right now is to leave the "new" code in on DEMO and LIVE and see what occurs over night. We get plenty of traffic overnight on LIVE so that'll be a litmus test for the "new" code.

In the mean time I don't really know the GCP dotnet libs that well, so I'm going to spend some time with a cuppa and Jammy Dodgers reading through anything that piques my interest.

Cheers, Indy

indy-singh commented 1 year ago

Just attaching the prod logs as we reach the end of the day. I had a quick peruse through and nothing jumped out as wrong the memory is stable on DEMO and LIVE.

2023-03-16_prod-logs-end-of-day.zip

indy-singh commented 1 year ago

Hi,

Quick update, absolutely no issues over night. Totally fine. I guess the fetch default stream + write complete was the solution.

I still plan on going back to the bad code and letting it run on a spare machine. I will update when I have more information.

Thanks, Indy

indy-singh commented 1 year ago

Hi,

So I left a console app running for a few hours with two different workloads (individually). This is with the "bad code" where we fetch the default stream at construction and leave it open and write to it on schedule.

output.log is looping over a single collection of 50,000 items and inserting one by one with a 1-minute wait in between.

output2.log is looping 10,000 times and inserting a collection of 5,000 items (to reproduce LIVE like behaviour) with a 1-minute wait in between.

And other than the "slow" memory leak (that is down to the byte[] allocation churn) there is no upwards cliff of memory consumption. I'm tempted to conclude this portion of the issue as "can't reproduce".

But I would like to pivot and see if anything can be done about the slow memory leak.

This is 2023-03-11:

This is 2023-03-18:

Please note the x-axis on both graphs;

2023-03-11 is flat and using the legacy insertAll.
2023-03-18 has a relatively slow incline in memory consumption and is using the modern protobuf insert.

Happy to close this issue and mark as "can't reproduce" and open another issue around the slow memory leak.

I'm aware that it is the weekend, so please don't feel obliged to reply until you are ready.

Cheers, Indy

jskeet commented 1 year ago

It's not clear to me yet whether the "slow leak" is actually a leak at all, or whether it's just slowly reaching an asymptotic maximum... the last 6 hours look fairly flat, for example. If you're able to run the same test for longer - or possibly run the same code but with a shorter gap between inserts - that would be good. Ultimately if this is a matter of "creating large objects can lead to heap fragmentation" there's not much we can do about that within protobuf, gRPC or the client libraries... although you might want to try writing many smaller batches. (You could still "wake up" at the same threshold, but then write 50 requests of 100 items instead of 1 request with 5000 items, for example.) That's likely to significantly reduce the size of each request. If you could get the sizes down to under the large object threshold (still 85,000 bytes, I believe) that could really help.

I'll close this issue for now, but happy to help on a new one - see if the suggestions above are any use to you. Thanks for such a constructive engagement.

indy-singh commented 1 year ago

I don't want to keep a closed issue alive but I do consider this a regression in memory consumption (I agree the word leak is wrong for this context) the legacy insertAll method did not show any kind of memory concern. The only thing that has changed was the migration to the Storage Write API.

I could indeed tweak the saver to run more often and send a smaller payload around 81,920 bytes but that does mean the Storage Write API migration isn't straight forward.

For now we are still pursuing the Storage Write API.

Cheers for the help so far.

indy-singh commented 1 year ago

Ah, problem still exists even with the new code.

Bummer. Not sure where to dig into next.

jskeet commented 1 year ago

The drops to zero shortly before the large spike seem "interesting" (and it's also interesting to note that although there are the shallow increasing slops, they drop back down about once a day).

Do you have logs for around this spike, to see whether you ended up with a backlog etc?

indy-singh commented 1 year ago

Apologies for brevity, on mobile.

The drops to zero will be deploys.

We don't have logs as I turned them off. More than happy to leave logging on for an extended period of time but I'm fairly sure there was not a huge backlog.

Again we've had this code using the legacy insertAll for years now and only just migrated to storage write api. Frustratingly it doesn't seem easy to reproduce.

Indy

On Tue, 21 Mar 2023, 09:36 Jon Skeet, @.***> wrote:

The drops to zero shortly before the large spike seem "interesting" (and it's also interesting to note that although there are the shallow increasing slops, they drop back down about once a day).

Do you have logs for around this spike, to see whether you ended up with a backlog etc?

— Reply to this email directly, view it on GitHub https://github.com/googleapis/google-cloud-dotnet/issues/10034#issuecomment-1477523483, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAORQWTI4D22H2BWUEYBMMTW5FZBNANCNFSM6AAAAAAV2VQV5U . You are receiving this because you authored the thread.Message ID: @.***>

indy-singh commented 1 year ago

Hi,

Headline:

The migration to BQ Storage Write API has been exonerated and completely ruled out as the memory culprit.

Longer explanation:

We had another memory spike at 2023-03-21 15:38:00+00 and we had diagnostic logging turned on. We have 12 production boxes and only one (IIS001) showed the memory spike at that time. Our LIVE traffic is very consistent so I would expect the diagnostic logging to show a massive influx of requests waiting to be saved and other boxes to be affected.

However, that was not the case. Looking at the logs (I've attached the raw logs) so anyone can re-create the graphs.

data-1679476642377.zip

There are no abnormalities or huge spikes in pending writes.

This graph shows the the five minute window around 15:38 and it doesn't look wrong. If the remaining count had shot up to say 40,000 then I would say there is an issue but as you can see it is relatively boring. And then when I rinse and repeat with other boxes that did not have the memory at the time, the charts are pretty similar.

(pro-tip: this is how I got the the window to highlight).

For anyone else looking at the logs:-

{
  "count": "is how many rows we are going to save to BQ",
  "duration": "can be ignored, timeTo* is more accurate per BQ operation",
  "afterAggr": "same as count",
  "remaining": "how many pending writes left in the backing queue",
  "beforeAggr": "is how many pending writes we took off the backing queue",
  "sizeOfRequest": "how big the protobuf request is in bytes that we are about to send to BQ",
  "timeToCreateRows": "how long in ms",
  "timeToWriteAsync": "how long in ms",
  "bufferedWriteCount": "can be ignored, it is always one",
  "timeToFetchDefaultStream": "how long in ms",
  "timeToWriteCompleteAsync": "how long in ms"
}

New working theory:

This app was recently migrated from 472 to dotnet6.0. The memory baseline is a lot lower with 6.0:-

So the memory spikes could hide among the weeds before 6.0 but now they show up because the baseline is a lot lower.

@jskeet Sorry this took me a while to unwind and erroneously point the finger at BQ/Storage Write API. I don't believe in coincidences and it was a classic "can't see the wood for the trees" situation. Apologies again.

Cheers, Indy

jskeet commented 1 year ago

No problem - thanks for all this information. Hope your continued migration to the Storage Write API is smooth :)

indy-singh commented 1 year ago

Ultimately if this is a matter of "creating large objects can lead to heap fragmentation" there's not much we can do about that within protobuf, gRPC or the client libraries... although you might want to try writing many smaller batches. (You could still "wake up" at the same threshold, but then write 50 requests of 100 items instead of 1 request with 5000 items, for example.) That's likely to significantly reduce the size of each request. If you could get the sizes down to under the large object threshold (still 85,000 bytes, I believe) that could really help.

This has been done for a week or so, and while the incline has lessened (e.g. it is not as steep) there is still a steady growth.

We write in batches of 400 and each list is no larger than 60k-70k which is well under the threshold of ~85k

"avg"   "min"   "max"   "stddev"    "count"
60187.79520977003   102 70098   13603.80623957  1164704

using var cancellationTokenSource = new CancellationTokenSource(TimeSpan.FromSeconds(5));
var appendRowsStream = _bigQueryWriteClientBuilder.AppendRows(CallSettings.FromCancellationToken(cancellationTokenSource.Token));
var sizes = new List<int>();

// ToByteString creates a lot of memory churn on Gen0
// The average watchtower record when protobuf is around 162.4876265161838294 bytes (sample size of 115215).
// so we now chunk over the list in chunks of 400 which would put us at around 64,800 bytes for this list in it's entirety.
// previously the average list size is 453047.083047893235 bytes (sample size of 115319)
// which was going onto the LOH
foreach (var records in list.Chunk(400))
{
    sizes.Add(records.Sum(x => x.CalculateSize()));

    var protoData = new AppendRowsRequest.Types.ProtoData
    {
        WriterSchema = _writerSchema,
        Rows = new ProtoRows
        {
            SerializedRows = { records.Select(x => x.ToByteString()) },
        },
    };

    await appendRowsStream.WriteAsync(new AppendRowsRequest
    {
        ProtoRows = protoData,
        WriteStream = _writeStreamName,
    }).ConfigureAwait(false);
}

await appendRowsStream.WriteCompleteAsync().ConfigureAwait(false);

I've had a poke around the library and it appears to be taking a copy of the byte[] in a few places - is there a chance of introducing an array pool?

Thanks, Indy

jskeet commented 1 year ago

I've had a poke around the library and it appears to be taking a copy of the byte[] in a few places - is there a chance of introducing an array pool?

Almost certainly not. There's rarely a suitable place to put it - at least not within the protobuf runtime itself, and I suspect that's where most of the actual copying is happening.

I'm still surprised to see anything resembling a leak though - I can't think of anywhere that would statically keep any data. Given that you're writing the records out to BQ, is there any way you can "replay" all the data in a local way without the BQ server itself actually being involved (using a fake, basically) so that we could try to reproduce it that way? (The more we can isolate this, the better.)

To be really clear about this - I can't see anything wrong with the code you've got. It looks absolutely fine to me.

One other thing you could try: this is now running on .NET 6.0, right? That means it'll be using Grpc.Net.Client by default. I very much doubt that it's a gRPC-implementation issue, but it's just possible. If you add a manual NuGet dependency on Grpc.Core, and set the GrpcAdapter to GrpcCoreAdapter.Instance in the builder for the client (see the transport selection docs) then it would be good to know if we get the same sort of behavior. If we do observe any difference, that would definitely help to pin it down.

indy-singh commented 1 year ago

Sorry, I'm going to ask the naïve questions to make sure that we are on the same page.

Previously, the theory (that I was peddling) is that ToByteString was the culprit as the collection it ended up producing was in excess of 85,000 bytes - so it made it's way to the LOH.

However, as we now chunk over the list in ranges of 400, we know that no list exceed the 60,000-70,000 bytes range and stays well under the magic number of 85,000 bytes.

This is why we've now turned our attention to:-

One other thing you could try: this is now running on .NET 6.0, right?

Yes

That means it'll be using Grpc.Net.Client by default. I very much doubt that it's a gRPC-implementation issue, but it's just possible. If you add a manual NuGet dependency on Grpc.Core, and set the GrpcAdapter to GrpcCoreAdapter.Instance in the builder for the client (see the transport selection docs) then it would be good to know if we get the same sort of behavior. If we do observe any difference, that would definitely help to pin it down.

Which I've pushed out to DEMO this morning and will deploy to LIVE later on today.

Cheers, Indy

jskeet commented 1 year ago

Sorry, I'm going to ask the naïve questions to make sure that we are on the same page.

All sounds good :) The more we can make sure we're on the same page, the better. And if we can end up with an isolated repro, that would be even better, of course. Let's see how things go with Grpc.Core...

indy-singh commented 1 year ago

Setting GrpcAdapter = GrpcCoreAdapter.Instance didn't make a difference.

The faded red line is where that change was deployed. The slow incline remains the same.

I've just pushed out a change so that the BQ savers are disabled on one of the four DEMO boxes I expect that box to remain flat and the remaining three to continue to slow incline. Otherwise egg on my face :)

Cheers, Indy

jskeet commented 1 year ago

Okay - I guess it's "probably good" to know that it's not a gRPC issue. Aside from anything else, that means I'd expect us to be able to reproduce it with a fake gRPC server CallInvoker that doesn't actually do anything. Will wait to see the results of your latest test before we embark on that though.

indy-singh commented 1 year ago

Apologies for the delay in providing an update, I had hoped to do it this morning having pushed out the change at 9pm last night. However, I wasn't aware that there was ongoing maintenance to our DEMO environment that invalidated any results.

Re-did the experimental change this morning; can you tell which box has it turned off?

So I'm not super sure where to go from here. It's either our code or the GCP library code. Not sure what it could be in our code, as it's mostly been the same for 2+ years.

Any ideas? I'm tempted to take a memory dump a few hours apart and see if I can compare them in dotMemory.

Cheers, Indy

jskeet commented 1 year ago

Any ideas? I'm tempted to take a memory dump a few hours apart and see if I can compare them in dotMemory.

That sounds good.

Another thing which would be useful if you could do it would be to change the "off" box to still think it's writing the records - but use a fake gRPC implementation which just discards all requests. I can help write that code if it would be useful. (There are a couple of different options there.) If we see the memory go up even with a fake, then we should be able to work on something that reproduces the issue without ever having to touch GCP, which would really help. My expectation is that you'd be able to have the same code on all machines, and just a change to how you construct the BigQueryWriteClientBuilder for the "off" machine.

indy-singh commented 1 year ago

Actually, I didn't need to wait for a few hours. I took a memory dump of a 101 that has it turned on and 104 that has it turned off, obviously I can't share the memory dumps (they will 100% contain PII). But from very naïve first look, this jumps out.

Leaky box:

Non leaky box:

That size looks like a default buffer size, and we only partially use it. Reminds me of this: https://github.com/aws/aws-sdk-net/issues/894

Thoughts?

jskeet commented 1 year ago

Hmm... I can easily imagine that being a result of writing "a bit more than 64K" to a MemoryStream. (The capacity of the byte array doubles each time.) So even under the 85K limit of how much data we're trying to write, if it hits 64K and something is using MemoryStream, we could end up with 131072-byte arrays in the LOH. Note that Message.ToByteString() doesn't do this - it works out the size it needs first and allocates a byte array of that size, at least as far as I can see.

One way of testing whether this is causing the problem would be to take your batch size down from 400 (which you estimated to be "between 60K and 70K") to 300 (which should then be ~45K-52.5K). I'd expect the byte arrays backing any MemoryStream to then be 64K in size rather than 128K, and therefore not on the LOH.

It's still playing whackamole admittedly, but it feels like we're making progress - thanks very much for this! I'll have a look at places we use MemoryStream in the various layers I'm familiar with (Protobuf, GAX and the generated libraries).

googleapis / google-cloud-dotnet