google / clusterfuzz

Scalable fuzzing infrastructure.
https://google.github.io/clusterfuzz
Apache License 2.0

High volume of GCS bucket logging #1991

Open · urbanenomad opened this issue 4 years ago

urbanenomad commented 4 years ago

We are running 100 preemptible VMs in our ClusterFuzz deployment on GCP, and we are seeing a very high volume of GCS bucket Stackdriver logging, almost 2 TB in a month. The cost is quite high. Do we need all of this GCS bucket data access logging? Can we set it to just warnings or errors?

inferno-chromium commented 4 years ago

Can you submit a patch? Where is this logging happening? We don't want it either.

urbanenomad commented 4 years ago

The logging seems to be happening under GCS Bucket Resource Logs Ingestion, which is reporting almost 2 TB of data.

https://drive.google.com/file/d/19y9LVCTKrmwfo3eXSMLd2FBspGSGhMhb/view?usp=sharing
https://drive.google.com/file/d/1_IQe_mIRT4Kuf2gkJxfY7Nm14Gk0PAEW/view?usp=sharing

Keep in mind this project only contains the ClusterFuzz app and the services needed for it to run, nothing else.

inferno-chromium commented 4 years ago

You'll need to Google how to disable these, and then we can discuss more. I won't have time to investigate this for another week or two.

urbanenomad commented 4 years ago

I already opened a support ticket with Google on this, but was wondering whether this is something new or whether anyone else has experienced it. I was hoping there would be a simple configuration change, but if this is something new I will continue to work with GCP support.

urbanenomad commented 4 years ago

The logging seems to be costing substantially more than the compute.

inferno-chromium commented 4 years ago

I have not seen logging cost more than compute in three different project deployments, so this is probably some odd config. But I do remember Stackdriver logging costing a little more than expected.

urbanenomad commented 4 years ago

On a side note, do the buckets need to be multi-regional? Can we set them to a specific region, since all the VMs are in the same region anyway? As you can see, the cost of Multi-Region Standard Class B Operations also seems high.

inferno-chromium commented 4 years ago

> On a side note, do the buckets need to be multi-regional? Can we set them to a specific region, since all the VMs are in the same region anyway? As you can see, the cost of Multi-Region Standard Class B Operations also seems high.

Yes, you can use a specific region to save some costs. But we have multi-region buckets and have never seen costs that high. Maybe check bot.log to see if something is constantly failing and talking to GCS.

urbanenomad commented 4 years ago

I looked at a couple of bot logs on the running VMs and I don't see much failing. Then again, I am only checking a small sample of VMs. Other than SSHing into the VMs to look at the bot logs, is there an easier way to find them?

inferno-chromium commented 4 years ago

> I looked at a couple of bot logs on the running VMs and I don't see much failing. Then again, I am only checking a small sample of VMs. Other than SSHing into the VMs to look at the bot logs, is there an easier way to find them?

Just use Stackdriver; all bot logs should be going there. No need to SSH into the VMs.
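For example, a Logs Viewer filter along these lines will surface failing bots across the whole fleet without SSH. This is only a sketch: it assumes the bots run as ordinary GCE instances, and the severity threshold is an assumption to adjust for your setup.

```python
# Sketch of a Cloud Logging (Stackdriver) filter for spotting bots that are failing
# repeatedly, assuming the bots are plain GCE instances. The severity threshold is an
# assumption; tune it to whatever level your bot logs actually emit.
BOT_ERROR_FILTER = """
resource.type="gce_instance"
severity>=ERROR
""".strip()

# Paste this into the Logs Viewer (or pass it to a logging client of your choice)
# instead of SSHing into individual VMs.
print(BOT_ERROR_FILTER)
```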

urbanenomad commented 4 years ago

So I found a way to filter out the logging and reduce its cost; I will have to see how much this filter lowers the overall bill. But I think the logging is more a symptom of a deeper issue. I also notice that GCS Class B operations are very high, about 1,771,815,976 operations. I know the engineering teams have been adding more fuzzing jobs, but is it normal to see this many read operations against GCS? Could it be how the engineers are setting up their fuzzing jobs? What causes reads from the VMs to GCS?
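For anyone hitting the same thing, a filter along these lines can be used as a Cloud Logging exclusion to drop the GCS data-access entries before they are ingested and billed. This is a sketch, not necessarily the exact filter used here; PROJECT_ID is a placeholder, and you should verify the log name in your own project first, since excluded entries cannot be recovered later.

```python
# Sketch of an exclusion filter for the high-volume GCS data-access audit entries.
# PROJECT_ID is a placeholder; confirm the log name in your own project before
# excluding anything.
PROJECT_ID = "your-project-id"

GCS_DATA_ACCESS_FILTER = (
    'resource.type="gcs_bucket" AND '
    f'logName="projects/{PROJECT_ID}/logs/cloudaudit.googleapis.com%2Fdata_access"'
)

# Use this string as the filter of a Cloud Logging exclusion (created in the console
# or via gcloud) so the entries are dropped at ingestion time.
print(GCS_DATA_ACCESS_FILTER)
```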

inferno-chromium commented 4 years ago

Just too busy for the next week or two.

Are you using different buckets per job? Maybe something about GCS costs more with too many buckets; your GCS sales/support contact may be able to help debug where these costs are coming from.

urbanenomad commented 4 years ago

Yeah, I don't expect you guys to debug our issue, just to offer some advice or guidance. The vast majority of GET requests are hitting the corpus bucket, so that is probably where the issue resides. I also found the Class B operations cost difference between multi-regional and regional to be negligible. It does look like the high cost comes from the VMs making a large number of GET operations against the corpus bucket. What part of the fuzzing task requires reading from the corpus bucket?

inferno-chromium commented 4 years ago

fuzz_task is the one that incurs the big costs. It could be one of these functions causing the high GCS usage:
https://github.com/google/clusterfuzz/blob/master/src/python/bot/tasks/fuzz_task.py#L492
https://github.com/google/clusterfuzz/blob/master/src/python/bot/tasks/fuzz_task.py#L535
https://github.com/google/clusterfuzz/blob/master/src/python/bot/testcase_manager.py#L838

You can place an early return in these functions and then check which one is the major culprit, based on the cost going down after that change.
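Concretely, the temporary change being suggested looks something like this. The function name below is a placeholder for illustration, not the actual ClusterFuzz symbol at the linked lines; short-circuit one GCS-heavy path at a time and watch whether the Class B operation count drops.

```python
# Hypothetical illustration of the experiment described above: disable one GCS-heavy
# code path in fuzz_task.py at a time and compare the GCS operation count / cost over
# the next billing window. The function name is a placeholder, not the real
# ClusterFuzz function at the linked lines.
def sync_corpus_from_gcs(corpus_bucket, local_dir):
  # Temporarily disabled to measure this path's contribution to GCS read volume.
  return
  # ... original body (GCS downloads) would follow and is now unreachable ...
```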

urbanenomad commented 4 years ago

Quick question: does the fact that we are using 100 preemptibles and one standard VM cause more reads from GCS? Would it be cheaper, from a GCS read standpoint, to run more than one standard VM? Should we go with something like a 20:80 ratio of standard VMs to preemptible VMs?

inferno-chromium commented 4 years ago

> Quick question: does the fact that we are using 100 preemptibles and one standard VM cause more reads from GCS? Would it be cheaper, from a GCS read standpoint, to run more than one standard VM? Should we go with something like a 20:80 ratio of standard VMs to preemptible VMs?

We also use mostly preemptibles, several thousand of them, and only a few hundred standard ones. This shouldn't be related to standard vs. preemptible.

oliverchang commented 4 years ago

Do you possibly have audit logging turned on for your GCS buckets? This shouldn't be the default, but perhaps you have this set at the organization level.

See https://cloud.google.com/logging/docs/audit/configure-data-access

I don't believe ClusterFuzz itself is directly generating this volume.
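One way to check is to look at the project (or organization) IAM policy for a Data Access audit config on Cloud Storage, as described in the linked docs. A minimal sketch, assuming you have exported the policy first, e.g. with `gcloud projects get-iam-policy PROJECT_ID --format=json > policy.json` (the file path is a placeholder):

```python
# Sketch: check an exported IAM policy for Cloud Storage Data Access audit logging.
# "policy.json" is a placeholder path for the exported policy.
import json


def storage_data_access_enabled(policy: dict) -> bool:
  """True if DATA_READ/DATA_WRITE audit logs are enabled for Cloud Storage."""
  for cfg in policy.get("auditConfigs", []):
    if cfg.get("service") in ("storage.googleapis.com", "allServices"):
      log_types = {c.get("logType") for c in cfg.get("auditLogConfigs", [])}
      if log_types & {"DATA_READ", "DATA_WRITE"}:
        return True
  return False


with open("policy.json") as f:
  print(storage_data_access_enabled(json.load(f)))
```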

urbanenomad commented 4 years ago

> Do you possibly have audit logging turned on for your GCS buckets? This shouldn't be the default, but perhaps you have this set at the organization level.

Yes, we do have audit logging turned on for security. After thinking about this further, the logging is a symptom; the major reason for the high cost is the GCS reads. I was able to filter out the logging cost, but the GCS reads are still very high. The majority of the reads are against the corpus bucket.

corpus.qct-clusterfuzz1-prd.appspot.com: GetObjectMetadata: Request Count 44,580.00
corpus.qct-clusterfuzz1-prd.appspot.com: ReadObject: Request Count 44,536.00

oliverchang commented 4 years ago

This is expected, since corpora can be large and this is multiplied by the number of bots you have. However, in your screenshots I still see that your logging costs are higher than the actual GCS costs, so our only recommendation at this point would be to turn off audit logging.

urbanenomad commented 4 years ago

So we turned off logging for reads on the corpus bucket, and that seems to have drastically reduced the logging cost. But we still have a lot of read operations on the corpus bucket: over about a month, 100 n1-standard-1 VMs (100 CPUs total) produced about 3 billion read operations. Would it be better to have 50 n1-standard-2 VMs, which would still give us 100 CPUs but fewer bots? What is more important for fuzzing, the number of CPUs or the number of bots? And would we reduce the number of read operations by using fewer, bigger bots?
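As a rough sanity check on that rate (using the figures above and assuming a 30-day month):

```python
# Back-of-the-envelope rate implied by the numbers above: ~3 billion GCS reads from
# 100 bots over roughly a month (30 days assumed).
total_reads = 3_000_000_000
bots = 100
seconds_per_month = 30 * 24 * 3600

per_bot_per_second = total_reads / bots / seconds_per_month
print(f"~{per_bot_per_second:.1f} GCS read ops per bot per second")  # roughly 11-12/s
```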

inferno-chromium commented 4 years ago

Logging for reads and writes does not make any sense, since it will be a ton of data, so it's great you disabled it. I don't think we ever enabled it.

n1-standard-2 won't help here. I think if you use non-preemptible/regular bots you will save on disk, since the corpus won't need to be re-synced. But regular bots cost far more than preemptible bots. @oliverchang, any thoughts here?

urbanenomad commented 4 years ago

What is your ratio of regular bots to preemptible bots in your clusters?

inferno-chromium commented 4 years ago

Roughly 1:20 (non-preemptible:preemptible).