Open yuvaldrori opened 5 years ago
I have tried a couple of times and have been unable to reproduce this failure. Also, based on the API responses in your debug log, I can't see what the exact failure was, because the operation polling just shows "code": "UNKNOWN". Are you still able to consistently reproduce this error, and if so, can you look in the Cloud Console UI and see whether there is a more detailed reason that the SQL instance failed to create?
@chrisst I just ran the TF script again: one machine succeeded and the other failed with the same unknown error. The UI says exactly the same: "An unknown error occurred". I did open a ticket with Google support, and they asked me to test it with a gcloud command. I was not able to repro with gcloud. The bash script I used:
`for i in 774yhf5 59swec6; do gcloud beta sql instances create gcloud-test-$i --async --database-version MYSQL_5_7 --tier db-f1-micro --region us-central1 --network network --project=some-project-name & done`
I'm still unable to reproduce your error with terraform, but using the UI I was unable to modify or delete multiple peering routes because of the error: "There is a peering operation in progress on the local or peer network. Try again later." It sounds like this could be what is happening with your config. Can you check http://console.cloud.google.com/networking/peering/list and http://console.cloud.google.com/networking/routes/list to see if there is a similar error on any of those automatically generated resources?
If there is, we should be able to tweak the lock on SQL operations to account for the peering operations as well. It won't solve cross-resource contention (SQL + GKE), but it should fix the count-based SQL clusters.
Sorry for the late reply, I just set up notifications. Anyway, I ran the TF script again and saw the same errors, but when I check the routes and peering lists everything looks OK:
gcloud alpha services vpc-peerings list --project test-tf-project-e93f7cc2 --network test-network
---
network: projects/221637507821/global/networks/test-network
peering: servicenetworking-googleapis-com
reservedPeeringRanges:
- private-ip-alloc
service: services/servicenetworking.googleapis.com
---
network: projects/221637507821/global/networks/test-network
peering: cloudsql-mysql-googleapis-com
reservedPeeringRanges:
- private-ip-alloc
service: services/servicenetworking.googleapis.com
gcloud compute routes list --project test-tf-project-e93f7cc2
NAME                            NETWORK       DEST_RANGE       NEXT_HOP                       PRIORITY
default-route-28a3d65e45473cb3  test-network  172.20.181.0/24  test-network                   1000
default-route-b6db25e614d13793  test-network  0.0.0.0/0        default-internet-gateway       1000
peering-route-339f952832fab934  test-network  192.168.0.0/24   cloudsql-mysql-googleapis-com  1000
I don't get how come I can get this error every time and you cannot - what can be different in our setup?
OK, I was able to reproduce this on a consistent basis by tearing down and spinning up the project and network connections at the same time. I was hoping this was something that could be controlled by locking SQL instance operations based on the project name, but I don't think that's the case. At this point, since it's only reproducible when other non-SQL resources are being created, it's not possible to identify this situation from within Terraform, so I'm not sure there's a good way to guard against it. I'll try to get a bug updated on the SQL API to see if they can provide retries or a better error for this.
Update - I've been talking with the private networking team and they are working on a fix for this. They let me know that this is happening because there is an entry that gets set up the first time any private networking feature is turned on for a project/network. Creating the 2 instances at the same time causes a collision setting up this singleton, so if you are able to set up 1 resource that uses private networking before creating others in parallel you should be able to work around this issue.
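For anyone who wants to try the workaround described above outside of Terraform, here is a rough sketch based on the gcloud command posted earlier in this thread. It creates one private-IP instance first and blocks until that finishes, then creates the remaining instances in parallel. The function names `create_instance` and `create_all` and the `GCLOUD`, `PROJECT`, and `NETWORK` variables are made up for illustration, and the project and network values are the placeholders from the earlier script. It also assumes that `gcloud beta sql instances create` without `--async` waits for the operation to complete, so treat it as a starting point rather than a verified fix.

```shell
#!/usr/bin/env bash
# Sketch only: serialize the FIRST private-networking resource, then parallelize the rest.
set -euo pipefail

# Placeholder values taken from the earlier script in this thread; override as needed.
PROJECT="${PROJECT:-some-project-name}"
NETWORK="${NETWORK:-network}"
# Hypothetical hook: set GCLOUD="echo gcloud" to print the commands instead of running them.
GCLOUD="${GCLOUD:-gcloud}"

# Create a single instance named gcloud-test-<suffix>.
# No --async here, on the assumption that the command then blocks until creation ends.
create_instance() {
  local suffix="$1"
  $GCLOUD beta sql instances create "gcloud-test-${suffix}" \
    --database-version MYSQL_5_7 --tier db-f1-micro --region us-central1 \
    --network "$NETWORK" --project="$PROJECT"
}

# Create the first instance alone, then the remaining ones in parallel.
create_all() {
  local first="$1"
  shift
  create_instance "$first"   # the first private networking setup happens here, alone
  local suffix
  for suffix in "$@"; do
    create_instance "$suffix" &   # the rest run concurrently in the background
  done
  wait   # block until every background creation has returned
}

# Example call (commented out so that sourcing this file creates nothing):
# create_all 774yhf5 59swec6
```

Setting `GCLOUD="echo gcloud"` before calling `create_all 774yhf5 59swec6` gives a dry run that only prints the commands, which is a cheap way to check the ordering before touching a real project.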
Suffering from this issue as well. I posted a question and workaround on Stack Overflow: https://stackoverflow.com/questions/55990713/how-to-fix-an-unknown-error-occurred-when-creating-multiple-google-cloud-sql-i/55991852#55991852.
Can we get an update on this? We are running into it regularly when setting up databases for multiple environments and we have to do two separate terraform runs to work around this. The delay workaround does not really work in our case as we are using a module for cloudsql and you cannot have one module wait on the other (at least not in a simple non-hacky way).
Hi, any updates on this...?
Sorry, no update at this point. The upstream team is still working on it, and I'll update if I see that anything has been resolved.
+1 for looking for a fix for this.
Is there an update on this? I have been trying to debug an issue which turned out to be this.
At a minimum, can a note be placed in the docs alerting devs to this shortcoming?
Hi team, any updates? This issue has been open for 4 years and is apparently still relevant to this day :)
@djboboch would you be able to share debug logs from a failing run? (Sanitized, since they'll be in public). This has been around a while, so I'd like to ensure we're still talking about the same issue.
It sounds like this is not SQL-specific, but may be related to private networking - so it should probably go to whichever team owns that.
Terraform Version
Terraform v0.11.11
Affected Resource(s)
Terraform Configuration Files
Debug Output
https://gist.github.com/yuvaldrori/034fd15acff47edf83af77dea885fa36
Panic Output
Expected Behavior
All resources should have been created successfully. If you set the variable `count = 1`, it succeeds.
Actual Behavior
Only one Cloud SQL instance gets created successfully.
Steps to Reproduce
terraform apply
Important Factoids
Tried a similar script with one Cloud SQL instance and one GKE cluster, and with many GKE clusters with private IPs: same results.
References
b/261385017