tmobile / pacbot

PacBot (Policy as Code Bot)
https://tmobile.github.io/pacbot/
Apache License 2.0

Pacbot keeps on increasing AutoScaling #340

Open luizfelipepg-zz opened 5 years ago

luizfelipepg-zz commented 5 years ago

Hey, I noticed that PacBot keeps increasing the Auto Scaling group, and in less than a week I have about 12 m4.xlarge instances running. Is that normal? Is there any way to reduce that?

Aha! Link: https://t-mobile1t-mobile.aha.io/features/PM-517

johnakash commented 5 years ago

@luizfelipepg, those instances are launched by the AWS Batch service to process your data. If the data volume is large, AWS Batch launches instances accordingly.
Please note that these instances are terminated automatically once they finish executing their batch jobs.
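
For readers wondering how far that scaling can go: the upper bound is set by the vCPU limits on the Batch compute environment. The sketch below reads those limits through a boto3 Batch client passed in by the caller and turns a vCPU ceiling into an instance count. The function names are hypothetical, and it assumes the environment launches only m4.xlarge instances (4 vCPUs each), which is what this thread reports but is not confirmed for every deployment.

```python
def vcpu_limits(batch_client):
    """Map each compute environment name to its min/desired/max vCPU settings.

    `batch_client` is expected to behave like boto3.client("batch").
    """
    limits = {}
    kwargs = {}
    while True:
        page = batch_client.describe_compute_environments(**kwargs)
        for env in page.get("computeEnvironments", []):
            resources = env.get("computeResources", {})
            limits[env["computeEnvironmentName"]] = {
                "min": resources.get("minvCpus"),
                "desired": resources.get("desiredvCpus"),
                "max": resources.get("maxvCpus"),
            }
        token = page.get("nextToken")
        if not token:
            return limits
        kwargs["nextToken"] = token


def instance_ceiling(max_vcpus, vcpus_per_instance=4):
    """Upper bound on instances of one size (assumption: 4 vCPUs, as on m4.xlarge)."""
    return max_vcpus // vcpus_per_instance


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   for name, limit in vcpu_limits(boto3.client("batch")).items():
#       print(name, limit, "max instances:", instance_ceiling(limit["max"]))
```

If the reported ceiling is higher than you are comfortable with, lowering `maxvCpus` on the compute environment is one way to cap it. Whether PacBot's installer later resets that value is not something this thread establishes.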

luizfelipepg-zz commented 5 years ago

@johnakash, thanks. I suspected that was the case, but the instances are not terminating, or at least they don't appear to be. I have a few instances that were created days ago and never terminated. I have set up only 1 account for now, and I am afraid to add the other 18 I have. Might this create hundreds of instances? :P

johnakash commented 5 years ago

@luizfelipepg, could you please let us know which jobs are running continuously in AWS Batch? To get the job details: log in to the AWS console >> navigate to the Batch service >> select Jobs >> select the queue pacbot-data.
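
The same information can be pulled from a script instead of the console. This is a minimal sketch, assuming a boto3 Batch client is passed in and that the queue is named `pacbot-data` as in the comment above. The helper names are hypothetical.

```python
import time


def running_jobs(batch_client, job_queue="pacbot-data"):
    """Return summaries of every RUNNING job in the queue, following pagination."""
    jobs = []
    kwargs = {"jobQueue": job_queue, "jobStatus": "RUNNING"}
    while True:
        page = batch_client.list_jobs(**kwargs)
        jobs.extend(page.get("jobSummaryList", []))
        token = page.get("nextToken")
        if not token:
            return jobs
        kwargs["nextToken"] = token


def job_ages_hours(jobs, now_ms=None):
    """Return (jobName, jobId, age in hours) tuples, oldest first.

    Batch reports timestamps in milliseconds since the epoch. `startedAt`
    can be missing, in which case `createdAt` is used instead.
    """
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    ages = [
        (job["jobName"], job["jobId"],
         (now_ms - job.get("startedAt", job["createdAt"])) / 3_600_000)
        for job in jobs
    ]
    return sorted(ages, key=lambda item: item[2], reverse=True)


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   for name, job_id, hours in job_ages_hours(running_jobs(boto3.client("batch"))):
#       print(f"{hours:8.1f} h  {name}  {job_id}")
```

Sorting by age puts jobs that have been running for days at the top, which are the ones worth reporting here.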

luizfelipepg-zz commented 5 years ago

Yep, here is what it looks like: [image: Screen Shot 2019-10-14 at 12 16 35 pm]. Here are both queues: [image: Screen Shot 2019-10-14 at 12 19 38 pm]. And here it shows that we already have 27 instances running: [image: Screen Shot 2019-10-14 at 12 16 51 pm]

johnakash commented 5 years ago

@luizfelipepg, could you please navigate to Jobs in AWS Batch (make sure you are using pacbot-data as the job queue) and grab a screenshot?

luizfelipepg-zz commented 5 years ago

I am sending the RUNNING list only; let me know if you need any others.

You can see a lot of Data Collector jobs:

[image: Screen Shot 2019-10-15 at 11 00 45 am] [image: Screen Shot 2019-10-15 at 11 00 59 am]

There are a lot more Data Collector jobs; I just sent the start and the end of the list.

luizfelipepg-zz commented 5 years ago

I am already up to 40 instances :/

luizfelipepg-zz commented 5 years ago

By the way, I noticed that for the ones that are running, the logs keep showing:

2019-10-16 20:22:54 [main] DEBUG c.t.c.p.i.InventoryFetchOrchestrator - Transfer % Completed :0.0

luizfelipepg-zz commented 5 years ago

OK, I decided to terminate all the batch jobs that had been running for a long time, and that dropped all the instances. Let's see how it goes from here. Any information on what could have happened and how to prevent it?
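
For anyone who needs to repeat this cleanup, the manual step can be scripted. The sketch below is hypothetical: it assumes a boto3 Batch client is passed in, treats any RUNNING job older than a caller-chosen threshold as stale, and defaults to a dry run so nothing is terminated by accident. It only treats the symptom and does not address why the jobs hang.

```python
import time


def terminate_stale_jobs(batch_client, job_queue, max_age_hours,
                         now_ms=None, dry_run=True):
    """Terminate RUNNING jobs older than max_age_hours; return their job IDs.

    With dry_run=True (the default) the stale jobs are only reported.
    """
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    cutoff_ms = now_ms - int(max_age_hours * 3_600_000)

    stale = []
    kwargs = {"jobQueue": job_queue, "jobStatus": "RUNNING"}
    while True:
        page = batch_client.list_jobs(**kwargs)
        for job in page.get("jobSummaryList", []):
            # startedAt can be absent; fall back to createdAt (both in ms).
            started_ms = job.get("startedAt", job["createdAt"])
            if started_ms < cutoff_ms:
                stale.append(job["jobId"])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    if not dry_run:
        for job_id in stale:
            batch_client.terminate_job(
                jobId=job_id,
                reason=f"Running longer than {max_age_hours} hours",
            )
    return stale


# Usage (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   print(terminate_stale_jobs(boto3.client("batch"), "pacbot-data", 6))
```

The 6-hour threshold in the usage line is only an example. Pick a value comfortably above how long a healthy collection run takes in your account.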

luizfelipepg-zz commented 5 years ago

OK, another one started and seems to have a similar problem. I can see this error in the logs:

[main] ERROR c.t.c.p.i.InventoryFetchOrchestrator - Delete Failed com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 6F0E1817BD592070; S3 Extended Request ID: vMzvm3LpBFUxMe1t4QdxZckvBdrIf2F2uKd15TuQcqQEH37w0fsTfrXYU8DUs2aMrESVZbiUaUA=)

Just after this, I can see these lines:

2019-10-16 22:03:03 [main] INFO c.t.c.p.i.InventoryFetchOrchestrator - End : Backup Current Files
2019-10-16 22:03:03 [main] INFO c.t.c.p.i.InventoryFetchOrchestrator - Start : Upload Files to S3
2019-10-16 22:03:04 [main] INFO c.t.c.p.i.InventoryFetchOrchestrator - Uploading files to bucket: xxxxxx  folder: inventory
2019-10-16 22:03:07 [main] DEBUG c.t.c.p.i.InventoryFetchOrchestrator - Transfer % Completed :0.0
2019-10-16 22:03:10 [main] DEBUG c.t.c.p.i.InventoryFetchOrchestrator - Transfer % Completed :0.0
2019-10-16 22:03:13 [main] DEBUG c.t.c.p.i.InventoryFetchOrchestrator - Transfer % Completed :0.0
2019-10-16 22:03:16 [main] DEBUG c.t.c.p.i.InventoryFetchOrchestrator - Transfer % Completed :0.0
2019-10-16 22:03:19 [main] DEBUG c.t.c.p.i.InventoryFetchOrchestrator - Transfer % Completed :0.0
.
.
.
.
.
.

(I replaced the real bucket name with xxxxxx.)

The "Transfer % Completed" line keeps repeating forever.
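
A stuck upload like this can be spotted automatically from the job's log output. The sketch below is a hypothetical helper: it assumes the log lines have already been fetched as plain strings in the format shown above, and it flags a transfer as stalled when the last few progress readings are identical and below 100.

```python
import re

# Matches the progress value in lines such as
# "... InventoryFetchOrchestrator - Transfer % Completed :0.0"
PROGRESS = re.compile(r"Transfer % Completed :(\d+(?:\.\d+)?)")


def progress_readings(lines):
    """Extract the progress percentages, in order, from raw log lines."""
    readings = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            readings.append(float(match.group(1)))
    return readings


def looks_stalled(lines, min_repeats=5):
    """True if the last `min_repeats` readings are equal and below 100."""
    readings = progress_readings(lines)
    if len(readings) < min_repeats:
        return False
    tail = readings[-min_repeats:]
    return tail[0] < 100.0 and all(value == tail[0] for value in tail)
```

With the five readings quoted above (all 0.0), `looks_stalled` returns `True`. The readings arrive about three seconds apart in this log, so in practice `min_repeats` should be set high enough that a slow but healthy upload is not flagged.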

luizfelipepg-zz commented 5 years ago

@johnakash

RaidonCloud commented 5 years ago

I had a similar issue. I just deployed PacBot in a test AWS account where I have nothing other than the VPC created for PacBot.

Surprisingly, I started seeing m4.xlarge instances being created after a successful deployment.

The count goes to 40+ and keeps increasing, even though I don't have anything else in the AWS account where PacBot is deployed. Is this some kind of hack, with the code using my AWS resources for other purposes?

I ended up terminating the PacBot deployment.

Any suggestions?
