mbegan / Okta-Identity-Cloud-for-Splunk

Public REPO for splunkbase app
https://splunkbase.splunk.com/app/3682/

Recent experience on Splunk Cloud #30

Open s-m-p opened 3 years ago

s-m-p commented 3 years ago

Hello. First, I want to thank you for developing this addon. I have been running it in Splunk Cloud reliably for about a year now. The TA runs 21 inputs against 7 different Okta domains and is responsible for indexing about 15 million events per day. It provides a significant level of visibility and value to our organization.

I encountered a serious problem with the TA over the past few days which required me to open a support case. The first thing I wanted to ask is whether you plan to pass the TA through AppInspect to get it certified on v8.2. My support engineer warned me that our Cloud stack was just upgraded to 8.2, and wondered out loud whether my issue could be an incompatibility. I don't believe it is related, but I wanted to ask anyway.

Regarding the issue I had... I run the TA on our Inputs Data Manager (IDM). An IDM is essentially a Splunk-managed, cloud-based Heavy Forwarder. One evening, the TA logged an HTTP connection timeout against two of the seven domains that we use. The main error was: urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='ouroktahost.okta.com', port=443): Read timed out.

There's no way for me to know what triggered the timeout. It's cloud-to-cloud communication, and neither I, Splunk, nor Okta could find an explanation. One of the two inputs recovered on its own without any action. The other one did not. The problem input continued running on its normal schedule, but it would not pull back any events. What I eventually discovered is that before the timeout, the API HTTP request parameters for the problem input included "since", "limit" and "after". After the timeout, the API calls only included an "after" request parameter.
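For anyone reading along, here is a rough sketch of the two request shapes being described, written against the public Okta System Log API (/api/v1/logs). This is purely illustrative and not the TA's actual code; the host, token, and cursor values are placeholders.

```python
# Hypothetical illustration of the two request shapes observed above, not
# the TA's actual code. The parameter names (since, limit, after) match the
# Okta System Log API (/api/v1/logs); the token and host are placeholders.
import requests

OKTA_HOST = "https://ouroktahost.okta.com"
HEADERS = {
    "Authorization": "SSWS 00abc...",   # hypothetical API token
    "Accept": "application/json",
}

def get_logs(params):
    try:
        resp = requests.get(OKTA_HOST + "/api/v1/logs",
                            headers=HEADERS, params=params, timeout=60)
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.ReadTimeout:
        # The failure mode reported above: the read timed out mid-request.
        raise

# Healthy input (as observed before the timeout): bounded by a start time and
# page size, resuming from the saved cursor.
healthy_params = {"since": "2021-08-01T00:00:00Z",
                  "limit": 1000,
                  "after": "<saved cursor>"}

# Broken input (as observed after the timeout): cursor only, no time bound or
# page limit.
broken_params = {"after": "<saved cursor>"}

# events = get_logs(healthy_params)
```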

I opened a case with Okta and unfortunately, they would not support the addon. Basically they just noticed the change in request parameters and told me to figure out how to add the additional parameters. And that's where the major problem started. In order to recover this one specific input, I had to ask Splunk to:

  1. Remove the TA from the IDM and restart
  2. Install the TA on the IDM and restart
  3. Re-create all of the inputs

At first we tried uninstalling the TA and reinstalling without a restart in between, which left the KV Store intact, and that did not fix the issue. What eventually did work was to uninstall the TA, restart Splunk, install the TA, restart Splunk, and manually re-create the inputs. But that was not something the Splunk support organization was prepared to do. It took a lot of internal discussion and escalation to get them to do it.

Ultimately, I think what fixed it was that the restart between the uninstall/reinstall deleted the KV Store. I'm no python expert, but I studied the TA a bit and it looks to me like the KV store holds a checkpoint which tells the TA where to pick up from. This HTTP request timeout seems to have triggered a bad interaction between the TA and the KV Store which caused the API calls to be missing a couple of request parameters. Maybe clearing out the KV Store would have restored normal functionality, but we weren't able to figure out how to do that specific task on the fly.
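To make the hypothesis concrete, here is a tiny sketch of how a checkpoint record could drive the request parameters. The field names are made up and the real TA logic may well differ; this is only a guess at the interaction being described.

```python
# Purely a guess at the interaction, to illustrate the hypothesis above; the
# checkpoint field names here are invented, not taken from the TA.
def build_log_params(checkpoint, limit=1000):
    params = {}
    if checkpoint.get("since"):
        params["since"] = checkpoint["since"]
        params["limit"] = limit
    if checkpoint.get("after"):
        params["after"] = checkpoint["after"]
    # If the timeout left a checkpoint record with only the cursor populated
    # (e.g. {"after": "..."} and no "since"), every subsequent run would issue
    # the "after"-only requests observed above until the record is cleared.
    return params
```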

I'm not sure if this should be considered an "issue", but I wanted some mechanism to provide you some feedback on our recent experience with the TA. I was hoping you might have some thoughts on what might have caused the TA to remove some of the request parameters from the API calls after a request timeout.

Regardless, the TA is really valuable to us. Thanks for sharing it with the community.

mbegan commented 3 years ago

Glad you are finding it helpful and thank you for reporting the issue.

From the looks of things your hunch is right - clearing out the checkpoints from the KV store should resolve it. (There is an unexposed routine to do this in the input: https://github.com/mbegan/Okta-Identity-Cloud-for-Splunk/blob/b68a785c0cdc49a0be1db4f940b92634f94cd60b/bin/input_module_okta_identity_cloud.py#L805 )

Were the inputs that had issues log inputs or one of the other user/group/app inputs?

I have some fallback logic that is intended to recover from such a situation, but clearly there is a state you can get into that I'm not handling right.

s-m-p commented 3 years ago

I'm sorry, I neglected to mention that I'm referring to the Logs metric. We are also pulling the Apps and Users metrics from the same host. I don't believe those inputs were affected, but I don't monitor them as closely and they are also on a lower-frequency schedule. If it's important to you, I can spend a bit of time validating.

As I mentioned before, my python-fu is not that strong, nor have I developed any TAs for Splunk; I mostly just deal with the config files. Would you mind being a bit more explicit about that section of code you linked to? It seems you've incorporated another "zset" metric that could reset the checkpoints, but the code is commented out and I'm not sure how one would actually execute those calls from Splunk. Would I need to deploy a modified version of the TA to Cloud with those lines uncommented? How would I execute them?

s-m-p commented 3 years ago

Just a quick comment to say I looked at the apps and users metrics and those don't seem to have been disrupted. As I mentioned before, those two inputs are much lower volume than our logs input and are on a lower-frequency schedule. Our logs input is scheduled to run every 60 seconds, the apps input every 900 seconds, and our user input runs once daily.

mbegan commented 3 years ago

You are right, that zset snippet isn't super useful on its own; it was just a routine I would use in my dev env to clean things out and start from scratch. If you wanted to do something like that it would involve using curl or postman to interact directly with the KV store in Splunk. That is very much Splunk-foo that I shy away from telling people to do without really understanding it. My awkward reference to the zset routine was more to help you understand where in the KV store you would need to remove entries (basically the path and name of the checkpoint(s)).
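For reference, a curl/Postman-style interaction with the KV Store could look roughly like the sketch below (written in Python for consistency with the add-on). The Splunk REST endpoint pattern is standard, but the app folder and collection names here are assumptions; check default/collections.conf in the add-on for the real collection name(s), and treat this as an outline rather than a recipe.

```python
# Rough sketch of clearing checkpoint records via the Splunk KV Store REST
# API. The endpoint pattern is Splunk's generic one; the app folder and
# collection names below are ASSUMPTIONS, not taken from the add-on.
import requests

SPLUNK_MGMT = "https://localhost:8089"        # management port of the IDM/HF
AUTH = ("admin", "changeme")                  # or use a session/auth token
APP = "TA-Okta_Identity_Cloud_for_Splunk"     # assumed app folder name
COLLECTION = "okta_checkpoints"               # assumed collection name

base = (f"{SPLUNK_MGMT}/servicesNS/nobody/{APP}"
        f"/storage/collections/data/{COLLECTION}")

# 1. List the checkpoint records to find the one(s) for the broken input.
records = requests.get(base, auth=AUTH, verify=False).json()
print(records)

# 2. Delete a specific record by its _key, or DELETE `base` itself to wipe
#    them all and force every input under that account to start over.
# requests.delete(f"{base}/<record _key>", auth=AUTH, verify=False)
```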

As for the other (non-log) inputs: I think these inputs are highly misunderstood. These aren't "logs" related to users/groups/apps and the access or changes to those object types. These inputs are dumps of the user/group/app objects as they exist in the directory at the time the job runs. I point this out because I always tell customers that they shouldn't even bother using those inputs, or that they should use them very sparingly based on their needs.

The most common value those inputs provide is that they ingest the raw data parsed by a set of saved searches, which populate a handful of lookup tables, which in turn drive a handful of automatic lookups that can enrich log data at search time.

In the example I linked to above, we use the Okta ID user_id (which is actually the actor.id) from a log entry to look up the additional details of a user. This automatic augmentation of the log entry itself can be helpful in better understanding who the actor is.

I have never seen a customer use the raw directory data in any other meaningful way, and any attempts I've made to produce reports over the data have been exercises in futility. The data isn't time series data, and Splunk doesn't seem like the right tool to use against it.

With all of that as justification, I would highly suggest reducing the frequency of the non-log inputs to daily for users/groups and weekly for apps. Anything more frequent is just wasting API calls against Okta and space in Splunk (IMO).

mbegan commented 3 years ago

Oh yes.

I totally forgot that I do have another workaround that doesn't involve making any changes to the KV store or uninstalling the add-on etc. I've only had to do this a few times for odd situations like yours but it works and doesn't require anything special.

The KV store location for log data is driven by the "Account Name" you provide when configuring the domain and API token. If you create a new account (it could be the same API token or a new one; it doesn't matter, it just has to be valid), then disable the existing input(s) and recreate them referencing the new account, it will be just like starting over, because the KV location is based on the "Account Name".

If you do this, pay close attention to your "log history" setting. The log history setting is only relevant to log inputs the very first time they run, and it dictates how far in the past the input will collect data (from the moment it first runs). From that point forward the add-on uses the data in the KV store to pick up where it left off.

So if you noticed your input hadn't collected data for 14 days and you wanted to backfill, you'd set the "log history" setting to 14 (or 15 to prevent gaps and have a tolerable overlap) before you configured the new input.
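As an illustration of that behavior, here is a minimal sketch assuming the "log history" setting is a number of days; the helper below is hypothetical and not the add-on's code.

```python
# Illustration of the "log history" behavior described above, assuming the
# setting is a number of days (hypothetical helper, not the TA's code).
from datetime import datetime, timedelta, timezone

def first_run_since(log_history_days):
    """Start of the collection window on the very first run of a new input."""
    start = datetime.now(timezone.utc) - timedelta(days=log_history_days)
    return start.strftime("%Y-%m-%dT%H:%M:%SZ")

# Input stopped collecting 14 days ago: 15 gives a small, tolerable overlap
# instead of risking a gap. Only the first run uses this; afterwards the
# add-on resumes from the KV store checkpoint.
print(first_run_since(15))
```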

mbegan commented 3 years ago

One more thing while I'm on this topic (I'll probably use this bug report as the basis for a blog post or something).

The search time log enrichment I mentioned above (auto lookups) is great, but I've also had a handful of customers express concern about the burden it puts on their Splunk infra.

The first potential problem is that the lookup tables have to be sync'd across search heads, and the mechanism Splunk uses to do this is sensitive to the size of the bundle that gets sync'd.

Fun fact: the Splunk app validation tool won't let me include a saved search that is disabled in a package.

In those cases I usually have a customer truncate the larger or less useful lookup files and disable the associated saved searches so they don't get populated again.

The second problem is the added overhead the auto lookups themselves put on searches. I haven't been able to quantify it and it doesn't come up often, but every now and then customers complain about search performance after installing the TA on their search heads.

s-m-p commented 3 years ago

Sorry for my delayed response. Thank you for all this additional insight and background.

The context around the apps/users/groups inputs makes perfect sense, and I have toned down the schedule on one of them. Count me in with those who were affected by the size of the resulting lookups: I had to tune the saved searches and the resulting lookup sizes to prevent impact to our Splunk platform. I often wondered whether our Okta team found any value in those lookups - no one ever mentioned that I never even created the groups input. I will likely turn them off completely.

I do have one point of feedback on the lookup-generating saved searches. The default time boundary for those searches evaluates to "All Time". In an environment like ours, that can be (and was) a very heavy search load. A more sensible default, something like Last 24 Hours, would reduce potential search load and let the end user widen it to All Time if they really want to. My 2 cents.

And thanks for your comments about the KV store. With a bit more time, I could have constructed a curl request to clear out the KV store lookup; I just didn't have the programming skills to come up with it on short notice. But your idea of simply creating a new Account Name is brilliant, and I am slapping myself in the head for not thinking of that myself. That is a great idea, thanks for mentioning it!

mbegan commented 3 years ago

Re: default time range of "All Time" for saved searches.

That is the very rub of using objects like users/groups/apps that aren't time series data (streams of events) in a system like Splunk that processes time series data.

Users/Groups/Apps didn't "occur" at a time - they simply exist. They may have been created or updated at a certain time - for which there will be a log entry ingested by the log metric - but the object itself exists.

So to make sure I get a complete list of users, the time range has to be "All Time", or I'd start to exclude users/groups/apps that were ingested outside the search window.

s-m-p commented 3 years ago

I can understand your perspective. I guess mine is a little different. Using users as an example, it seems the approach you took when you built the addon was to compile a complete list of users over the entire history of data. Instinctively, I expected a complete list of users only since the last execution of the input (based on my understanding of your explanation that the input generates a complete list every time it runs). I guess neither approach is right or wrong. But maybe people using this addon might avoid some of the search load issues we mentioned if the defaults reflected my assumptions.

But it's your addon, I'm just giving a different perspective. Either way, it's well written and valuable. Thanks again for sharing it.

mbegan commented 3 years ago

This is less about perspective or even opinion, and more about me just dealing with the capabilities of the APIs and my limited understanding of Splunk ;-)

The different APIs that supply the Users/Groups/Apps have some differences in their collection behavior and capabilities.

Users and Groups are both a full collection on first run and then "deltas" (based on last updated) going forward.

Apps are a full dump each time the input runs (it also happens to produce the most API calls: Number of Apps * Number of Users assigned / 200).
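A back-of-the-envelope version of that formula, with made-up numbers purely for illustration (reading 200 as the page size per API call and "Number of Users assigned" as assignments per app):

```python
# Illustrative arithmetic only; the org sizes below are invented.
apps = 50                 # hypothetical number of apps in the org
users_per_app = 10_000    # hypothetical users assigned per app
calls_per_run = apps * users_per_app / 200
print(calls_per_run)      # 2500.0 API calls every time the apps input runs
```

At that rate, running the apps input weekly instead of daily is the difference between roughly 2,500 and 17,500 API calls per week.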

I guess if you wanted to optimize this you'd probably do something like:

Set up an index dedicated to the users and groups metrics, periodically deduping the data stored in that index by id so you don't have to "carry" around hundreds of copies of a user object as it changes over time. You'd still need to keep the saved searches that search the users/groups objects looking at "all time", but the data should be optimized (maybe?).

Then use another index for storing the app metric data, and maybe wipe it out every few weeks since it is going to be duplicated repeatedly. You'd still need to keep the saved searches looking at "all time" because I tell Splunk to use the lastUpdated time from the app as the event's time.

Does this "indexing" thing make any sense?

Thanks for the feedback - it is invaluable.

s-m-p commented 3 years ago

Yes, I understand exactly what you are describing from a pure Splunk perspective. And now that you've educated me a bit on how the API works, it does alter things, assuming I understand correctly. If the app metric is a complete dump every time it runs, then I would suggest setting DATETIME_CONFIG = CURRENT instead of using the timestamp from the lastUpdated field. As you said, the input would still produce the full list of apps each time it runs - the only difference is what time boundary you need to use to find them. The advantage is that it would allow you to change the time boundary of the lookup-generating search to something shorter than All Time. And you could still use the lastUpdated field in a search if you needed to, since it's in the raw event.

I also agree with you on users/groups - you'd probably want to set up a summary or alternate index to optimize the retrieval of those events.