snowflakedb / SnowAlert

Security Analytics Using The Snowflake Data Warehouse
Apache License 2.0

Operation Logs from Azure Active Directory connector not seen in Snowflake table #414

Open Chaitali-Sonparote opened 4 years ago

Chaitali-Sonparote commented 4 years ago

Hello Team,

I created a connection using the Azure Active Directory connector for Operation Logs. The connection was established successfully, but the operation logs were not loaded into the Snowflake table.

I have enabled operation logs for the key vault following this document (kindly let me know if this is correct): https://docs.microsoft.com/en-us/azure/azure-monitor/platform/diagnostic-settings

The structure created in the Azure Storage Account for the operation logs is:

Container name: insights-logs-auditevent
Folder structure: resourceId=/SUBSCRIPTIONS//RESOURCEGROUPS//PROVIDERS/MICROSOFT.KEYVAULT/VAULTS//y=2020/m=07/d=14/h=05/m=00/logfilename.json

Then I looked at the GET_TIMESTAMP_FROM_FILENAME_SQL expression in azure_log.py, and the mechanism it uses to extract the timestamp does not handle this folder structure.

Hence, I tried the following expression instead; it worked and the data was loaded into the Snowflake table:

GET_TIMESTAMP_FROM_FILENAME_SQL = {
    'operation': r'''to_timestamp_ltz(
        split_part(split_part(metadata$filename, '/', 10), '=', 2) || '-' ||
        split_part(split_part(metadata$filename, '/', 11), '=', 2) || '-' ||
        split_part(split_part(metadata$filename, '/', 12), '=', 2) || 'T' ||
        split_part(split_part(metadata$filename, '/', 13), '=', 2) || ':' ||
        split_part(split_part(metadata$filename, '/', 14), '=', 2)
    )''',
}
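
For reference, here is a small illustrative query (not part of azure_log.py) showing which '/'-separated positions the expression targets. It uses the redacted example path as a session variable; the empty // segments, which hold the subscription, resource group, and vault names in a real path, each count as one part:

SET example_path = 'resourceId=/SUBSCRIPTIONS//RESOURCEGROUPS//PROVIDERS/MICROSOFT.KEYVAULT/VAULTS//y=2020/m=07/d=14/h=05/m=00/logfilename.json';

SELECT
    split_part($example_path, '/', 10) AS y_part,    -- 'y=2020'
    split_part($example_path, '/', 11) AS mon_part,  -- 'm=07'
    split_part($example_path, '/', 12) AS d_part,    -- 'd=14'
    split_part($example_path, '/', 13) AS h_part,    -- 'h=05'
    split_part($example_path, '/', 14) AS min_part;  -- 'm=00'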

Kindly let me know if this approach is correct.

Thanks, Chaitali Sonparote

sfc-gh-afedorov commented 4 years ago

Thanks, yes, it looks like that double-split is indeed more robust across the various ops logs. Would you like it to be part of our codebase? If so, I can double-check that it works against the logs we are ingesting and include it in our next release. Regardless, I'll ask that we prioritize releasing the REGEXP support that's commented out below -- we're using it internally on a lot of these ingestions and it simplifies the code a little bit.
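
(For context, a REGEXP-based version of the same extraction could look roughly like the sketch below. This is only an illustration, assuming Snowflake's REGEXP_SUBSTR with the 'e' group-extraction flag; it is not the code that is actually commented out in azure_log.py.)

-- Sketch only, not the commented-out azure_log.py code: pull each date
-- component out of metadata$filename with REGEXP_SUBSTR group capture.
to_timestamp_ltz(
    regexp_substr(metadata$filename, 'y=(\\d+)', 1, 1, 'e', 1) || '-' ||
    regexp_substr(metadata$filename, '/m=(\\d+)', 1, 1, 'e', 1) || '-' ||
    regexp_substr(metadata$filename, 'd=(\\d+)', 1, 1, 'e', 1) || 'T' ||
    regexp_substr(metadata$filename, 'h=(\\d+)', 1, 1, 'e', 1) || ':' ||
    regexp_substr(metadata$filename, '/m=(\\d+)', 1, 2, 'e', 1)  -- 2nd '/m=' is the minute
)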

Not related to this piece, but worth noting: right now, the cost of ingestion once the flow starts is not as optimized as it could be (the time it takes to MERGE data into the destination increases linearly with the size of the destination table). If you'd like, I can describe an optimization we found recently but haven't yet implemented, or, once we launch support for streams on top of external tables, you can migrate to an INSERT-only model that's fully optimal.
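
Roughly, the contrast is the following (an illustrative sketch only, not the actual SnowAlert ingestion code; the names azure_operation_ext, operation_connection, and the join key are placeholders):

-- MERGE-based flow: Snowflake has to look for matches in the destination,
-- so the merge time grows with the size of operation_connection.
MERGE INTO operation_connection dst
USING (SELECT value AS raw FROM azure_operation_ext) src
ON dst.raw = src.raw
WHEN NOT MATCHED THEN INSERT (raw) VALUES (src.raw);

-- INSERT-only flow once streams on external tables are available: the
-- stream tracks which rows are new, so each run appends only the delta.
CREATE STREAM operation_stream
  ON EXTERNAL TABLE azure_operation_ext
  INSERT_ONLY = TRUE;

INSERT INTO operation_connection (raw)
SELECT value FROM operation_stream;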

Chaitali-Sonparote commented 4 years ago

Thanks for your response!

One correction on the folder structure though: it contains empty double splits. The folder structure is: resourceId=/SUBSCRIPTIONS//RESOURCEGROUPS//PROVIDERS/MICROSOFT.KEYVAULT/VAULTS//y=2020/m=07/d=14/h=05/m=00/logfilename.json

If the REGEXP support is going to solve the need to extract the timestamp for all three log types (audit, sign-ins, operation) for this connector, then it would be good to have it operational in the next release.

Thanks, Chaitali Sonparote

sfc-gh-afedorov commented 4 years ago

Not sure what you mean by "empty double splits", but I think counting the split_part(..., '/', ...) from the end might be more robust to the folder structure, as it doesn't assume anything about the depth, only that the inner-most structure matches /y=2020/m=07/d=14/h=05/m=00/logfilename.json. See, e.g.:

> SET filename='/SUBSCRIPTIONS//RESOURCEGROUPS//PROVIDERS/MICROSOFT.KEYVAULT/VAULTS//y=2020/m=07/d=14/h=05/m=00/logfilename.json';
+----------------------------------+
| status                           |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
1 Row(s) produced. Time Elapsed: 0.256s

> SELECT to_timestamp_ltz(
    split_part(split_part($filename, '/', -6), '=', 2) || '-' ||
    split_part(split_part($filename, '/', -5), '=', 2) || '-' ||
    split_part(split_part($filename, '/', -4), '=', 2) || 'T' ||
    split_part(split_part($filename, '/', -3), '=', 2) || ':' ||
    split_part(split_part($filename, '/', -2), '=', 2)
  ) t;
+-------------------------------+
| T                             |
|-------------------------------|
| 2020-07-14 05:00:00.000 +0000 |
+-------------------------------+
1 Row(s) produced. Time Elapsed: 0.724s

The REGEXP support is commented out in the code; I will update it as soon as we launch a Snowflake release that supports it.

Same re: streams on external tables — we're expecting to see this in a future release, but don't have a concrete timeline for when it'll make it through the process.

Chaitali-Sonparote commented 4 years ago

Thanks for the update on REGEXP support.

By "empty double split" I meant that the path resourceId=/SUBSCRIPTIONS//RESOURCEGROUPS//PROVIDERS/MICROSOFT.KEYVAULT/VAULTS//y=2020/m=07/d=14/h=05/m=00/logfilename.json has empty //. But that is completely fine as you are splitting the path from the end.

Thanks, Chaitali Sonparote

sfc-gh-afedorov commented 4 years ago

Yup, I think they're actually there to make counting slashes work, but just in case, counting from the end seems easy enough :-) We can launch this as part of v1.9.5 early next week.

Chaitali-Sonparote commented 4 years ago

Great! Thank you!!

Thanks, Chaitali Sonparote

Chaitali-Sonparote commented 4 years ago

Just wanted to confirm whether this issue is handled in v1.9.5. Could you please share a link to the code?

Thanks, Chaitali Sonparote

sfc-gh-afedorov commented 4 years ago

Not yet -- I updated the code here but haven't finished testing it yet. Apologies for the delay.

Chaitali-Sonparote commented 4 years ago

No problem, Thank you!

sfc-gh-afedorov commented 3 years ago

Fixed in 947c394; it will be merged early next week. Apologies again for this taking so long.