"All event data is stored written to an Azure Databricks for PostgreSQL managed Hyperscale (Citus) database" should read as "All event data is stored written to an Azure DATABASE for PostgreSQL managed Hyperscale (Citus) database" - Instructor Guide
@midomsft good catch. Fixed :)
I'm assuming you want us to skip sections that say things like: [insert your custom workshop content here...]
Diagrams in the student and instructor guides are clear, and very nicely put together. Thumbs up!
My only question is around why Data Lake is not specified as Gen1 or Gen2.
Correct "Postres" to "PostgreSQL" in WDS deck, slide 19. :-)
In the hands-on lab, Task 6: Deploy Azure Database for PostgreSQL: you need to specify which type of PostgreSQL instance they should choose: Single server or Hyperscale (Citus) server group - PREVIEW.
In the prep for the lab, this part confused me:

> Create a variable to hold your resource group name. This will be used when creating other resources. Replace SUFFIX with your Microsoft alias, initials, or another value to ensure uniqueness.
>
> `resourcegroup=hands-on-lab-SUFFIX`
>
> Create a variable to hold your resource group location name. Replace the westus location with a location closest to you. This same location will be used when provisioning other Azure resources.
>
> `location=westus`
>
> For a list of valid location names, execute: `az account list-locations -o table`
>
> Enter the following command to create a subscription group: `az group create --name $resourcegroup --location $location`
Instead of creating the variables, because it didn't work, I just hardcoded the values when creating the resource group. I've never used Bash before, so I probably did something wrong with the variable creation. But what I did was just define the variable like this and press enter:
variablename = value
I went back to try it again, and did not include the spaces in my variable definition. I think it is worthwhile to put a note in there about that as many of us come from different programming languages and are used to that. Anyway, it works now.
Try without spaces.
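For anyone else who hits this, here's a minimal sketch of the Bash rule involved (the SUFFIX value below is just an example):

```bash
# Bash treats spaces around '=' as a command invocation, so this fails:
#   resourcegroup = hands-on-lab-jdoe   # error: resourcegroup: command not found
# Assignments must have no spaces around '=':
resourcegroup=hands-on-lab-jdoe
location=westus

# The variables then expand as expected in later commands:
az group create --name "$resourcegroup" --location "$location"
```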
Hey, just completed the "Before the HOL" section - all works fine, great job :-)
One question - on task 8 it says "From your LabVM, download the starter files by downloading a .zip copy of the Cosmos DB real-time advanced analytics GitHub repo..."
Did I miss a step where you have to create a LabVM? I didn't notice another prerequisites section at all (could be me, of course!) - does a user need to create a VM, or can they use their own laptop?
Is it normal that task 8 asks to redirect to the "Cosmos DB real-time advanced analytics" GitHub repo?
In Exercise 1, I was not able to connect to the instance because of the firewall. I needed to add a rule allowing all IPs to connect (0.0.0.0 to 255.255.255.255) because it recognized neither my public IP nor my private IP. I'm connecting through the MS corp network.
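For reference, here's one way to add that allow-all rule from the CLI, assuming a single-server instance (the resource group and server names are placeholders; for a Hyperscale (Citus) server group the equivalent setting lives in the portal, and an allow-all rule is a lab-only convenience, not production guidance):

```bash
# Allow-all firewall rule for an Azure Database for PostgreSQL single server.
az postgres server firewall-rule create \
  --resource-group hands-on-lab-SUFFIX \
  --server-name wwi-postgres-SUFFIX \
  --name AllowAll \
  --start-ip-address 0.0.0.0 \
  --end-ip-address 255.255.255.255
```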
The Exercise 3 (Kafka setup) instructions in the Hands-on Lab say:
"Within the folder, open the appconfig.json file in a text editor, such as Notepad."
However the actual file (and screenshot in the manual) show appsettings.json
The instructions probably need updating to refer to the correct file.
Hey, reaching a breaking error in Exercise 3. I've changed the appsettings.json, but when I try to run KafkaProducer.exe I get the error attached. Should there be a lib/[dlls] folder included in the app source, or have I missed a prereq somewhere? I've looked but can't see anything.
I'm running the windows version of the app.
@midomsft Sorry about that. I've just committed an update that includes all the missing files, if you want to download the new version and try again.
Thanks @joelhulen that's perfect, now working :-)
Ok completed both the instructor guide and hands on lab walk through now. Aside from a couple of issues (responded to above) this all went smoothly. Excellent job!
Great, and thank you! I will address all non-breaking issues once the review process is complete.
I'm going to skip this task and come back to it when I get an answer, but I know other things will not happen unless it is fixed. Details below.
Task 3, Step 8 is giving me an insufficient privileges error:
`ayman@Azure:~$ az ad sp create-for-rbac -n "wwi-oss-sp" --role "Storage Blob Data Contributor" --scopes /subscriptions/XXXXXXXXXXXXXXXXXXXXXX/resourceGroups/hands-on-lab-ayelghaz/providers/Microsoft.Storage/storageAccounts/wwiadlsayelghaz`

Output:

> Changing "wwi-oss-sp" to a valid URI of "http://wwi-oss-sp", which is the required format used for service principal names
> Found an existing application instance of "2b7a55ff-62f9-49d6-8324-c236c362c9a2". We will patch it
> Insufficient privileges to complete the operation.
@thesqlpro Are you using your own pay-as-you-go or MSDN-based Azure subscription, or a sponsored subscription provided by Microsoft? Please look at the Requirements listed in the Before the Hands-on Lab document. You must have permissions within your Azure subscription to create an App Registration and service principal within Azure Active Directory.
@joelhulen I'm using our internal MS subscription. I've done app registration before in AAD. I'll look into repeating these steps, but I don't know if my MSDN subscription still works.
@joelhulen I just skipped the step and completed the rest of the lab. I manually inserted data into the Postgres events table to test out my dataframe and the PowerBI report. It wasn't streaming because of that step that failed due to permissions (our exchange above). I will try to reproduce this lab if I can get my MSDN working properly.
Overall, the lab is easy to follow. Clean code. Everything works as expected. I really enjoyed working on this and learning. The comments in the notebook are easy to follow.
Thank you for including me in this!!
Great work @joelhulen . I'm still going through the content but wanted to add in feedback on the trainer / student guides and the outline sections. I'll go through the hands on labs tomorrow and then add feedback on that as a separate comment.
I'm adding comments as checkboxes to make it easy for you to address.
Reason: Pre-aggregated data seems a bit too specific for this stage of the discussion.
[x] Section 5: Change header to advanced dashboards.
[x] ~~They would like a way to more rapidly create reports and be able to display them on a dashboard that can be customized and show real-time updates.~~ They would like a way to create visually rich reports and be able to display them on a dashboard that can be customized and show real-time updates.
[x] 1. Paragraph 2: Because of this, the query engine parallelizes and routes incoming SQL queries across these servers for faster response times on large datasets.
[x] 3. Partitioning is the key to high performance and being able to scale out across several database nodes, as it allows you to break data up further into smaller chunks based on time windows. One of the keys to fast data loading is to avoid using large indexes. Traditionally, you would use block-range (BRIN) indexes to speed up range scans over roughly-sorted data. However, when you have unsorted data, BRIN indexes tend to perform poorly. Partitioning helps keep indexes small. It does this by dividing tables into partitions, avoiding fragmentation of data while maintaining smaller indexes. In addition, it allows you to query only a smaller portion of the data when you run queries for particular time windows, leading to faster SELECT performance.
[x] 1. I'd add some more content about pg_cron. Maybe add this after "WWI will create rollup aggregation functions that can be scheduled to run on a periodic basis using pg_cron." Part to add: pg_cron is a simple cron-based job scheduler for PostgreSQL that runs inside the database as an extension. That way, you don't need a separate server to schedule cron jobs within your database. As it uses the same syntax as regular cron, it is simple to schedule execution in a number of ways, such as minutely, hourly, daily, weekly, monthly, on a given day of the week, etc. (See the pg_cron sketch after this list.)
[x] 3. We are using ~~hll_has_bigint~~ `hll_hash_bigint` to hash the HLL columns device_id and session_id
I don't think we need to get into coordinator and workers at this stage.
[x] 2. The sharding logic is automatically handled for you by the Hyperscale (Citus) server group.
[x] 2. All of this can be done with minimal schema changes, and potentially no ~~and~~ code changes.
[x] 4. The Hyperscale option uses Citus to provide the distributed data storage ~~capabilities like sharding and various partitioning plans.~~ capabilities like sharding, query routing and query parallelization. However, it ~~was~~ is built ~~to be~~ as an extension of PostgreSQL...
Migrating our PostgreSQL databases to Azure with the Hyperscale server group (Citus) deployment option...
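Since one of the items above asks for more pg_cron content, here's a minimal sketch of what that scheduling could look like. The `$POSTGRES_CONNECTION` variable and the `compute_rollups()` function name are placeholders, not the lab's actual names:

```bash
# Illustrative pg_cron usage; assumes the extension is available on the server.
psql "$POSTGRES_CONNECTION" <<'SQL'
-- pg_cron runs inside the database as an extension, so no separate
-- scheduler server is needed.
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Standard cron syntax: run the (hypothetical) rollup function every 5 minutes.
SELECT cron.schedule('*/5 * * * *', 'SELECT compute_rollups()');
SQL
```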
@joelhulen, really well done! It's straightforward, with no pitfalls. Just a typo in the pre-lab, where the title Task 3 is used twice.
Preparation:
@miyamam, the trainer guide explains the customer's struggles: they have multiple tenants, and they need to shard the database to make queries run faster.
@elsantorini But WWI's current data size is so small that a single instance is enough.
@miyamam I agree about data size, but since the Citus-specific capabilities of the Hyperscale version are the things being highlighted and used in the lab, it is a requirement for successful completion of the lab, and for the scenario. We could provide a much larger test database, but that would be an unnecessarily large dependency for the lab.
So, I went through the prep for HoL and had the following comments.
@samay-sharma I've promised to hold off on commenting until after the review ends, but you bring up some good points (and you're currently blocked).
Thanks @joelhulen . So, I tried to complete the Hands on Lab but had a few issues:
While executing Task 3, Step 8, I got "Insufficient privileges to complete the operation." It looks like my subscription does not have the permissions required to create an App registration and service principal within Azure Active Directory to complete this task. Is that correct? If so, I'll look to get this addressed on my end.
While trying to create a Lab Cluster as per Exercise 2, Task 5, I got an error. The error was:
Time: 2019-06-26 17:01:01 PDT
Message: Cluster terminated. Reason: Cloud Provider Launch Failure
A cloud provider error was encountered while launching worker nodes. See the Databricks guide for more information.
Azure error code: ServerUnreachable
Azure error message: Azure API server unreachable, please try again later.
I don't know if this is something I did or if it is an issue. Can you help me with this?
For the Postgres / Citus pieces, I was able to work through all of them (just without data) and have a few inputs. I'll put those in my next comment.
Exercise 1 - Task 2: We have a description of partitioning which is the same as what we have in the Trainer Guide. Can you update this according to the feedback there as well?
Task 3: We run `create_distributed_table` in Task 2 but don't describe it. I think `create_distributed_table` should be in the Shard your tables section and not in Task 2.
I also think we should include a couple of lines to explain what `create_distributed_table` does. We can say that we are sharding each of the tables on the customer_id column: "This is done by calling the create_distributed_table function. When you run this function, Citus inserts metadata marking the table as distributed and creates shards on the worker nodes. Incoming data for these tables is then routed to the right node based on customer_id."
Then, we can continue with:
Because we are sharding on the same ID for our raw events table and rollup tables, the data stored in both types of tables is automatically co-located for us by Citus....
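To make the suggested explanation concrete, here's a small sketch of the sharding and co-location step (the `rollup_events` table name and the connection variable are illustrative, not necessarily the lab's exact schema):

```bash
psql "$POSTGRES_CONNECTION" <<'SQL'
-- Distribute (shard) the raw events table and a rollup table on the same column.
SELECT create_distributed_table('events', 'customer_id');
SELECT create_distributed_table('rollup_events', 'customer_id');
-- Sharding both tables on customer_id means a given customer's raw rows and
-- rollup rows land on the same worker node, so joins between them stay local.
SQL
```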
We should also create a separate section for create rollup tables.
We should also explain why rollup tables are useful. Something like:
In this task, you will create two rollup tables for storing aggregated data pulled from the raw events table. Rollup tables allow you to pre-aggregate data for certain commonly run queries so that you don't need to aggregate the data repeatedly at query time. This gives you longer retention, compression of results, and faster query times by reducing the amount of data to scan.
The rollup tables you will create are...
Exercise 2
Exercise 4:
This section has some good content about rollup tables. I'm wondering if we should move the creation of rollup tables into this section. That'll make sure people understand why we are creating rollup tables, why we are co-locating them, and how we plan to do the rollup and deal with delayed data, all in one section. Curious what your thoughts are.
In paragraph 5, do we need to cover the approaches we are not picking? I think the information there is pretty brief, and most users won't understand what we are trying to say from just that much content.
So, maybe we can do something like this:
You still need to be able to keep track of which events have already been aggregated. We will do that by tracking the sequence number. There are a few other approaches to doing this, and you can read more about why this approach is recommended here (link to blog post).
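If it helps, here's a hedged sketch of the sequence-number bookkeeping described above (all names are illustrative):

```bash
psql "$POSTGRES_CONNECTION" <<'SQL'
-- Watermark table: the highest event sequence number already aggregated,
-- tracked per rollup.
CREATE TABLE IF NOT EXISTS rollup_state (
    rollup_name         text PRIMARY KEY,
    last_aggregated_seq bigint NOT NULL DEFAULT 0
);
-- Each scheduled rollup run would then, inside one transaction:
--   1. lock and read its watermark (SELECT ... FOR UPDATE);
--   2. aggregate only raw events with seq > watermark into the rollup table;
--   3. advance the watermark, so re-runs and delayed data are never double-counted.
SQL
```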
Task 2:
A few comments on the queries we have in task 2 and structuring.
If we do that, we can add a description saying: “The following queries don’t have customer_id in the filter, so these queries will be executed in parallel across all the different nodes in the cluster, leading to fast query performance.”
For the queries with the filter, we can say: “As these queries have a filter on customer_id, Citus will route the queries to only the node which has the data for that particular customer, without needing to touch data for the remaining customers. This leads to faster performance, as you need to scan only a small portion of the data.”
To illustrate these better, I think we should do one EXPLAIN example each for a single-shard query and a cross-shard query to make sure they understand these concepts, and we can clearly show how one of them has 1 task and the other has 32.
I also think we should run a query against the raw table computing the same aggregate, and one against the hourly rollup table query we have here, and demo that querying the rollup table is much faster by showing the query runtime in pgAdmin. That'll help them understand and actually see the benefit of rollup tables.
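For reference, the suggested EXPLAIN pair could look roughly like this (the literal customer_id value is made up, and the task counts depend on the shard count):

```bash
psql "$POSTGRES_CONNECTION" <<'SQL'
\timing on
-- Cross-shard: no customer_id filter, so Citus fans the query out to every
-- shard (e.g. 32 tasks in the plan).
EXPLAIN SELECT count(*) FROM events;

-- Single-shard: the customer_id filter lets Citus route to exactly one task.
EXPLAIN SELECT count(*) FROM events WHERE customer_id = 42;
SQL
```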
@joelhulen: That would be my first round of feedback, focused primarily on the Postgres / Citus pieces. There is an important point around me not being able to do the Spark cluster creation, and hence the ingestion and visualization pieces. If you can help me fix the Spark one (the other one I should be able to fix on my own), I can complete the rest of the lab and provide more complete feedback.
@samay-sharma in the pre-requisites listed in Before the HOL, we state the following:
If you are using a Microsoft-provided Azure subscription, then you will most likely not be able to complete the OAuth 2.0 access components.
In Task 3, Step 8, you need to make sure you've replaced the `--scopes` value with your own. Again, if you did, then it may have to do with your permissions, as described above.
I am not sure why you cannot create a cluster. That seems like a platform-related error. Maybe try again later? Worst case, delete the Databricks workspace and re-create it.
@samay-sharma Were you ever able to complete the lab, given the information above? Today I plan on applying changes to the MCW based on feedback here, since feedback was officially closed EOD Wednesday and I need to finalize today. Please let me know soon if you have additional feedback or if you could not complete for some reason.
@joelhulen: I haven't. The Databricks cluster creation still hasn't worked. I retried creating the cluster, and it still gives me the same error.
I'm not sure how to do the OAuth 2.0 access components. I don't have any other subscription besides my Microsoft one, so I'm not sure how to take care of that.
The PowerBI stuff is fine.
I think my major feedback has been around the Citus pieces and it looks like others were able to do the lab successfully, so this seems something specific to my account.
I have addressed all the feedback and made a PR to the official Microsoft repo: https://github.com/microsoft/MCW-Real-time-data-with-Azure-Database-for-PostgreSQL-Hyperscale
Thank you so much, everyone, for taking your time to review this MCW. It was great working with you!
Please leave your review comments for our authors here.