"All event data is stored written to an Azure Databricks for PostgreSQL managed Hyperscale (Citus) database" should read as "All event data is stored written to an Azure DATABASE for PostgreSQL managed Hyperscale (Citus) database" - Instructor Guide
@midomsft good catch. Fixed :)
I'm assuming you want us to skip sections that say things like: [insert your custom workshop content here...]
Diagrams in the student and instructor guides are clear, and very nicely put together. Thumbs up!
My only question is around why Data Lake is not specified as Gen1 or Gen2.
Correct "Postres" to "PostgreSQL" in WDS deck, slide 19. :-)
In the hands-on lab, Task 6: Deploy Azure Database for PostgreSQL: you need to specify which type of PostgreSQL instance they should choose: Single server or Hyperscale (Citus) server group - PREVIEW.
In the prep for the lab, this part confused me:

> Create a variable to hold your resource group name. This will be used when creating other resources. Replace SUFFIX with your Microsoft alias, initials, or another value to ensure uniqueness.
>
> `resourcegroup=hands-on-lab-SUFFIX`
>
> Create a variable to hold your resource group location name. Replace the westus location with a location closest to you. This same location will be used when provisioning other Azure resources.
>
> `location=westus`
>
> For a list of valid location names, execute: `az account list-locations -o table`
>
> Enter the following command to create a subscription group: `az group create --name $resourcegroup --location $location`
Instead of creating the variables, because it didn't work, I just hardcoded the values when creating the resource group. I've never used Bash before, so I probably did something wrong with the variable creation. But what I did was just define the variable like this and press enter:
variablename = value
I went back to try it again, and did not include the spaces in my variable definition. I think it is worthwhile to put a note in there about that as many of us come from different programming languages and are used to that. Anyway, it works now.
Try without spaces.
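For anyone else who hits this, here's a minimal sketch of the Bash rule involved (the SUFFIX value below is just an example):

```bash
# Bash treats spaces around '=' as a command invocation, so this fails:
#   resourcegroup = hands-on-lab-jdoe   # error: resourcegroup: command not found
# Assignments must have no spaces around '=':
resourcegroup=hands-on-lab-jdoe
location=westus

# The variables then expand as expected in later commands:
az group create --name "$resourcegroup" --location "$location"
```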
Hey, just completed the "Before the HOL" section - all works fine, great job :-)
One question - on task 8 it says "From your LabVM, download the starter files by downloading a .zip copy of the Cosmos DB real-time advanced analytics GitHub repo..."
Did I miss a step where you have to create a LabVM? I didn't notice another prerequisites section at all (could be me, of course!) - does a user need to create a VM, or can they use their own laptop?
Is it normal that task 8 asks to redirect to the "Cosmos DB real-time advanced analytics" GitHub repo?
In Exercise 1, I was not able to connect to the instance because of the firewall. I needed to add a rule allowing all IPs to connect (0.0.0.0 to 255.255.255.255) because it recognized neither my public IP nor my private IP. I'm connecting through the MS corp network.
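For reference, here's one way to add that allow-all rule from the CLI, assuming a single-server instance (the resource group and server names are placeholders; for a Hyperscale (Citus) server group the equivalent setting lives in the portal, and an allow-all rule is a lab-only convenience, not production guidance):

```bash
# Allow-all firewall rule for an Azure Database for PostgreSQL single server.
az postgres server firewall-rule create \
  --resource-group hands-on-lab-SUFFIX \
  --server-name wwi-postgres-SUFFIX \
  --name AllowAll \
  --start-ip-address 0.0.0.0 \
  --end-ip-address 255.255.255.255
```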
The Exercise 3 (Kafka setup) instructions in the Hands-on Lab say:
"Within the folder, open the appconfig.json file in a text editor, such as Notepad."
However the actual file (and screenshot in the manual) show appsettings.json
The instructions probably need updating to refer to the correct file.
Hey, reaching a breaking error in Exercise 3. I've changed the appsettings.json, but when I try to run KafkaProducer.exe I get the error attached. Should there be a lib/[dlls] folder included in the app source, or have I missed a prereq somewhere? I've looked but can't see anything.
I'm running the windows version of the app.
@midomsft Sorry about that. I've just committed an update that includes all the missing files, if you want to download the new version and try again.
Thanks @joelhulen that's perfect, now working :-)
Ok completed both the instructor guide and hands on lab walk through now. Aside from a couple of issues (responded to above) this all went smoothly. Excellent job!
Great, and thank you! I will address all non-breaking issues once the review process is complete.
I'm going to skip this task and come back to it when I get an answer, but I know other things will not happen unless it is fixed. Details below.
Task 3, Step 8 is giving me an insufficient privileges error:
`ayman@Azure:~$ az ad sp create-for-rbac -n "wwi-oss-sp" --role "Storage Blob Data Contributor" --scopes /subscriptions/XXXXXXXXXXXXXXXXXXXXXX/resourceGroups/hands-on-lab-ayelghaz/providers/Microsoft.Storage/storageAccounts/wwiadlsayelghaz`

Output:

> Changing "wwi-oss-sp" to a valid URI of "http://wwi-oss-sp", which is the required format used for service principal names
> Found an existing application instance of "2b7a55ff-62f9-49d6-8324-c236c362c9a2". We will patch it
> Insufficient privileges to complete the operation.
@thesqlpro Are you using your own pay-as-you-go or MSDN-based Azure subscription, or a sponsored subscription provided by Microsoft? Please look at the Requirements listed in the Before the Hands-on Lab document. You must have permissions within your Azure subscription to create an App Registration and service principal within Azure Active Directory.
@joelhulen I'm using our internal MS subscription. I've done app registration before in AAD. I'll look into repeating these steps, but I don't know if my MSDN subscription still works.
@joelhulen I just skipped the step and completed the rest of the lab. I manually inserted data into the Postgres events table to test out my dataframe and the PowerBI report. It wasn't streaming because of that step that failed due to permissions (our exchange above). I will try to reproduce this lab if I can get my MSDN working properly.
Overall, the lab is easy to follow. Clean code. Everything works as expected. I really enjoyed working on this and learning. The comments in the notebook are easy to follow.
Thank you for including me in this!!
Great work @joelhulen . I'm still going through the content but wanted to add in feedback on the trainer / student guides and the outline sections. I'll go through the hands on labs tomorrow and then add feedback on that as a separate comment.
I'm adding comments as checkboxes to make it easy for you to address.
Reason: Pre-aggregated data seems a bit too specific for this stage of the discussion.
[x] Section 5: Change header to advanced dashboards.
[x] ~~They would like a way to more rapidly create reports and be able to display them on a dashboard that can be customized and show real-time updates.~~ They would like a way to create visually rich reports and be able to display them on a dashboard that can be customized and show real-time updates.
[x] 1. Paragraph 2: Because of this, the query engine parallelizes and routes incoming SQL queries across these servers for faster response times on large datasets.
[x] 3. Partitioning is the key to high performance and being able to scale out across several database nodes, as it allows you to break data up further into smaller chunks based on time windows. One of the keys to fast data loading is to avoid using large indexes. Traditionally, you would use block-range (BRIN) indexes to speed up range scans over roughly-sorted data. However, when you have unsorted data, BRIN indexes tend to perform poorly. Partitioning helps keep indexes small. It does this by dividing tables into partitions, avoiding fragmentation of data while maintaining smaller indexes. In addition, it allows you to query only a smaller portion of the data when you run queries for particular time windows, leading to faster SELECT performance.
[x] 1. I'd add some more content about pg_cron. Maybe add this after "WWI will create rollup aggregation functions that can be scheduled to run on a periodic basis using pg_cron." Part to add: pg_cron is a simple cron-based job scheduler for PostgreSQL that runs inside the database as an extension. That way, you don't need a separate server to schedule cron jobs within your database. As it uses the same syntax as regular cron, it is simple to schedule execution in a number of ways, such as minutely, hourly, daily, weekly, monthly, on a given day of the week, etc. (See the pg_cron sketch after this list.)
[x] 3. We are using ~~hll_has_bigint~~ `hll_hash_bigint` to hash the HLL columns device_id and session_id
I don't think we need to get into coordinator and workers at this stage.
[x] 2. The sharding logic is automatically handled for you by the Hyperscale (Citus) server group.
[x] 2. All of this can be done with minimal schema changes, and potentially no ~~and~~ code changes.
[x] 4. The Hyperscale option uses Citus to provide the distributed data storage ~~capabilities like sharding and various partitioning plans.~~ capabilities like sharding, query routing and query parallelization. However, it ~~was~~ is built ~~to be~~ as an extension of PostgreSQL...
Migrating our PostgreSQL databases to Azure with the Hyperscale server group (Citus) deployment option...
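Since one of the items above asks for more pg_cron content, here's a minimal sketch of what that scheduling could look like. The `$POSTGRES_CONNECTION` variable and the `compute_rollups()` function name are placeholders, not the lab's actual names:

```bash
# Illustrative pg_cron usage; assumes the extension is available on the server.
psql "$POSTGRES_CONNECTION" <<'SQL'
-- pg_cron runs inside the database as an extension, so no separate
-- scheduler server is needed.
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Standard cron syntax: run the (hypothetical) rollup function every 5 minutes.
SELECT cron.schedule('*/5 * * * *', 'SELECT compute_rollups()');
SQL
```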
@joelhulen, really well done! It's straightforward, with no pitfalls. Just a typo in the pre-lab, where the title Task 3 is used twice.
Preparation:
@miyamam, the trainer guide explains the customer's struggles: they have multiple tenants, and they need to shard the database to make queries run faster.
@elsantorini But WWI's current data size is so small that a single instance is enough.
@miyamam I agree about data size, but since the Citus-specific capabilities of the Hyperscale version are the things being highlighted and used in the lab, it is a requirement for successful completion of the lab, and for the scenario. We could provide a much larger test database, but that would be an unnecessarily large dependency for the lab.
So, I went through the prep for HoL and had the following comments.
@samay-sharma I've promised to hold off on commenting until after the review ends, but you bring up some good points (and you're currently blocked).
Thanks @joelhulen . So, I tried to complete the Hands on Lab but had a few issues:
While executing Task 3, Step 8, I got "Insufficient privileges to complete the operation." It looks like my subscription does not have the permissions required to create an App registration and service principal within Azure Active Directory to complete this task. Is that correct? If so, I'll look to get this addressed on my end.
While trying to create a Lab Cluster as per Exercise 2, Task 5, I got an error. The error was:
Time: 2019-06-26 17:01:01 PDT
Message: Cluster terminated. Reason: Cloud Provider Launch Failure
A cloud provider error was encountered while launching worker nodes. See the Databricks guide for more information.
Azure error code: ServerUnreachable
Azure error message: Azure API server unreachable, please try again later.
I don't know if this is something I did or if it is an issue. Can you help me with this?
For the Postgres / Citus pieces, I was able to work through all of them (just without data) and have a few inputs. I'll put those in my next comment.
Exercise 1 - Task 2: We have a description of partitioning which is the same as what we have in the Trainer Guide. Can you update this according to the feedback there as well?
Task 3: We run `create_distributed_table` in Task 2 but don't describe it. I think `create_distributed_table` should be in the Shard your tables section and not in Task 2.
I also think we should include a couple of lines to explain what `create_distributed_table` does. We can say that we are sharding each of the tables on the customer_id column: "This is done by calling the create_distributed_table function. When you run this function, Citus inserts metadata marking the table as distributed and creates shards on the worker nodes. Incoming data for these tables is then routed to the right node based on customer_id."
Then, we can continue with:
Because we are sharding on the same ID for our raw events table and rollup tables, the data stored in both types of tables is automatically co-located for us by Citus....
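To make the suggested explanation concrete, here's a small sketch of the sharding and co-location step (the `rollup_events` table name and the connection variable are illustrative, not necessarily the lab's exact schema):

```bash
psql "$POSTGRES_CONNECTION" <<'SQL'
-- Distribute (shard) the raw events table and a rollup table on the same column.
SELECT create_distributed_table('events', 'customer_id');
SELECT create_distributed_table('rollup_events', 'customer_id');
-- Sharding both tables on customer_id means a given customer's raw rows and
-- rollup rows land on the same worker node, so joins between them stay local.
SQL
```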
We should also create a separate section for create rollup tables.
We should also explain why rollup tables are useful. Something like:
In this task, you will create two rollup tables for storing aggregated data pulled from the raw events table. Rollup tables allow you to pre-aggregate data for certain commonly run queries so that you don't need to aggregate the data repeatedly at query time. This gives you longer retention, compression of results, and faster query times by reducing the amount of data to scan.
The rollup tables you will create are...
Exercise 2
Exercise 4:
This section has some good content about rollup tables. I'm wondering if we should move the creation of rollup tables into this section. That'll make sure people understand why we are creating rollup tables, why we are co-locating them, and how we plan to do the rollup and deal with delayed data, all in one section. Curious what your thoughts are.
In paragraph 5, do we need to cover the approaches we are not picking? I think the information there is pretty brief, and most users won't understand what we are trying to say from just that much content.
So, maybe we can do something like this:
You still need to be able to keep track of which events have already been aggregated. We will do that by tracking the sequence number. There are a few other approaches to doing this, and you can read more about why this approach is recommended here (link to blog post).
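If it helps, here's a hedged sketch of the sequence-number bookkeeping described above (all names are illustrative):

```bash
psql "$POSTGRES_CONNECTION" <<'SQL'
-- Watermark table: the highest event sequence number already aggregated,
-- tracked per rollup.
CREATE TABLE IF NOT EXISTS rollup_state (
    rollup_name         text PRIMARY KEY,
    last_aggregated_seq bigint NOT NULL DEFAULT 0
);
-- Each scheduled rollup run would then, inside one transaction:
--   1. lock and read its watermark (SELECT ... FOR UPDATE);
--   2. aggregate only raw events with seq > watermark into the rollup table;
--   3. advance the watermark, so re-runs and delayed data are never double-counted.
SQL
```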
Task 2:
A few comments on the queries we have in task 2 and structuring.
If we do that, we can add a description saying: “The following queries don’t have customer_id in the filter, so these queries will be executed in parallel across all the different nodes in the cluster, leading to fast query performance.”
For the queries with the filter, we can say: “As these queries have a filter on customer_id, Citus will route the queries to only the node which has the data for that particular customer, without needing to touch data for the remaining customers. This leads to faster performance, as you need to scan only a small portion of the data.”
To illustrate these better, I think we should do one EXPLAIN example each for a single-shard query and a cross-shard query to make sure they understand these concepts, and we can clearly show how one of them has 1 task and the other has 32.
I also think we should run a query against the raw table computing the same aggregate, and one against the hourly rollup table query we have here, and demo that querying the rollup table is much faster by showing the query runtime in pgAdmin. That'll help them understand and actually see the benefit of rollup tables.
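For reference, the suggested EXPLAIN pair could look roughly like this (the literal customer_id value is made up, and the task counts depend on the shard count):

```bash
psql "$POSTGRES_CONNECTION" <<'SQL'
\timing on
-- Cross-shard: no customer_id filter, so Citus fans the query out to every
-- shard (e.g. 32 tasks in the plan).
EXPLAIN SELECT count(*) FROM events;

-- Single-shard: the customer_id filter lets Citus route to exactly one task.
EXPLAIN SELECT count(*) FROM events WHERE customer_id = 42;
SQL
```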
@joelhulen: That would be my first round of feedback, focused primarily on the Postgres / Citus pieces. There is an important point around me not being able to do the Spark cluster creation, and hence the ingestion and visualization pieces. If you can help me fix the Spark one (the other one I should be able to fix on my own), I can complete the rest of the lab and provide more complete feedback.
@samay-sharma in the pre-requisites listed in Before the HOL, we state the following:
If you are using a Microsoft-provided Azure subscription, then you will most likely not be able to complete the OAuth 2.0 access components.
In Task 3, Step 8, you need to make sure you've replaced the `--scopes` value with your own. Again, if you did, then it may have to do with your permissions, as described above.
I am not sure why you cannot create a cluster. That seems like a platform-related error. Maybe try again later? Worst case, delete the Databricks workspace and re-create it.
@samay-sharma Were you ever able to complete the lab, given the information above? Today I plan on applying changes to the MCW based on feedback here, since feedback was officially closed EOD Wednesday and I need to finalize today. Please let me know soon if you have additional feedback or if you could not complete for some reason.
@joelhulen: I haven't. The Databricks cluster creation still hasn't worked. I retried creating the cluster, and it still gives me the same error.
I'm not sure how to do the OAuth 2.0 access components. I don't have any other subscription besides my Microsoft one, so I'm not sure how to take care of that.
The PowerBI stuff is fine.
I think my major feedback has been around the Citus pieces and it looks like others were able to do the lab successfully, so this seems something specific to my account.
I have addressed all the feedback and made a PR to the official Microsoft repo: https://github.com/microsoft/MCW-Real-time-data-with-Azure-Database-for-PostgreSQL-Hyperscale
Thank you so much, everyone, for taking your time to review this MCW. It was great working with you!
Please leave your review comments for our authors here.