Closed uakbr closed 2 years ago
Language: English
Transcription: Hello everyone, welcome to AWS Tutorials. In AWS Tutorials, we provide workshops and exercises to learn about AWS services. These workshops and exercises are published on our website, aws-dojo.com. You can follow them to implement certain scenarios and learn about AWS services. Today we are going to talk about how you can use a Kinesis delivery stream to transform data using Glue.

Let's get some introduction about this topic first. If you are building a delivery stream using Kinesis Data Firehose, you can do transformation in two ways: one is using a Lambda function, and the second is using the Glue Data Catalog. If you are not aware of Kinesis, just to give you a short introduction: Kinesis is used to ingest a high volume of streaming data into AWS. Kinesis partitions data into shards, and you can use multiple shards to enable high-volume data ingestion. Once data has been ingested using a Kinesis data stream, then in order to deliver this data to a destination, you can use Kinesis Data Firehose to create a delivery stream that delivers it to destinations like S3 or Redshift, or even third-party tools like Splunk.

When you are delivering your data through a delivery stream, sometimes there is a requirement to transform the data. When you want to transform the data, there are two methods available: using a Lambda function or using a Glue Data Catalog schema. The primary difference is that when you use a Lambda function, you can do any kind of transformation. You can do standard transformations, like converting CSV into JSON or JSON into some other format, and you can do any kind of custom transformation, because at the end of the day you are writing your own code; the Lambda function code performs the transformation, so you can transform the data any way you want. When you use the Glue Data Catalog, you are not required to write even a single line of code; you simply transform your data using a schema conversion. But with the Glue Data Catalog, only certain transformations are possible, and today I'm going to focus on the Glue Data Catalog based transformation.

Let's focus a bit on that. If you transform your data using the Glue-based transformation, it can convert only the JSON format into the Apache Parquet or Apache ORC format. So your source format has to be JSON and your destination format has to be Apache Parquet or Apache ORC. If you have any other combination, then you should be using a Lambda function for that kind of transformation, but again, you have to write the code. Now, sometimes you might have a scenario where your source data is in CSV or XML format and you want to convert it into Apache Parquet. If that is the case, then you can very well use this method along with Lambda: you use Lambda to convert your XML or CSV source data into JSON format, and then from JSON you use the Glue transformation to convert into the Parquet or ORC format (a rough sketch of such a Lambda transform is shown below). So you can do that combination as well. The good part about this method of transformation is that you don't need to write a single line of code for your transformation; the limitation, or you can say the bad part, is that it can only convert JSON into the Apache Parquet or Apache ORC format. Good. So this is the part I'm going to focus on today.
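As a side note on that Lambda-plus-Glue combination: the workshop itself does not use a Lambda transform, but a minimal sketch of what such a Firehose transformation function could look like is shown here, assuming a fixed CSV column order of firstname, lastname, age (an illustration, not the workshop's code).

```python
import base64
import json

# Sketch of a Firehose record-transformation Lambda that turns CSV records into
# JSON, so the Glue-based conversion to Parquet/ORC can then take over.
# The column order (firstname, lastname, age) is an assumption for illustration.
def lambda_handler(event, context):
    output = []
    for record in event['records']:
        csv_line = base64.b64decode(record['data']).decode('utf-8').strip()
        firstname, lastname, age = csv_line.split(',')
        payload = json.dumps({'firstname': firstname,
                              'lastname': lastname,
                              'age': int(age)})
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(payload.encode('utf-8')).decode('utf-8'),
        })
    return {'records': output}
```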
Now let's see what we are going to implement. We will configure a Kinesis data stream, which will be used to ingest data, and the data will be ingested using a Cloud9-based Kinesis client that simply sends data in JSON format. Then we will use Kinesis Firehose to create a delivery stream that converts the JSON into Parquet and stores it in S3 as the destination, and in order to convert JSON into Parquet, we will use the Glue Data Catalog. So this is what we are going to build today.

There is a workshop published on our website, aws-dojo.com, which you can use to follow step-by-step instructions to complete this scenario and learn about Kinesis, Glue, and the other services involved in this implementation. The URL of the workshop is provided in the description box, and you can follow that, but anyway, I'm going to show you the steps involved in implementing this scenario. So let's go to the website. Here is our website where we have published this particular workshop: there is a small introduction about the workshop, then a link called Start the Workshop. When you start the workshop, it shows you the steps you need to perform to complete it, and there are seven steps in total. Let's go through those steps one by one.

The very first step is straightforward: you have to have an AWS account in order to complete the workshop, and if you don't have one, you might want to create a trial account. The next step is to create a role. The Kinesis Firehose delivery stream is going to call several services: it will call Glue to fetch the schema details for the transformation, and it will write the data to an S3 bucket as the destination. So we are going to create a role which will be used by Kinesis Firehose to perform these activities. Let's go and create an IAM role. You go to the IAM console and create a role; I'm giving it power user access just to simplify things, but if you are going for an actual production implementation, you want to use very specific permissions, like permission to write to certain buckets only and permission to perform only certain get operations on Glue, those kinds of things. I'm keeping it simple for now. So you create this dojo-kinesis role and save it.

Once that is done, the next step is to create the destination and also create a data schema in Glue, which will then be used by the Kinesis delivery stream to do the transformation. The first step is to go ahead and create an S3 bucket; we are creating an S3 bucket called dojo-kinesis-destination, and you simply create the bucket. After that, we go to Lake Formation and create a database, a table, and a schema inside the table, which is then used by the Kinesis delivery stream for the transformation. First you create the database, so we are in Lake Formation and we name this database dojodatabase. Once we have created the database, we simply go and create a table called dojotable, keeping it simple. When you create the table, you give it a name, you mention its location in the S3 bucket, you specify what data format it will have, which is JSON in this case, and then you also have to provide the schema details (a Boto3 sketch of the equivalent Glue calls is shown below).
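For reference, the database and table created here through the Lake Formation console could also be created with Boto3. Below is a minimal sketch, assuming the names follow the workshop's naming and the table location points at the destination bucket; the three columns correspond to the schema described in the next step.

```python
import boto3

# Sketch only: names, region, and the table location are assumptions based on the
# workshop's naming; the workshop itself creates these through the console.
glue = boto3.client('glue', region_name='eu-west-1')

glue.create_database(DatabaseInput={'Name': 'dojodatabase'})

glue.create_table(
    DatabaseName='dojodatabase',
    TableInput={
        'Name': 'dojotable',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'firstname', 'Type': 'string'},
                {'Name': 'lastname', 'Type': 'string'},
                {'Name': 'age', 'Type': 'int'},
            ],
            'Location': 's3://dojo-kinesis-destination/',
            'SerdeInfo': {
                # JSON input format, matching the format chosen in the console
                'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'
            },
        },
    },
)
```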
Here is the schema I created. There are three columns in this JSON input data: firstname, lastname, and age, of type string, string, and integer. When I show you the message format later in the exercise, you will be able to map it back to this schema. So here I am saying: I am going to ingest the data in JSON format, and that JSON will have three fields in its schema, firstname, lastname, and age, of type string, string, and integer. Once you have done this schema configuration (and so far, as you have seen, I have not written a single line of code), we go and create a Kinesis data stream and a delivery stream.

So we go to the Kinesis console. First, we create a Kinesis data stream, which is used to ingest the data. We simply give it the name dojostream and keep just one shard, which is good enough for this exercise, and then we create the data stream. Once the data stream has been created, we go and create a delivery stream, and we give it the name dojo-delivery-stream. Then we say that this delivery stream is going to pick up data from the Kinesis data stream where we will actually ingest the data. Next, I enable record format conversion, which means I want to convert my records, and I choose to convert them into the Apache Parquet format. At this point, if you look at the screen, you will see that right above this option there is an option to transform records using a Lambda function; we are not using that here, we are using the other method, where we use Glue. So I enable record format conversion and choose Apache Parquet, and then it asks for the schema details. I say my schema is defined in Glue in the Ireland region, in the dojodatabase database and the dojotable table, and to take the latest version from there; this is the database and table I created in the earlier step. Once I have done that, I simply specify where my destination is, where the data will be stored: my destination is an S3 bucket, the dojo-kinesis-destination bucket, and I configure it. I'm also providing some buffering details; for instance, I want to buffer records for 60 seconds before they get delivered to the destination. Then I also select the role we created earlier, to give Kinesis Firehose authorization to perform these different operations. That's all, and then we simply go and create the delivery stream. So my Kinesis side of the configuration is ready (a Boto3 sketch of this delivery stream configuration is shown at the end of this step).

Now let's work on the client. In order to create the client, we first set up a Cloud9 environment; we are using Cloud9 as the environment in which to build the Kinesis client. We simply create an environment using a very small t2.micro machine so that we can stay within AWS Free Tier usage, and then we deploy the Python Boto3 SDK there so that we can do Python programming with Kinesis.
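For reference, the Firehose delivery stream configuration described above roughly maps to the following Boto3 call. This is a sketch rather than the workshop's code (the workshop uses the console); the ARNs, region, and the 128 MB buffer size are placeholders or assumptions.

```python
import boto3

# Sketch of the delivery stream configuration described above.
firehose = boto3.client('firehose', region_name='eu-west-1')

role_arn = 'arn:aws:iam::123456789012:role/dojo-kinesis'                  # placeholder
stream_arn = 'arn:aws:kinesis:eu-west-1:123456789012:stream/dojostream'   # placeholder

firehose.create_delivery_stream(
    DeliveryStreamName='dojo-delivery-stream',
    DeliveryStreamType='KinesisStreamAsSource',
    KinesisStreamSourceConfiguration={
        'KinesisStreamARN': stream_arn,
        'RoleARN': role_arn,
    },
    ExtendedS3DestinationConfiguration={
        'RoleARN': role_arn,
        'BucketARN': 'arn:aws:s3:::dojo-kinesis-destination',
        'BufferingHints': {'SizeInMBs': 128, 'IntervalInSeconds': 60},
        'DataFormatConversionConfiguration': {
            'Enabled': True,
            # JSON records in ...
            'InputFormatConfiguration': {'Deserializer': {'OpenXJsonSerDe': {}}},
            # ... Parquet files out
            'OutputFormatConfiguration': {'Serializer': {'ParquetSerDe': {}}},
            # Schema comes from the Glue table created in the earlier step
            'SchemaConfiguration': {
                'RoleARN': role_arn,
                'DatabaseName': 'dojodatabase',
                'TableName': 'dojotable',
                'Region': 'eu-west-1',
                'VersionId': 'LATEST',
            },
        },
    },
)
```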
We simply configure the environment with pretty straightforward steps, and after that, we create and run the client. So let's create the client. We create a new file, and this is the code of the client (a rough sketch of such a client is included after the transcript below). We are simply creating a Kinesis client, and then this is the data I want to send. Now you will be able to relate it to the schema I defined: in the schema I defined firstname, lastname, and age as the three fields and set the format to JSON, and this is the message I'm sending, which matches that schema. This JSON document will be converted into the Parquet format through the delivery stream configuration. So this is my message, and this is my partition key, for which I simply generate a random value. Then I use the put_record method to send records one by one; I could have used put_records to send multiple records at once, but let's keep it simple for now. I'm simply telling it to send my data to this dojostream. Again, you can see that this JSON document matches the schema defined earlier.

Once you have created this client file, you save it as a Python file, something like kinesisclient.py, and then we simply run it. Running the file puts the record into the Kinesis data stream, which will then be picked up by the delivery stream, transformed, and written to the S3 bucket. We wait for 60 seconds, because that's the buffering time we gave: the buffering condition is 128 MB or 60 seconds, and since it would take a very long time to accumulate 128 MB of data, the 60-second interval applies, so let's wait for 60 seconds. After that, if you go and check your destination S3 bucket, you can see that your file has been transformed and delivered there. If you look at the file, you can see that a parquet extension has actually been appended to it. This is the Parquet format, a compressed columnar format, which is not readable if you open it in a normal text editor. There is a nice online Parquet viewer tool which you can use to open this kind of file; if you open the file there, you can see your data. So that pretty much finishes the workshop. The next step is to clean up the resources you created so that you don't end up paying any service cost for this workshop.

So this was all about this workshop, where you can see how you can use Kinesis and Glue, without writing even a single line of code, to transform your JSON data into the Apache Parquet or Apache ORC format. Hope you like this video, and if you do, please click on the like button and subscribe to our channel. If you have any feedback about our workshops and exercises, you can either provide it in the YouTube comments or reach out to us through the Contact Us page. There are many other exercises and workshops like this one on our aws-dojo website to learn about AWS services; I recommend you explore those exercises, pick the ones you like, and implement scenarios like this to learn about AWS services. That's all for today. Hope you like this video. Thank you very much for your time and have a nice day. Bye bye.
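As referenced in the walkthrough above, here is a minimal sketch of what the Cloud9 Kinesis client could look like. It is not the workshop's exact code: the region is an assumption, and the field and stream names follow the naming used in the transcript.

```python
import json
import random
import boto3

# Sketch of the Cloud9 client described in the transcript. Region is an assumption.
kinesis = boto3.client('kinesis', region_name='eu-west-1')

# One record matching the Glue table schema (firstname, lastname, age).
message = {'firstname': 'John', 'lastname': 'Doe', 'age': 30}

# Send a single record with a random partition key, as described in the video.
kinesis.put_record(
    StreamName='dojostream',
    Data=json.dumps(message).encode('utf-8'),
    PartitionKey=str(random.randint(1, 1000)),
)
print('Record sent to dojostream')
```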
URL
https://www.youtube.com/watch?v=5HnYFaVv5m8