uakbr / whispe2.0


Transcribe audio #5

Closed uakbr closed 1 year ago

uakbr commented 1 year ago

URL

https://www.youtube.com/watch?v=OsJvgTmeyeE

github-actions[bot] commented 1 year ago

Language: English

Transcription: Hello, everyone. Welcome to AWS Tutorials. In AWS Tutorials, we provide workshops and exercises to learn about AWS services. But today I'm going to talk about something different. I am working with a couple of customers, and for one of them we are building a data lake. Recently we have been debating whether we should keep our data in CSV format or Parquet format. While doing the research, we came across a couple of key differentiators, and when I learned about them I said, why not do a little experiment and see? Because once you see it, you believe it. That's the experiment I'm sharing with you. If you are struggling with the same question, you can do this experiment yourself and make a more knowledgeable decision about which format to use.

The two key differentiators we came across were these. First, Parquet takes smaller storage compared to CSV. We were not sure how much smaller it would be, but we learned that Parquet uses a compressed, column-based storage format and is much smaller than CSV. Second, since Parquet is column-based, the amount of data you scan for a query is less. That's very interesting, because in CSV the data is stored in a row-oriented table format, while Parquet has an altogether different column-based layout. With that kind of storage, when you run aggregation functions and the like, only the columns used in your query are involved, so the total amount of data scanned is less.

We said, okay, sounds good. To be honest, these two points have quite an implication in AWS from the cost and performance point of view. Taking the first point: if Parquet takes less storage than CSV, it is going to cost us less in AWS, because if I'm storing data in S3, I'm paying less for storage. For the second point, the claim is that Parquet gives better query performance because it is column-based storage, which is good news. And since the amount of data scanned in Parquet is less, it again lowers the cost in AWS, because if you're using a solution like S3 and Athena for queries, or the Athena driver with your reporting tools, then Athena charges based on the amount of data you scan. If the amount of data scanned is less, you are paying less. So it definitely seems that Parquet is going to give me better cost performance. But how much better? Again, I don't know.

The biggest question was: can we visualize these things, see them happening, and then make a decision based on that? Seeing is believing. So we said, okay, let's do a small experiment in AWS, and based on that we'll try to make a more knowledgeable, data-driven decision. I'm going to share that experiment with you. If you want to do a similar experiment for your own decision, it's honestly not complicated and can be done in no time.

Let me go to my experiment. The first thing I did was go to a website called Kaggle.com, where you can get a variety of sample data. The data I used was the FIFA 20 complete player dataset, which is roughly 48 MB spread across six CSV files.
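The compression claim is easy to sanity-check locally before going anywhere near AWS. Here is a minimal sketch, assuming pandas and pyarrow are installed; the file name is a placeholder for one of the Kaggle CSVs, not something taken from the video:

```python
# Minimal local check of the CSV-vs-Parquet storage claim.
# Assumes: pip install pandas pyarrow; "players_20.csv" is a placeholder
# for one of the Kaggle FIFA 20 files.
import os

import pandas as pd

df = pd.read_csv("players_20.csv")
df.to_parquet("players_20.parquet")  # pyarrow backend, snappy compression by default

csv_bytes = os.path.getsize("players_20.csv")
parquet_bytes = os.path.getsize("players_20.parquet")
print(f"CSV:     {csv_bytes / 1e6:.1f} MB")
print(f"Parquet: {parquet_bytes / 1e6:.1f} MB "
      f"(~{csv_bytes / parquet_bytes:.1f}x smaller)")
```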
I said, okay, let's take this as sample data to work with. If you add up the rows across these files, there are roughly 100,000 rows. The first step was to create an S3 bucket. If I go to my S3 bucket and look at the CSV players folder, you can see the six files I uploaded from Kaggle, in CSV format. Look at the file sizes: 7.2 MB to 8.5 MB. This is important; make a mental note of it, because we'll come back to this comparison.

Once this data was stored, we went to Lake Formation. In Lake Formation we created a database first, then simply cataloged this CSV data as a table. If I open the table, you can see it is in CSV format, stored in this folder, and this is the schema of the data: 20 columns. With my CSV data ready, I now needed to convert it to Parquet. To do so, I created a Glue job. I went to Glue and created a new job, using the out-of-the-box job that converts CSV to Parquet. It's a pretty configurable job which generates a script for me. In the script, I'm reading the CSV players table from the Dojo database, then simply transforming it and writing it back to S3 in the same bucket, but under a different folder called Parquet players, in Parquet format. So basically I'm converting the same data from CSV to Parquet and storing it under that folder in my Dojo data bucket. When I ran the job, it did the ETL work, converted my CSV to Parquet, and stored the data at the destination.
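The generated script is not shown in full in the video, but the out-of-the-box CSV-to-Parquet Glue job produces something along these lines. This is a hedged reconstruction: the database, table, and bucket names (`dojo-database`, `csv_players`, `dojo-data`) are guesses based on what the speaker describes, not copied from the actual job:

```python
# Sketch of a Glue-generated CSV-to-Parquet ETL script (names are assumptions).
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the cataloged CSV table from the Glue / Lake Formation catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="dojo-database", table_name="csv_players"
)

# Write the same rows back to S3 as Parquet under a different prefix.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://dojo-data/parquet_players/"},
    format="parquet",
)

job.commit()
```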
If I go back to my Dojo data bucket and look at the Parquet players folder, you can see those six files have been converted to Parquet. Now, the first observation, which is very important: look at the data size. It is 2.3 to 2.7 MB per file, compared to 7.2 to 8.5 MB before. That means your data in Parquet takes roughly a third of the storage, and therefore roughly a third of the storage cost, if you keep your data in Parquet format. That was finding number one. Going back to my presentation, remember the first point: it has an implication on cost. Based on this experiment, if I keep my data in Parquet format, I am paying about three times less than CSV for storage. Not necessarily in all cases, I assume it depends on the data, but it can clearly make quite a difference.

Having done that, let's look at the query part, because the second argument is that, with Parquet being column-level storage, less data is scanned, which has quite an implication in terms of better performance and lower cost. Can we check that? We said, yes, let's do that. We went to Athena, where we have the database and these two tables, and we created two similar queries. What these queries do is simply aggregate: they take the average age per nationality of the players. We first run the query against the CSV players table, then run the same query against Parquet players, and compare.

Let's run the query first for CSV. When I run it, this is the result: the query took 2.6 seconds, and the data scanned was 48.2 MB. Remember the original dataset is about 48 MB, so it scanned the entire dataset to give me this result. Fair enough; that's how CSV works. Then we ran the same query against Parquet. The query time is 2.77 seconds, a little more than what we saw with CSV, but the data scanned is 170 KB compared to 48 MB. You can see this side by side: if I go to the history view, you can see the comparison of the queries; I have run them a couple of times just to test. Against CSV, the query completes in 2.6 seconds but scans 48.21 MB; against Parquet, it is slightly slower but scans only 170 KB. To be honest, I have run these queries several times, and sometimes one is faster, sometimes the other, and sometimes they are almost the same.

So, finding number two: I could not really establish that Parquet gives better query performance than CSV, or the other way around, in my experiments. Probably it can, but not here; the runtimes were mixed. What I could clearly establish is that the amount of data scanned in Parquet is far less than in CSV, and that has a direct positive impact on cost. As I mentioned earlier, if you're using a query tool like Athena, or reporting tools that use the Athena driver and the Glue catalog to query the data, the charges are based on the amount of data you scan. Since the amount scanned in Parquet is far less than in CSV, Parquet is going to give you much better query cost than CSV, even though the runtime was about the same or mixed between the two.
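To reproduce the comparison, the two Athena runs can also be scripted, which makes it easy to pull the runtime and data-scanned numbers from the query statistics. Here is a sketch using boto3; again, the database, table, and result-bucket names are placeholders, not taken from the video:

```python
# Run the same aggregation against both tables and compare Athena's
# execution statistics. All resource names are placeholder assumptions.
import time

import boto3

athena = boto3.client("athena")
QUERY = "SELECT nationality, AVG(age) AS avg_age FROM {table} GROUP BY nationality"

def run_and_report(table: str) -> None:
    qid = athena.start_query_execution(
        QueryString=QUERY.format(table=table),
        QueryExecutionContext={"Database": "dojo-database"},
        ResultConfiguration={"OutputLocation": "s3://dojo-data/athena-results/"},
    )["QueryExecutionId"]
    while True:  # poll until the query reaches a terminal state
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    stats = execution["Statistics"]
    print(f"{table}: {stats['EngineExecutionTimeInMillis']} ms, "
          f"{stats['DataScannedInBytes']} bytes scanned")

run_and_report("csv_players")      # expect the whole ~48 MB dataset scanned
run_and_report("parquet_players")  # expect only the two referenced columns scanned
```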
That was the little experiment, and it gave us a good idea of what kind of cost optimization we are looking at if we move from CSV to Parquet format. It really helped us make up our minds for our customer. If you are struggling with a similar question at your end, you can do a similar experiment, and if you want me to create a tutorial for it, I can do that as well; let me know. If you like the video, please click the Like button and subscribe to our channel if you want to see more such videos in the coming days; we try to upload at least two to three videos per week. You can always go to our website, aws-dojo.com, where you can find loads of workshops and exercises that implement particular scenarios, and by doing them you'll learn about AWS services. That was all for today. I hope you liked it. I look forward to your feedback and suggestions; you can provide them on our YouTube channel, or click the Contact Us button on our website. That's all for now. Thank you very much for your time. Have a nice day. Bye-bye.
