Bridging Data and Code: A Software Engineer's Perspective on Data Pipeline Setup

harshitkohli1997 commented 1 year ago

Title

Describe your Talk

As a software engineer, I transitioned into a data engineering role within my organization. The move was driven by my interest in building analytics ecosystems and by the need to reduce CPU load on our production database. Drawing on my software engineering skills, particularly in Python, I designed and implemented data extraction, transformation, and loading (ETL) processes and set up robust data pipelines. This relieved the strain on our production database and kept data flowing reliably through our analytics ecosystem. I used various technologies and frameworks to optimize data storage, retrieval, and processing, which enabled our organization to derive insights from its large data repositories. The transition let me build on my software engineering experience and contribute to the organization's data-driven decision-making.

Pre-requisites & reading material

Basic knowledge of Python and enthusiasm to learn

Description:

Introduction: (5 mins)

Topic brief introduction
Why data lake?
How Python web developers with limited data engineering knowledge can create robust, production-ready data pipelines

Unlocking Data Insights: How Cloud Storage Empowers Data Lakes: (5 mins)

Introduction to big data file formats (Avro, Parquet, etc.) and columnar formats
The art of choosing an appropriate file format for specific use cases
Transactional capabilities on top of existing data lakes or data storage
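
To make the row versus columnar distinction above concrete, here is a minimal sketch in plain Python. It is not how Avro or Parquet are implemented; it only illustrates why an aggregate over one field touches less data in a column-oriented layout. The records, field names, and helper functions are hypothetical, and the pivot assumes every record has the same fields.

```python
"""Row-oriented vs column-oriented layout, standard library only (illustrative)."""

# Hypothetical order records; the field names are made up for this example.
rows = [
    {"order_id": 1, "city": "Delhi", "amount": 120.0},
    {"order_id": 2, "city": "Gurgaon", "amount": 80.5},
    {"order_id": 3, "city": "Delhi", "amount": 42.0},
]


def to_columnar(records):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every record carries the same set of fields.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_layout(records):
    """Row layout: every full record is visited just to read one field."""
    return sum(record["amount"] for record in records)


def total_amount_column_layout(columns):
    """Column layout: only the 'amount' column is read."""
    return sum(columns["amount"])


columns = to_columnar(rows)
```

Both functions return the same total; the difference is that the columnar version never looks at `order_id` or `city`, which is the property analytical queries benefit from.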

Setting up a data pipeline in production: (12 mins)

Introduction to PySpark and setting up a hello-world data pipeline; implementing a data pipeline in a practical setting (data collection, data processing, storage and organization, analysis and insights, monitoring and optimization)
Querying our data lake with PySpark, optimising the data lake, and partitioning techniques
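
The pipeline stages and the partitioning idea listed above can be sketched without a Spark cluster. The example below is a standard-library stand-in, not the talk's actual pipeline: an in-memory SQLite table plays the production database, CSV files play the data lake, and the table, column, and function names are all hypothetical. In PySpark the write step would typically go through `DataFrame.write.partitionBy(...)` instead.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

FIELDS = ["order_id", "order_date", "amount"]


def extract(conn):
    """Read raw rows from the source database (stand-in for production)."""
    return conn.execute(
        "SELECT order_id, order_date, amount FROM orders"
    ).fetchall()


def transform(rows):
    """Derive a partition key (YYYY-MM) from each ISO order date."""
    return [
        {"order_id": oid, "order_date": date, "amount": amount, "month": date[:7]}
        for oid, date, amount in rows
    ]


def load(records, lake_root):
    """Write one directory per month, Hive-style: month=YYYY-MM/part-0.csv."""
    by_month = {}
    for record in records:
        by_month.setdefault(record["month"], []).append(record)
    for month, group in by_month.items():
        part_dir = Path(lake_root) / f"month={month}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            for record in group:
                # The partition key lives in the directory name, not the file.
                writer.writerow({name: record[name] for name in FIELDS})


def monthly_total(lake_root, month):
    """Partition pruning: open only the directory for the requested month."""
    part_dir = Path(lake_root) / f"month={month}"
    total = 0.0
    for path in sorted(part_dir.glob("*.csv")):
        with open(path, newline="") as f:
            total += sum(float(row["amount"]) for row in csv.DictReader(f))
    return total


def run_demo():
    """Run extract, transform, load, then query a single partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "2023-06-30", 100.0), (2, "2023-07-01", 50.0), (3, "2023-07-15", 25.0)],
    )
    with tempfile.TemporaryDirectory() as lake_root:
        load(transform(extract(conn)), lake_root)
        partitions = sorted(p.name for p in Path(lake_root).iterdir())
        return partitions, monthly_total(lake_root, "2023-07")
```

Calling `run_demo()` returns the partition directory names together with the July total. The point of the layout is that `monthly_total` never opens the June file, so query cost scales with the partitions requested rather than with the whole lake.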

Impact of setting up a data lake: (3 mins)

The production database's CPU utilization dropped by almost 45-50%.
We achieved cost savings exceeding $50,000 by transitioning from BigQuery to our internally developed data lake solution.
Improved the analytics ecosystem and data integration from diverse sources.

Conclusion and Q&A: (5 mins)

Open discussion and Q&A session to address participants' queries and provide further clarification on the topics.
Summary of key takeaways and recommendations for participants to apply the learned techniques in their own ecosystems
Closing remarks

Time required for the talk

30 mins

Link to slides/demos

No response

About you

Harshit Kohli is a 25-year-old software engineer currently working at Milkbasket (India's first and largest daily micro-delivery service), with a background at industry giants like Blinkit and Classplus. He is self-studying data engineering.

Availability

22/07/2022 or any other day

Any comments

No response

pulsar17 commented 1 year ago

Hi @harshitkohli1997, a few questions:

Are you part of the Telegram group? (If no, please share your username. If yes, please share your username)
Is there a particular time slot you have in your mind? (The meetup timings are 1-5 pm generally)

Animesh-Ghosh commented 1 year ago

@harshitkohli1997 hey, just pinging since we wanted to finalize the talks.

harshitkohli1997 commented 1 year ago

Hi, yes, I am willing to speak. Any time after 2 works for me. No, I'm not part of the Telegram group. Username: harsh_it25
