pydatabangalore / talks

Talks at PyData Bangalore meetups
MIT License

Lessons learned by a rookie data scientist from working in a real data team #10

Closed adityalahiri closed 4 years ago

adityalahiri commented 5 years ago

Title

Lessons learned by a rookie data scientist from working in a real data team.

Description

Things about data that can only be learned when working at scale and in a team. Both technological and ideological.

Duration

Audience

The talk is centered on key takeaways for beginner data scientists who plan to work in the field. It blends technology insights with ways of approaching one's work as part of a data science team. The audience is expected to have a broad understanding of what data science entails and some intermediate knowledge of Python and its libraries to follow the technological aspects.

Outline

The talk will broadly follow this Medium article that I have written. In addition, I will talk in some detail about how one might work efficiently with big data, such as CSV files of around 20-30 GB, using chunks and other methods. The outline of the talk follows.

  1. The scale is the nemesis (5 mins)

If it were a Kaggle competition, a 7 GB model that writes your final predictions to a CSV with x% accuracy is as good as a 2 MB one with (x-1)%. In real life, not so much. What matters is not just the current performance of the model but also how it will evolve over time. Faster iterations need agile models: models that can improve as new data is added and that do not take a huge amount of resources to deploy and to obtain feedback from. A crucial lesson was that accuracy is not the only metric and that scalability is an equally important factor.

  2. The team delivers, not just the data scientist (10-15 mins)

In my usual pet projects, I was the one who cleaned the data, the one who tried out models, and the one who ignored reproducibility and maintenance. Projects that add value are made by the combined effort of a data team. This can be a group of roughly 7–8 people, depending on the maturity of the team and the org.

A data scientist is one member of the team, one whose primary role is to drive analysis forward from the data and to gather and report insights, using statistical and deep learning models if required to aid in the process. Another is the data engineer, who sits somewhere between a backend software developer and a big data analyst and is typically in charge of managing data workflows, pipelines, and ETL processes. Then there are members of the business team, who communicate with clients, understand their problems, and convey them to the team. They also take the insights from the team, break them down in terms of what the end user needs, and convey that to the clients. Besides these people, a well-working data team usually also has someone from high up in the ranks of the org who participates in its day-to-day functioning. This person helps make the team's presence felt in the org and communicates the work of its members to others.

  3. Security and Trust (5 mins)

Data is the new oil. Data is also the new electricity. Both can be stolen or misused if not taken care of. Working with sensitive data on local machines is a big no-no. Data cannot be freely moved in and out on a whim. The trust of the people and organizations who share their data with your org is of utmost importance, and that trust is ensured through security: security of each data source and of each piece of code written to work on that data source. The people in charge of this have a huge responsibility on their shoulders. As an intern, it was very important for me to remember the security aspect of my work and to use and store my data judiciously. This almost never comes up in personal projects. So staying alert while working with the data and its flow is right up there in the list of necessary steps to take.

  4. Communicate. Ask. Do not get stuck. (5 mins)

One thing that is quite different about working in an organization, as compared to working on an individual project, is that you have a number of wiser heads all around you. Someone has probably already worked with the new framework you are going to use or the new preprocessing technique you are about to try out. It is best to politely ask these people for some of their valuable time and, once you have a sound background on what you are about to attempt, have a quick one-on-one brainstorm with them. This will give you a definite roadmap for the project ahead and remove any wrong notions you might have had. The same applies if you have been stuck on a little bug for a while. Ask!

This is the link to the complete Medium post
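To illustrate the point about working with CSV files of 20-30 GB in chunks, here is a minimal sketch using only the Python standard library. It is not taken from the talk or the article: the helper names `iter_chunks` and `count_by_column`, the column names, and the sample data are all hypothetical. The idea is to read a bounded number of rows at a time, reduce each chunk to a small summary, and discard the chunk before reading the next one.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def count_by_column(csv_file, column, chunk_size=100_000):
    """Count the values of one column, holding only one chunk in memory."""
    reader = csv.DictReader(csv_file)
    totals = Counter()
    for chunk in iter_chunks(reader, chunk_size):
        # Reduce the chunk to a small summary; the chunk itself is then dropped.
        totals.update(row[column] for row in chunk)
    return totals


# Tiny in-memory stand-in for a multi-gigabyte file (hypothetical data).
sample = io.StringIO(
    "city,amount\nDelhi,10\nGoa,5\nDelhi,7\nBangalore,3\nGoa,1\n"
)
city_counts = count_by_column(sample, "city", chunk_size=2)
print(city_counts)
```

For a real file, the `io.StringIO` object would be replaced by something like `open(path, newline="")`. Those who prefer pandas can get the same pattern from `pandas.read_csv(path, chunksize=...)`, which yields one DataFrame per chunk instead of a list of rows.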

Additional notes

I am a final-year computer science undergraduate from BITS Pilani, Goa. I just completed my summer internship at SocialCops, Delhi, and I am now in Bangalore for a 6-month internship at American Express, Big Data Labs. I have been working in the field of data science and machine learning for the past 2 years. I have given talks at my college on getting started with data science as a student. I have also been a part of the core organizing committee of Google Developers Group, Goa. During my first year in college, I hosted Mr. Joel Spolsky, CEO and founder of Stack Overflow and Trello, for a talk at my college's tech fest, Quark. Since I am in Bangalore for at least the next 6 months, I am looking forward to being involved in more tech meetups and contributing to them if possible.