viphat / til

Today I Learned
http://notes.viphat.work

Data Engineers vs Data Scientists #287

Open viphat opened 6 years ago

viphat commented 6 years ago

Source: O'Reilly

It's important to understand the differences between a data engineer and a data scientist. Misunderstanding or not knowing these differences is causing teams to fail or underperform with big data.

Misconceptions

[Figure 1]

Both positions work on big data. However, what each position does to create value or data pipelines with big data is very different. This difference comes from the base skills of each position:

[Figure 2]

Data scientists' skills

At their core, data scientists have a math and statistics background (sometimes physics). Out of this math background, they're creating advanced analytics. On the extreme end of this applied math, they're creating machine learning models and artificial intelligence.

Just like their software engineering counterparts, data scientists will have to interact with the business side. This includes understanding the domain well enough to produce insights. Data scientists are often tasked with analyzing data to help the business, and this requires a level of business acumen. Finally, their results need to be given to the business in an understandable fashion. This requires the ability to communicate complex results and observations, verbally and visually, in a way that the business can understand and act on.

A data scientist is someone who has augmented their math and statistics background with programming to analyze data and create applied mathematical models.

In order to accomplish a more complicated analysis or because of an otherwise insurmountable problem, data scientists learned how to program. Their programming and system creation skills aren't at the level that you'd see from a programmer or data engineer - nor should they be.

Data engineers' skills

At their core, data engineers have a programming background. This background is generally in Java, Scala, or Python. They have an emphasis or specialization in distributed systems and big data. A data engineer has advanced programming and system creation skills.

A data engineer is someone who has specialized their skills in creating software solutions around big data.

Using these engineering skills, they create data pipelines. Creating a data pipeline may sound easy or trivial, but at a big data scale, this means bringing together 10-30 different big data technologies. More importantly, a data engineer is the one who understands and chooses the right tools for the job. A data engineer is the one who understands the various technologies and frameworks in-depth, and how to combine them to create solutions to enable a company's business processes with data pipelines.

Overlapping skills

There is an overlap between a data scientist and a data engineer. However, the overlap happens at the ragged edges of each one's abilities.

For example, they overlap on analysis. However, a data scientist's analytics skills will be far more advanced than a data engineer's. A data engineer can do some basic to intermediate level analytics, but will be hard-pressed to do the advanced analytics that a data scientist does.

Both a data scientist and a data engineer overlap on programming. However, a data engineer's programming skills are well beyond a data scientist's programming skills. Having a data scientist create a data pipeline is at the far edge of their skills, but is the bread and butter of a data engineer. In this way, the two roles are complementary, with data engineers supporting the work of data scientists.

Data engineers use their programming and systems creation skills to create big data pipelines. Data scientists use their more limited programming skills and apply their advanced math skills to create advanced data products using those existing data pipelines. This difference between creating and using lies at the core of a team's failure or underperformance with big data. A team that expects its data scientists to create data pipelines will be woefully disappointed.
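To make the creating-versus-using split concrete, here is a toy sketch in Python. It is not from the article: the event data, `run_pipeline`, and `average_spend_per_user` are hypothetical names invented for illustration, and a real big data pipeline would run on distributed systems rather than an in-memory list.

```python
from statistics import mean

# Hypothetical raw events, as they might arrive from an upstream source.
RAW_EVENTS = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "not-a-number"},  # malformed record
    {"user": "a", "amount": "4.5"},
    {"user": "c", "amount": "20"},
]


def run_pipeline(raw_events):
    """Data engineering side (creating): parse, validate, drop bad records."""
    clean = []
    for event in raw_events:
        try:
            record = {"user": event["user"], "amount": float(event["amount"])}
        except (KeyError, ValueError):
            continue  # skip records that are missing fields or unparsable
        clean.append(record)
    return clean


def average_spend_per_user(clean_events):
    """Data science side (using): analysis built on the pipeline's output."""
    amounts_by_user = {}
    for event in clean_events:
        amounts_by_user.setdefault(event["user"], []).append(event["amount"])
    return {user: mean(amounts) for user, amounts in amounts_by_user.items()}


clean_events = run_pipeline(RAW_EVENTS)
print(average_spend_per_user(clean_events))
```

The analysis function assumes clean, typed records and never touches parsing or validation. That dependency is the point: the analysis only works once someone has built and maintained the pipeline underneath it.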

When organizations get it wrong

Data scientists doing data engineering

Ratios of data engineers to data scientists

Having more data scientists than data engineers is generally an issue. It typically means that an organization is having its data scientists do data engineering. As I’ve shown, this leads to all sorts of problems.

You need more data engineers because more time and effort is needed to create data pipelines than to create the ML/AI portion.

Data engineers doing data science

The need for machine learning engineers

[Figure 3]

Machine learning engineers primarily come from data engineering backgrounds. They're cross-trained enough to become proficient at both data engineering and data science.

A machine learning engineer is someone who sits at the crossroads of data science and data engineering and is proficient in both.

Machine learning engineers and data engineers

The transition from data engineer to machine learning engineer is a slow-moving process.

To explain what I mean by slow moving, I will share the experience of people I’ve seen make the transition from data engineer to machine learning engineer. They spent years doing development work as software engineers and then as data engineers. Some had always had an interest in statistics or math; others just got bored with the constraints of being a data engineer. Either way, this transition took years. I’m not seeing people become machine learning engineers after taking a beginning stats class or a beginning machine learning course.

What to do?

... A new position, machine learning engineer. As your data science and data engineering teams mature, you’ll want to check the gaps between the teams. You may need to promote a data engineer on their way to becoming a machine learning engineer or hire a machine learning engineer. ...