taghizadeh598 commented 8 months ago

Data mining is most commonly defined as the process of using computers and automation to search large sets of data for patterns and trends, turning those findings into business insights and predictions. Data mining goes beyond the search process, as it uses data to evaluate future probabilities and develop actionable analyses.

Data mining and machine learning are distinct processes that are often treated as synonymous. While both are useful for detecting patterns in large data sets, they operate very differently.

Data mining is the process of finding patterns in data. The beauty of data mining is that it helps to answer questions we didn’t know to ask by proactively identifying non-intuitive data patterns through algorithms (e.g., consumers who buy peanut butter are more likely to buy paper towels). However, the interpretation of these insights and their application to business decisions still require human involvement.
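To make the peanut butter example concrete, here is a minimal sketch of how such a pattern might be measured. The basket data is made up for illustration, and support, confidence, and lift are just one common way to score an association, not necessarily what any particular tool uses.

```python
import pandas as pd

# Hypothetical toy transactions: one row per basket, True if the item was bought.
baskets = pd.DataFrame(
    {
        "peanut_butter": [True, True, False, True, False, True],
        "paper_towels": [True, True, False, False, False, True],
    }
)

# Support: share of all baskets that contain both items.
support = (baskets["peanut_butter"] & baskets["paper_towels"]).mean()

# Confidence: among peanut butter baskets, the share that also contain paper towels.
confidence = baskets.loc[baskets["peanut_butter"], "paper_towels"].mean()

# Lift: confidence relative to the overall rate of paper towel purchases.
lift = confidence / baskets["paper_towels"].mean()

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items show up together more often than chance would predict. Deciding whether that matters for the business is still a human call.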

Machine learning, meanwhile, is the process of teaching a computer to learn as humans do. With machine learning, computers learn to estimate probabilities and make predictions based on their analysis of data. While machine learning sometimes uses data mining as part of its process, it ultimately does not require ongoing human involvement (e.g., a self-driving car relies on data mining to determine where to stop, accelerate, and turn).
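As a rough illustration of "learning from data to make predictions", here is about the simplest sketch I can think of: fitting a straight line with NumPy and using it on an unseen input. The numbers are invented, and real machine learning systems are far more involved than this.

```python
import numpy as np

# Hypothetical training data that follows y = 2*x + 1 exactly.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# "Learn" the relationship: least-squares fit of a degree-1 polynomial.
slope, intercept = np.polyfit(x_train, y_train, 1)

# Use the fitted line to predict a value the model has not seen.
x_new = 6.0
prediction = slope * x_new + intercept

print("slope:", slope, "intercept:", intercept, "prediction:", prediction)
```

Once the line is fitted, predictions for new inputs need no further human input, which is the contrast with data mining drawn above.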

taghizadeh598 commented 8 months ago

First step: Have the right data mining tools for the job – install Jupyter, and get familiar with a few modules.

First things first: if you want to follow along, install Jupyter on your desktop. It’s a free platform that essentially provides a processor for IPython notebooks (.ipynb files) and is extremely intuitive to use. Follow these instructions for installation. Everything I do here will be completed in a “Python [Root]” file in Jupyter.

We will be using the Pandas module of Python to clean and restructure our data. Pandas is an open-source module for working with data structures and analysis, and it is ubiquitous among data scientists who use Python. It lets you load data from many common formats and provides a simple platform to organize, sort, and manipulate that data. If this is your first time using Pandas, check out this awesome tutorial on the basic functions!
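If you have never seen Pandas before, here is a minimal sketch of what "clean and restructure" can look like. The table and column names are hypothetical and are not the data set used in this walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing price.
raw = pd.DataFrame(
    {
        "city": ["Seattle", "Austin", "Boston", "Denver"],
        "price": [550_000, np.nan, 720_000, 480_000],
        "sqft": [1800, 1500, 2100, 1600],
    }
)

# Clean: drop rows where the price is missing.
clean = raw.dropna(subset=["price"])

# Restructure: add a derived column, then sort by it (highest first).
clean = clean.assign(price_per_sqft=clean["price"] / clean["sqft"])
clean = clean.sort_values("price_per_sqft", ascending=False).reset_index(drop=True)

print(clean)
```

Each step returns a new DataFrame rather than modifying `raw`, which makes it easy to check intermediate results in a notebook.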

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns
from matplotlib import rcParams

%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib

In the code above I imported a few modules, here’s a breakdown of what they do:

NumPy – a necessary package for scientific computation. It includes an incredibly versatile structure for working with arrays, which are the primary data format that scikit-learn uses for input data.

Matplotlib – the fundamental package for data visualization in Python. This module allows for the creation of everything from simple scatter plots to 3-dimensional contour plots. Note that from matplotlib we import pyplot, which is the highest-order state-machine environment in the module's hierarchy (if that is meaningless to you, don’t worry about it; just make sure you get it imported to your notebook). Using ‘%matplotlib inline’ makes sure that all plots show up in your notebook.

SciPy – a collection of tools for scientific computing in Python, including statistics. Stats is the SciPy module that provides regression analysis functions.
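To show what the stats module offers, here is a small sketch using `scipy.stats.linregress` on made-up data. The arrays are hypothetical (roughly y = 3x + 2 with a bit of fixed noise) and are only meant to show the shape of the call.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical data: y is roughly 3*x + 2, with small fixed deviations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
noise = np.array([0.1, -0.2, 0.05, 0.15, -0.1, 0.0])
y = 3.0 * x + 2.0 + noise

# Simple linear regression of y on x.
result = stats.linregress(x, y)

print(f"slope={result.slope:.3f}")
print(f"intercept={result.intercept:.3f}")
print(f"r_squared={result.rvalue ** 2:.4f}")
```

The returned object also carries a p-value and standard error, if you need them.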

taghizadeh598 commented 8 months ago

import numpy as np

# Create a NumPy array
data = np.array([1, 2, 3, 4, 5])

# Calculate the mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)

print("Mean:", mean)
print("Standard Deviation:", std_dev)
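One thing worth knowing about the snippet above: `np.std` computes the population standard deviation by default (`ddof=0`). If your array is a sample from a larger population, you may want `ddof=1` instead. A minimal sketch of the difference on the same array:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Population standard deviation: divides by N (NumPy's default, ddof=0).
population_std = np.std(data)

# Sample standard deviation: divides by N - 1.
sample_std = np.std(data, ddof=1)

print("Population std:", population_std)
print("Sample std:", sample_std)
```

Pandas defaults to `ddof=1` for `Series.std()`, so the two libraries can give different answers on the same data unless you set `ddof` explicitly.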