taghizadeh598 / Data-mining


Introduction #1

Open taghizadeh598 opened 8 months ago

taghizadeh598 commented 8 months ago

Data mining is most commonly defined as the process of using computers and automation to search large sets of data for patterns and trends, turning those findings into business insights and predictions. Data mining goes beyond the search process, as it uses data to evaluate future probabilities and develop actionable analyses.

Data mining and machine learning are distinct processes that are often treated as synonymous. Although both are useful for detecting patterns in large data sets, they operate very differently.

Data mining is the process of finding patterns in data. The beauty of data mining is that it helps to answer questions we didn’t know to ask by proactively identifying non-intuitive data patterns through algorithms (e.g., consumers who buy peanut butter are more likely to buy paper towels). However, the interpretation of these insights and their application to business decisions still require human involvement.
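
To make the peanut butter example concrete, here is a minimal sketch of one common way to score such a pattern: the "lift" of two items, meaning how much more often they appear together than they would if purchases were independent. The baskets below are made-up toy data, and the helper names `support` and `lift` are illustrative choices rather than anything from a particular library.

```python
# Toy, made-up shopping baskets used only to illustrate the idea.
baskets = [
    {"peanut butter", "paper towels", "bread"},
    {"peanut butter", "paper towels"},
    {"peanut butter", "jelly"},
    {"paper towels", "soap"},
    {"bread", "milk"},
    {"milk", "soap"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def lift(a, b, baskets):
    """Ratio of observed co-occurrence to what independence would predict."""
    return support({a, b}, baskets) / (support({a}, baskets) * support({b}, baskets))


pb_towels_lift = lift("peanut butter", "paper towels", baskets)
print(f"lift = {pb_towels_lift:.2f}")  # lift = 1.33
```

A lift above 1 suggests the items are bought together more often than chance alone would explain. Deciding whether that is worth acting on is still the human step described above.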

Machine learning, meanwhile, is the process of teaching a computer to learn as humans do. With machine learning, computers learn how to determine probabilities and make predictions based on their data analysis. Although machine learning sometimes uses data mining as part of its process, it does not require frequent, ongoing human involvement (e.g., a self-driving car relies on data mining to determine where to stop, accelerate, and turn).

taghizadeh598 commented 8 months ago

First step: Have the right data mining tools for the job – install Jupyter, and get familiar with a few modules.

First things first, if you want to follow along, install Jupyter on your desktop. It’s a free platform that provides what is essentially a processor for IPython notebooks (.ipynb files) and is extremely intuitive to use. Follow these instructions for installation. Everything I do here will be completed in a “Python [Root]” file in Jupyter.

We will be using the Pandas module of Python to clean and restructure our data. Pandas is an open-source module for working with data structures and analysis, one that is ubiquitous among data scientists who use Python. It lets data scientists load data in many formats, and provides a simple platform to organize, sort, and manipulate that data. If this is your first time using Pandas, check out this awesome tutorial on the basic functions!

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns
from matplotlib import rcParams

%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib

In the code above I imported a few modules. Here’s a breakdown of what they do:

NumPy – a necessary package for scientific computation. It includes an incredibly versatile structure for working with arrays, which are the primary data format that scikit-learn uses for input data.

Matplotlib – the fundamental package for data visualization in Python. This module allows for the creation of everything from simple scatter plots to 3-dimensional contour plots. Note that from matplotlib we import pyplot, which is the highest-level state-machine environment in the module’s hierarchy (if that is meaningless to you, don’t worry about it; just make sure you import it into your notebook). Using ‘%matplotlib inline’ ensures that all plots show up in your notebook.

SciPy – a collection of tools for scientific computing in Python. Stats is the SciPy module that provides statistical functions, including regression analysis.

taghizadeh598 commented 8 months ago

import numpy as np

# Create a NumPy array
data = np.array([1, 2, 3, 4, 5])

# Calculate the mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)

print("Mean:", mean)
print("Standard Deviation:", std_dev)

taghizadeh598 commented 7 months ago

import numpy as np
digits = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 9]])
digits
Out: array([[1, 2, 3],
       [4, 5, 6],
       [6, 7, 9]])

taghizadeh598 commented 7 months ago

import numpy as np
table = np.array([
    [5, 3, 7, 1],
    [2, 6, 7, 9],
    [1, 1, 1, 1],
    [4, 3, 2, 0],
])
table.max()
Out: 9

table.max(axis=0)
Out: array([5, 6, 7, 9])
table.max(axis=1)
Out: array([7, 9, 1, 4])

taghizadeh598 commented 7 months ago

import numpy as np
A = np.arange(32).reshape(4, 1, 8)
A
Out: array([[[ 0,  1,  2,  3,  4,  5,  6,  7]],

   [[ 8,  9, 10, 11, 12, 13, 14, 15]],

   [[16, 17, 18, 19, 20, 21, 22, 23]],

   [[24, 25, 26, 27, 28, 29, 30, 31]]])

B = np.arange(48).reshape(1, 6, 8)
B
Out: array([[[ 0,  1,  2,  3,  4,  5,  6,  7],
    [ 8,  9, 10, 11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20, 21, 22, 23],
    [24, 25, 26, 27, 28, 29, 30, 31],
    [32, 33, 34, 35, 36, 37, 38, 39],
    [40, 41, 42, 43, 44, 45, 46, 47]]])

taghizadeh598 commented 7 months ago

A + B
Out: array([[[ 0,  2,  4,  6,  8, 10, 12, 14],
    [ 8, 10, 12, 14, 16, 18, 20, 22],
    [16, 18, 20, 22, 24, 26, 28, 30],
    [24, 26, 28, 30, 32, 34, 36, 38],
    [32, 34, 36, 38, 40, 42, 44, 46],
    [40, 42, 44, 46, 48, 50, 52, 54]],

   [[ 8, 10, 12, 14, 16, 18, 20, 22],
    [16, 18, 20, 22, 24, 26, 28, 30],
    [24, 26, 28, 30, 32, 34, 36, 38],
    [32, 34, 36, 38, 40, 42, 44, 46],
    [40, 42, 44, 46, 48, 50, 52, 54],
    [48, 50, 52, 54, 56, 58, 60, 62]],

   [[16, 18, 20, 22, 24, 26, 28, 30],
    [24, 26, 28, 30, 32, 34, 36, 38],
    [32, 34, 36, 38, 40, 42, 44, 46],
    [40, 42, 44, 46, 48, 50, 52, 54],
    [48, 50, 52, 54, 56, 58, 60, 62],
    [56, 58, 60, 62, 64, 66, 68, 70]],

   [[24, 26, 28, 30, 32, 34, 36, 38],
    [32, 34, 36, 38, 40, 42, 44, 46],
    [40, 42, 44, 46, 48, 50, 52, 54],
    [48, 50, 52, 54, 56, 58, 60, 62],
    [56, 58, 60, 62, 64, 66, 68, 70],
    [64, 66, 68, 70, 72, 74, 76, 78]]])
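
The result above has shape (4, 6, 8) even though neither input does. This is NumPy broadcasting: where one array has size 1 along a dimension, it is stretched to match the other. A minimal sketch that checks the shape and one element by hand, reusing the same `A` and `B` (the name `total` is just an illustrative choice):

```python
import numpy as np

A = np.arange(32).reshape(4, 1, 8)
B = np.arange(48).reshape(1, 6, 8)

# Shapes are compared dimension by dimension; a size-1 dimension is stretched.
result_shape = np.broadcast(A, B).shape
print(result_shape)  # (4, 6, 8)

total = A + B

# Each output element is A[i, 0, k] + B[0, j, k].
print(total[2, 3, 5], A[2, 0, 5] + B[0, 3, 5])  # 50 50
```
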
taghizadeh598 commented 7 months ago

import numpy as np
a = np.array([[1, 2], [3, 4], [5, 6]])
a.T
Out: array([[1, 3, 5],
       [2, 4, 6]])
a.transpose()
Out: array([[1, 3, 5],
       [2, 4, 6]])

taghizadeh598 commented 7 months ago

import numpy as np
data = np.array([[7, 1, 4], [8, 6, 5], [1, 2, 3]])
np.sort(data)
Out: array([[1, 4, 7],
       [5, 6, 8],
       [1, 2, 3]])
np.sort(data, axis=None)
Out: array([1, 1, 2, 3, 4, 5, 6, 7, 8])
np.sort(data, axis=0)
Out: array([[1, 1, 3],
       [7, 2, 4],
       [8, 6, 5]])

taghizadeh598 commented 7 months ago

import numpy as np
a = np.array([[4, 8], [6, 1]])
b = np.array([[3, 5], [7, 2]])
np.hstack((a, b))
Out: array([[4, 8, 3, 5],
       [6, 1, 7, 2]])
np.vstack((b, a))
Out: array([[3, 5],
       [7, 2],
       [4, 8],
       [6, 1]])
np.concatenate((a, b))
Out: array([[4, 8],
       [6, 1],
       [3, 5],
       [7, 2]])
np.concatenate((a, b), axis=None)
Out: array([4, 8, 6, 1, 3, 5, 7, 2])

taghizadeh598 commented 7 months ago

These are just the types that map to existing Python types. NumPy also has types for the smaller-sized versions of each, like 8-, 16-, and 32-bit integers, 32-bit single-precision floating-point numbers, and 64-bit single-precision complex numbers. The documentation lists them in their entirety.

import numpy as np
a = np.array([1, 3, 5.5, 7.7, 9.2], dtype=np.single)
a
Out: array([1. , 3. , 5.5, 7.7, 9.2], dtype=float32)
b = np.array([1, 3, 5.5, 7.7, 9.2], dtype=np.uint8)
b
Out: array([1, 3, 5, 7, 9], dtype=uint8)
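
A smaller type saves memory, but it can also lose information without raising an error. The sketch below (variable names are illustrative) compares bytes per element and shows that converting to `uint8` drops the fractional part. `np.iinfo` reports the range an integer type can hold.

```python
import numpy as np

values = [1, 3, 5.5, 7.7, 9.2]

as_float32 = np.array(values, dtype=np.float32)
as_uint8 = np.array(values).astype(np.uint8)  # fractional parts are dropped

# Bytes used per element for each type.
print(as_float32.itemsize, as_uint8.itemsize)  # 4 1

print(as_uint8.tolist())  # [1, 3, 5, 7, 9]

# Smallest and largest values a uint8 can represent.
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255
```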

taghizadeh598 commented 7 months ago

String Types: Sized Unicode

Strings behave a little strangely in NumPy code because NumPy needs to know how many bytes to expect, which isn’t usually a factor in Python programming. Luckily, NumPy does a pretty good job at taking care of less complex cases for you:

import numpy as np
names = np.array(["bob", "amy", "han"], dtype=str)
names
Out: array(['bob', 'amy', 'han'], dtype='<U3')
names.itemsize
Out: 12

names = np.array(["bob", "amy", "han"])
names
Out: array(['bob', 'amy', 'han'], dtype='<U3')
more_names = np.array(["bobo", "jehosephat"])
np.concatenate((names, more_names))
Out: array(['bob', 'amy', 'han', 'bobo', 'jehosephat'], dtype='<U10')

taghizadeh598 commented 7 months ago

names[2] = "jamima"
names
Out: array(['bob', 'amy', 'jam'], dtype='<U3')
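
The assignment above is cut to three characters without any warning, because the array's dtype only has room for three. One way around this, sketched here under the assumption that ten characters is enough, is to copy the array into a wider string dtype before assigning (`wide_names` is an illustrative name):

```python
import numpy as np

names = np.array(["bob", "amy", "han"])

# Copy into a dtype with room for up to 10 characters per item.
wide_names = names.astype("<U10")
wide_names[2] = "jamima"

print(wide_names.tolist())  # ['bob', 'amy', 'jamima']
print(names.tolist())       # ['bob', 'amy', 'han'] (the original is unchanged)
```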

taghizadeh598 commented 7 months ago

Structured Arrays

Originally, you learned that array items all have to be the same data type, but that wasn’t entirely correct. NumPy has a special kind of array, called a record array or structured array, with which you can specify a type and, optionally, a name on a per-column basis. This makes sorting and filtering even more powerful, and it can feel similar to working with data in Excel, CSVs, or relational databases.

taghizadeh598 commented 7 months ago

Here’s a quick example to show them off a little:

import numpy as np
data = np.array([
    ("joe", 32, 6),
    ("mary", 15, 20),
    ("felipe", 80, 100),
    ("beyonce", 38, 9001),
], dtype=[("name", str, 10), ("age", int), ("power", int)])
data[0]
Out: ('joe', 32, 6)
data["name"]
Out: array(['joe', 'mary', 'felipe', 'beyonce'], dtype='<U10')
data[data["power"] > 9000]["name"]
Out: array(['beyonce'], dtype='<U10')
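
The sorting mentioned earlier works by field name too. As a small sketch on the same records, `np.sort` takes an `order` argument naming the field to sort by, and a boolean mask on one field filters whole rows. The names `by_age` and `adults`, and the cutoff of 18, are illustrative choices.

```python
import numpy as np

data = np.array(
    [("joe", 32, 6), ("mary", 15, 20), ("felipe", 80, 100), ("beyonce", 38, 9001)],
    dtype=[("name", str, 10), ("age", int), ("power", int)],
)

# Sort whole records by the "age" field.
by_age = np.sort(data, order="age")
print(by_age["name"].tolist())  # ['mary', 'joe', 'beyonce', 'felipe']

# Keep only the records whose "age" field is at least 18.
adults = data[data["age"] >= 18]
print(adults["name"].tolist())  # ['joe', 'felipe', 'beyonce']
```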

taghizadeh598 commented 7 months ago

pandas

pandas is a library that takes the concept of structured arrays and builds it out with tons of convenience methods, developer-experience improvements, and better automation. If you need to import data from basically anywhere, clean it, reshape it, polish it, and then export it into basically any format, then pandas is the library for you. It’s likely that at some point, you’ll import pandas as pd at the same time you import numpy as np.
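
As a rough illustration of that relationship, a NumPy structured array can be passed straight to `pd.DataFrame`, and its field names become column names. The records here are the same toy data as in the structured array example; `records`, `df`, and `strongest` are illustrative names.

```python
import numpy as np
import pandas as pd

records = np.array(
    [("joe", 32, 6), ("mary", 15, 20), ("felipe", 80, 100), ("beyonce", 38, 9001)],
    dtype=[("name", str, 10), ("age", int), ("power", int)],
)

# Field names of the structured array become DataFrame columns.
df = pd.DataFrame(records)
print(list(df.columns))  # ['name', 'age', 'power']

# Sorting and summarizing are then single method calls.
strongest = df.sort_values("power", ascending=False).head(2)
print(strongest["name"].tolist())  # ['beyonce', 'felipe']
print(df["age"].mean())            # 41.25
```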

taghizadeh598 commented 7 months ago

scikit-learn

If your goals lie more in the direction of machine learning, then scikit-learn is the next step. Given enough data, you can do classification, regression, clustering, and more in just a few lines.

If you’re already comfortable with the math, then the scikit-learn documentation has a great list of tutorials to get you up and running in Python. If not, then the Math for Data Science Learning Path is a good place to start. Additionally, there’s also an entire learning path for machine learning.

It’s important for you to understand at least the basics of the mathematics behind the algorithms rather than just importing them and running with it. Bias in machine learning models is a huge ethical, social, and political issue.

Throwing data at models without considering how to address the bias is a great way to get into trouble and negatively impact people’s lives. Doing some research and learning how to predict where bias might occur is a good start in the right direction.

taghizadeh598 commented 7 months ago

Matplotlib

No matter what you’re doing with your data, at some point you’ll need to communicate your results to other humans, and Matplotlib is one of the main libraries for making that happen. For an introduction, check out Plotting with Matplotlib. In the next section, you’ll get some hands-on practice with Matplotlib, but you’ll use it for image manipulation rather than for making plots.

taghizadeh598 commented 7 months ago

Practical Example 2: Manipulating Images With Matplotlib

It’s always neat when you’re working with a Python library and it hands you something that turns out to be a basic NumPy array. In this example, you’ll experience that in all its glory.

You’re going to load an image using Matplotlib, realize that RGB images are really just height × width × 3 arrays of unsigned 8-bit (uint8) integers, manipulate those bytes, and use Matplotlib again to save that modified image once you’re done.

taghizadeh598 commented 7 months ago

Create a Python file called image_mod.py, then set up your imports and load the image:

import numpy as np
import matplotlib.image as mpimg

img = mpimg.imread("kitty.jpg")
print(type(img))
print(img.shape)

taghizadeh598 commented 7 months ago

If you run this code, then your friend the NumPy array will appear in the output:

$ python3 image_mod.py
<class 'numpy.ndarray'>
(1299, 1920, 3)

taghizadeh598 commented 7 months ago

It’s an image with a height of 1299 pixels, a width of 1920 pixels, and three channels: one each for the red, green, and blue (RGB) color levels. Want to see what happens when you drop out the R and G channels? Add this to your script:

output = img.copy()  # The original image is read-only!
output[:, :, :2] = 0
mpimg.imsave("blue.jpg", output)
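
If you don't have an image file handy, the same slicing can be tried on a tiny synthetic array. In this sketch, a hypothetical 2 × 3 "image" filled with the value 200 stands in for the loaded picture. The slice `[:, :, :2]` selects every row, every column, and the first two channels (R and G).

```python
import numpy as np

# Hypothetical stand-in for a loaded image: 2 rows, 3 columns, 3 channels.
img = np.full((2, 3, 3), 200, dtype=np.uint8)

output = img.copy()
output[:, :, :2] = 0  # zero the R and G channels, leave B untouched

print(output[0, 0].tolist())       # [0, 0, 200]
print(output.shape, output.dtype)  # (2, 3, 3) uint8
```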

taghizadeh598 commented 7 months ago

Is your mind blown yet? Do you feel the power? Images are just fancy arrays! Pixels are just numbers!

But now, it’s time to do something a little more useful. You’re going to convert this image to grayscale. However, converting to grayscale is more complicated. Averaging the R, G, and B channels and making them all the same will give you an image that’s grayscale. But the human brain is weird, and that conversion doesn’t seem to handle the luminosity of the colors quite right.

In fact, it’s better to see it for yourself. You can use the fact that if you output an array with only one channel instead of three, then you can specify a color map, known as a cmap in the Matplotlib world. If you specify a cmap, then Matplotlib will handle the linear gradient calculations for you.

Get rid of the last three lines in your script and replace them with this:

averages = img.mean(axis=2)  # Take the average of each R, G, and B
mpimg.imsave("bad-gray.jpg", averages, cmap="gray")

taghizadeh598 commented 7 months ago

These new lines create a new array called averages, which is a copy of the img array that you’ve flattened along axis 2 by taking the average of all three channels. You’ve averaged all three channels and outputted something with R, G, and B values equal to that average. When R, G, and B are all the same, the resulting color is on the grayscale.
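
To see why a plain average can look off, compare it with a weighted sum. The sketch below uses the Rec. 601 luma weights (0.299, 0.587, 0.114). That formula is an assumption brought in for illustration and is not named in the text above. The three-pixel array is a hypothetical stand-in for a real image.

```python
import numpy as np

# Hypothetical 1x3 "image": one pure red, one pure green, one pure blue pixel.
img = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)

# Plain average: every pure color collapses to the same gray level.
averages = img.mean(axis=2)
print(averages.tolist())  # [[85.0, 85.0, 85.0]]

# Weighted sum (assumed Rec. 601 weights): green counts most, blue least.
weights = np.array([0.299, 0.587, 0.114])
weighted = img @ weights
print(weighted.round(3).tolist())  # [[76.245, 149.685, 29.07]]
```

With the plain average, the red, green, and blue pixels become indistinguishable. The weighted version keeps green brightest, which is closer to how people perceive those colors.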