WIP: The whole repo is under rework
A project by Martina Pugliese.
This book is a collection of notes on Data Science, from Statistics to Machine Learning, passing through all sorts of related areas.
I've decided to give form to a rather disorderly collection of notes I had about data science & all sorts of related areas, which is how this project came about. You can read more about the hows and whys of this in the Meta page.
This section explains how and why this whole thing started, what it is and how it's done, plus some awesome resources found on the web.
A collection of notes on topics regarding Probability and Statistics and the way to use them to analyse data and draw conclusions.
How do we do Machine Learning? This chapter offers a high-level overview of the techniques and methodologies.
This chapter is pretty much a page for each algorithm in "shallow learning", that is, everything that is not "deep". Neural networks, even shallow ones, are not presented here, as they have a dedicated chapter, the same one that dives into deep learning. The division here is by the main learning paradigms.
This part deals with how to assess the quality of a model and diagnose problems.
Digging into the world of Artificial Neural Networks, a fascinating area of Machine Learning particularly on the rise these days. This deserved its own chapter.
Natural Language Processing (NLP) is the field (a part of Machine Learning) which deals with text, an unstructured data source. What NLP tries to do is put text into numerical representations and extract information from it.
Images, seen by the machine. This section deals with using computers to extract and use information from visual data. We will illustrate a whole set of methods, which may or may not encompass the use of Neural Networks.
Some (non-comprehensive) notes on Computer Science fundamentals.
Some (non-comprehensive) notes on mathematics, used everywhere in data work. Useful little bits.
(Some) software tools used in Data Science, high-level overviews.
Several pages contain snippets of code. I've been using Python (3), and each of those pages links to the related Jupyter notebook in the GitHub repo for this book, should you want to play around. The whole repo is available on [GitHub](https://github.com/martinapugliese/tales-science-data/tree/master), and you can also view the notebooks prettified via the [Jupyter Notebooks viewer](https://nbviewer.jupyter.org/github/martinapugliese/tales-science-data/tree/master/).
The libraries used in the notebooks are usually (unless specified) those of the Python data stack (Numpy, Scipy, sklearn, Pandas, ...). The plots presented here have been customised; the repo contains all the styling files.
Mistakes happen. So do inaccuracies and oversights, from the content point of view to the rendering/graphics one (e.g., a TeX formula that doesn't render). You are more than welcome, encouraged in fact, to submit issues to the repo for these things.
(C) 2017-2024 Martina Pugliese
This book is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).