offchan42 / machine-learning-curriculum

Learn to make machines learn so that you don't have to struggle to program them. The ultimate list.
MIT License

Machine Learning Curriculum

Machine Learning is a branch of Artificial Intelligence dedicated to making machines learn from observational data without being explicitly programmed.

Machine learning and AI are not the same. Machine learning is an instrument in the AI symphony, a component of AI. So what is Machine Learning (ML), exactly? It’s the ability of an algorithm to learn from prior data in order to produce a behavior. ML is teaching machines to make decisions in situations they have never seen.

This curriculum is made to guide you through learning machine learning, recommend tools, and help you embrace the ML lifestyle by suggesting media to follow. I update it regularly to keep it fresh and to remove outdated content and deprecated tools.

Machine Learning in General

Study this section to understand fundamental concepts and develop intuitions before going any deeper.

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
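The definition above can be made concrete with a toy sketch. Everything here is an illustrative assumption, not part of the curriculum: the task T is classifying a number as above or below 0.5, the experience E is a set of labeled examples, and the performance measure P is accuracy on a held-out test set. The helper names (`make_examples`, `learn_threshold`, `accuracy`) are hypothetical.

```python
import random

def make_examples(n, rng):
    """Experience E: n labeled points; the label is 1 when x > 0.5."""
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x > 0.5)) for x in xs]

def learn_threshold(examples):
    """Place the decision threshold midway between the two observed classes."""
    negatives = [x for x, y in examples if y == 0]
    positives = [x for x, y in examples if y == 1]
    # If a class was never observed, fall back to the edge of the [0, 1] range
    highest_negative = max(negatives, default=0.0)
    lowest_positive = min(positives, default=1.0)
    return (highest_negative + lowest_positive) / 2

def accuracy(threshold, examples):
    """Performance measure P: fraction of correct predictions."""
    correct = sum(int(x > threshold) == y for x, y in examples)
    return correct / len(examples)

rng = random.Random(0)
test_set = make_examples(1000, rng)
scores = {}
for n in (2, 20, 200, 2000):
    threshold = learn_threshold(make_examples(n, rng))
    scores[n] = accuracy(threshold, test_set)
    print(f"E = {n:4d} examples -> P = {scores[n]:.3f}")
```

With more experience the learned threshold tends to land closer to 0.5, so the measured accuracy tends to rise. That trend is what "learning" means in the definition.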

Books

Reinforcement Learning

The goal of reinforcement learning is to build a machine that senses its environment and learns a policy, a rule for choosing the action to take in any given state, that maximizes its expected long-term scalar reward.
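As a minimal sketch of this idea, here is tabular Q-learning on a made-up five-state corridor. The environment, the reward of 1 at the rightmost state, and the constants are all assumptions chosen for illustration. The agent explores with uniformly random actions, which works because Q-learning is off-policy.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right
GAMMA = 0.9           # discount factor for future reward
ALPHA = 0.5           # learning rate

def step(state, action):
    """Move along the corridor; reaching the last state pays reward 1."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # q[state][action_index] estimates the long-term reward of that choice
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(2)  # explore uniformly at random
            next_state, reward, done = step(state, ACTIONS[a])
            # Bootstrapped target: immediate reward plus discounted best future value
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy policy: in each non-terminal state pick the action with the highest value
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(policy)
```

With these settings the greedy policy should move right (`+1`) in every state, since that is the shortest path to the only reward.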

Deep Learning

Deep learning is a branch of machine learning where deep artificial neural networks (DNNs), algorithms inspired by the way neurons work in the brain, find patterns in raw data by combining multiple layers of artificial neurons. As the number of layers increases, so does the neural network’s ability to learn increasingly abstract concepts.

The simplest kind of DNN is a Multilayer Perceptron (MLP).
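To show what an MLP computes, here is a forward pass written in plain Python. The weights are hand-picked for illustration rather than learned (an assumption of this sketch). They make a network with one hidden layer compute XOR, a function that a single layer cannot represent.

```python
def relu(values):
    """Activation: keep positive values, zero out the rest."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hypothetical hand-set parameters: 2 inputs -> 2 hidden units -> 1 output
W_HIDDEN = [[1.0, 1.0], [1.0, 1.0]]
B_HIDDEN = [0.0, -1.0]
W_OUT = [[1.0, -2.0]]
B_OUT = [0.0]

def mlp(inputs):
    hidden = relu(dense(inputs, W_HIDDEN, B_HIDDEN))
    return dense(hidden, W_OUT, B_OUT)[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mlp([a, b]))
```

In practice the weights are found by gradient descent and a framework handles the arithmetic, but the structure (weighted sums followed by nonlinearities, stacked in layers) is the same.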

Convolutional Neural Networks

Convolutional neural networks are DNNs that handle grid-structured data such as sound waveforms, images, and videos better than ordinary DNNs. They rely on the assumption that nearby input units are more closely related than distant ones. They also exploit translation invariance by applying the same filter at every position: given an image, for example, it is useful to detect the same kind of edge anywhere in the image. They are sometimes called convnets or CNNs.
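The edge-detection example can be sketched in a few lines. This is a plain-Python cross-correlation (the operation deep learning libraries call convolution) with a hypothetical 1x2 kernel. The same kernel is applied at every position, so moving the edge in the input moves the response in the output.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Responds with 1 where brightness rises from left to right
EDGE_KERNEL = [[-1, 1]]

image = [[0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1]]
shifted = [[0, 0, 0, 1, 1],
           [0, 0, 0, 1, 1]]

print(conv2d(image, EDGE_KERNEL))    # edge response in column 1
print(conv2d(shifted, EDGE_KERNEL))  # same response, one column to the right
```

A real CNN learns many such kernels and stacks them, but each one is shared across the whole input in this way.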

Recurrent Neural Networks

Recurrent neural networks are DNNs that keep an internal state. This lets them process sequences of varying length. They are sometimes called RNNs.
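A minimal sketch of the idea, assuming a single scalar state and hand-picked weights (both simplifications for illustration): the same update rule is applied once per element, so one set of parameters handles sequences of any length, and the final state depends on the order of the inputs.

```python
import math

# Hypothetical hand-picked parameters; a real RNN learns these
W_INPUT, W_STATE, BIAS = 1.0, 0.5, 0.0

def rnn_step(state, x):
    """Mix the new input with the previous state and squash into (-1, 1)."""
    return math.tanh(W_INPUT * x + W_STATE * state + BIAS)

def run_rnn(sequence):
    """Feed the sequence one element at a time, carrying the state forward."""
    state = 0.0
    for x in sequence:
        state = rnn_step(state, x)
    return state

for seq in ([1, 0], [0, 1], [1, 0, 0, 1, 1]):
    print(seq, "->", round(run_rnn(seq), 4))
```

Practical variants such as LSTM and GRU use a vector state and gated updates, but they follow the same loop.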

Best Practices

Tools

Libraries and frameworks that are useful for practical machine learning

Frameworks

Machine learning building blocks

No coding

Gradient Boosting

Models that are used heavily in competitions because of their outstanding generalization performance.
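To give a feel for how boosting works, here is a from-scratch sketch for squared-error regression on one feature. It is a teaching toy under stated assumptions (decision stumps as weak learners, a fixed learning rate, a synthetic `y = x**2` dataset), not how XGBoost, LightGBM, or CatBoost are implemented. Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best predicts the residuals (least squares)."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x < split else right_mean

def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly correct the remaining error."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]
mean_y = sum(ys) / len(ys)
baseline_error = mse(lambda x: mean_y, xs, ys)
boosted_error = mse(boost(xs, ys), xs, ys)
print(f"mean predictor MSE: {baseline_error:.4f}")
print(f"boosted stumps MSE: {boosted_error:.4f}")
```

This only measures training error. The generalization performance mentioned above comes from the regularization, subsampling, and tree-growing strategies that production libraries add on top of this loop.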

Time Series Inference

Time series data need a dedicated feature extraction step before most machine learning models can use them, because most models expect data in a tabular format. Alternatively, you can use model architectures designed for time series, e.g. LSTM, TCN, etc.
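One common way to get that tabular format is a sliding window of lag features. The sketch below is a minimal illustration with a hypothetical helper name (`make_lag_table`) and made-up numbers. Each row holds the previous `n_lags` values as features and the next value as the target.

```python
def make_lag_table(series, n_lags):
    """Turn a 1D series into rows of (lag features, target)."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]   # the n_lags values before time t
        target = series[t]                # the value to predict
        rows.append((features, target))
    return rows

series = [10, 12, 13, 15, 18, 21]
for features, target in make_lag_table(series, n_lags=3):
    print(features, "->", target)
```

The resulting rows can be fed to any tabular model. Dedicated libraries typically add many more features (rolling statistics, seasonality indicators, and so on) on top of plain lags.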

Life Cycle

Libraries that help you develop/debug/deploy the model in production (MLOps). There is more to ML than training the model.

GPU Cloud

Remember that this is an opinionated list. There are bazillions of cloud providers out there. I'm not going to list them all. I'm just going to list the ones that I'm familiar with and I think are good.

Data Storage

Data Wrangling

Data cleaning and data augmentation

Data Orchestration

Data Visualization

Hyperparameter Tuning

Before you begin, please read this blog post to understand the motivation of searching in general: https://www.determined.ai/blog/stop-doing-iterative-model-development

Open your eyes to search-driven development. It will change you. The main benefit is that there are no setbacks: only progress and improvement are allowed. Imagine making progress every day instead of regressing because your new solution doesn't work. This guaranteed progress is what search-driven development will do for you. Apply it to every optimization problem, not just machine learning.

My top opinionated preferences are Determined, Ray Tune, and Optuna because of their parallelization (distributed tuning on many machines), flexibility (they can optimize arbitrary objectives and allow dataset parameters to be tuned), library of SOTA tuning algorithms (e.g. HyperBand, BOHB, TPE, PBT, ASHA, etc.), result visualization and analysis tools, and extensive documentation and tutorials.
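Several of the algorithms named above (HyperBand and ASHA in particular) build on successive halving: try many configurations on a small budget, keep the best fraction, and give the survivors more budget. The sketch below is a plain-Python illustration under stated assumptions. The objective `evaluate` is a made-up stand-in for a training run, and none of this is the API of Determined, Ray Tune, or Optuna.

```python
def evaluate(learning_rate, budget):
    """Hypothetical objective: loss w**2 after `budget` gradient steps from w = 1."""
    w = 1.0
    for _ in range(budget):
        w -= learning_rate * 2 * w   # the gradient of w**2 is 2w
    return w * w

def successive_halving(configs, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round while growing the budget."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]

candidates = [0.001, 0.01, 0.05, 0.1, 0.3, 0.45, 0.8, 1.1]
best = successive_halving(candidates)
print("best learning rate:", best)
```

Poor configurations are dropped after a cheap evaluation, so most of the compute goes to the promising ones. The real libraries add parallel workers, asynchronous promotion, and smarter sampling of new candidates.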

AutoML

Make machines learn without the tedious task of feature engineering, model selection, and hyperparameter tuning that you have to do yourself. Let the machines perform machine learning for you!

Personally, if I have a tabular dataset, I try FLAML and mljar first, especially when I want to get something working fast. If you want to try gradient boosting frameworks such as XGBoost, LightGBM, or CatBoost but don't know which one works best, I suggest trying AutoML first, because internally it tries those same gradient boosting frameworks.

Model Architectures

Architectures that are state-of-the-art in their fields.

Prompt Engineering

Large language models (LLMs) like GPT-3 are powerful, but they need to be prompted to generate the desired output. This is where prompt engineering comes in. Prompt engineering is the process of designing prompts that can be used to generate the desired output.
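As a small illustration of what designing a prompt can look like, the sketch below assembles a few-shot prompt from an instruction, worked examples, and a new query. The `Input:`/`Output:` layout and the helper name `build_prompt` are assumptions made for this example, not a format required by any particular model. No LLM is called, and the resulting string would be sent to whichever model you use.

```python
def build_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts += [f"Input: {text}", f"Output: {label}", ""]
    # End with an open "Output:" so the model completes the answer
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)

prompt = build_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("I loved this course.", "positive"),
     ("The videos were too slow.", "negative")],
    "Clear explanations and great exercises.",
)
print(prompt)
```

Changing the instruction wording, the number of examples, or their order can change the model's output, which is why prompts are worth iterating on like any other part of the system.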

Nice Blogs & Vlogs to Follow

Impactful People

Cutting-Edge Research Publishers

Steal the most recent techniques introduced by smart computer scientists (could be you).

Practitioner Community

Thoughtful Insights for Future Research

Uncategorized

Other Big Lists

I am confused, too many links, where do I start?

If you are a beginner and want to get started with my suggestions, please read this issue: https://github.com/offchan42/machine-learning-curriculum/issues/4

Disclaimer

From now on, this list is going to be compact and opinionated towards my own real-world ML journey, and I will include only content that I think is truly beneficial for me and most people. All the materials and tools that are not good enough (in any aspect) will be gradually removed to combat information overload.

NOTE: There is no particular rank for each link. The order in which they appear does not convey any meaning and should not be treated differently.

How to contribute to this list

  1. Fork this repository, then apply your change.
  2. Make a pull request and tag me if you want.
  3. That's it. If your edit is useful, I'll merge it.

Or you can just submit a new issue containing the resource you want me to include if you don't have time to send a pull request.

The resource you want to include should be free to study.

