Conda Environments for Data Science
This repository contains conda environment configuration files for Data Engineering, Data Science (including Machine Learning), and related projects. The goal is to establish standardised environments that can be easily shared across multiple projects, reducing the number of virtual environments on each computer and facilitating collaboration among team members.
New environments are created when packages are added, removed, or upgraded. Once an environment specification is defined and published in a YAML file, we consider it immutable. Any changes require creating a new environment to ensure environments are reproducible and maintainable.
The naming convention for environments follows the format E (for Environment) followed by a three-digit number in sequence, continuing from the most recent environment. For example, E001 is followed by E002.
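As a rough illustration of the numbering rule, the sketch below derives the next name from a list of existing ones. The function name `next_env_name` is hypothetical and not part of this repository; it assumes names always match `E` plus exactly three digits.

```python
import re

# Hypothetical helper: not part of this repository.
ENV_NAME = re.compile(r"^E(\d{3})$")


def next_env_name(existing):
    """Return the next sequential name, e.g. E003 after E001 and E002."""
    numbers = []
    for name in existing:
        match = ENV_NAME.match(name)
        if match:  # ignore anything that is not E + three digits
            numbers.append(int(match.group(1)))
    highest = max(numbers, default=0)
    return f"E{highest + 1:03d}"


print(next_env_name(["E001", "E002"]))  # E003
```

Taking the maximum rather than the list length keeps the result correct even if some earlier numbers are missing from the list.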
I'm new here, where do I start?
If you are part of a team, select an environment that suits your needs.
The answer depends on what you want to achieve and which packages you need, so below are some examples and ideas of where to start. Most environments contain Jupyter, pandas and numpy.
Data collection and handling: this environment contains various libraries to obtain data from a range of sources (make sure to check the terms of third-party sites before using them).
Docker Install
If you are setting up the environment in a Docker container you might find these Docker install instructions helpful.
Conda Environments
The yml folder contains the Anaconda environment yml files and a README that describes their contents at a high level.
How do I add my own changes to an environment?
Once created and published, we consider environments to be immutable. When making any alterations or additions, please submit the new environment via a pull request, numbered sequentially after the highest-numbered environment below.
I want to get serious about deep learning with PyTorch
[E041] provides PyTorch.
More info
Each environment is numbered sequentially to allow for versioning and easy tracking.
For example, environment 001 is called E001.
Most of the environments are created by importing a previous environment and then updating existing packages or adding new ones.
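The idea of deriving one environment from another can be sketched as a small list operation. This is only an illustration: `derive_packages` and the base package list are hypothetical, and real environments are built with conda rather than with this function.

```python
def derive_packages(base, added=(), removed=()):
    """Build a new package list from a previous environment's list.

    Order is preserved and each package appears only once.
    """
    removed_set = set(removed)
    seen = set()
    result = []
    for pkg in list(base) + list(added):
        if pkg in removed_set or pkg in seen:
            continue
        seen.add(pkg)
        result.append(pkg)
    return result


# Hypothetical base list, used only for illustration.
base = ["numpy", "pandas", "jupyter"]
print(derive_packages(base, added=["pytorch", "seaborn", "numpy"]))
```

Because published environments are treated as immutable, the result would be published under a new name rather than replacing the base.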
YML Files
The YML files are named after the environment, followed by the operating system (e.g. windows), or "generic" if OS-specific packages have been removed and the packages should work on both Windows and Linux.
Useful commands
Create a Conda Environment with default imports:
conda create --name [env]
Create an environment from a yml file
conda env create -f [filename].yml
Clone an existing conda environment
conda create --name [new_env] --clone [existing_env]
Update all packages in an environment
conda update --all
Export YML from an existing environment
conda env export -n [venv] > [filename].yml
Remove an existing conda environment
conda env remove -n [venv]
List all discoverable environments:
conda info --envs
Add an ENV kernel to Jupyter (when needed)
python -m ipykernel install --user
More information can be found at
https://conda.io/docs/user-guide/tasks/manage-environments.html
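When these commands need to be scripted, one option is to build them as argument lists and hand them to `subprocess.run`. The sketch below only constructs and prints the commands and does not call conda. The helper names and the example file and environment names are hypothetical.

```python
import shlex


def create_from_yml(filename):
    """Arguments for creating an environment from a yml file."""
    return ["conda", "env", "create", "-f", filename]


def clone_env(new_name, existing_name):
    """Arguments for cloning an existing environment under a new name."""
    return ["conda", "create", "--name", new_name, "--clone", existing_name]


def export_env(env_name):
    """Arguments for exporting an environment (the YAML goes to stdout)."""
    return ["conda", "env", "export", "-n", env_name]


# Hypothetical names, used only for illustration.
for cmd in (
    create_from_yml("environment.yml"),
    clone_env("E045", "E044"),
    export_env("E044"),
):
    print(shlex.join(cmd))
```

Keeping the commands as lists avoids shell quoting problems if a file name contains spaces. The export command writes to standard output, so the caller would still need to redirect or capture it.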
Add Channel
conda config --add channels [channel]
Using Mamba
conda install mamba -n base -c conda-forge
mamba install [package_name] -c conda-forge
Windows to Linux Differences
Regex to find the hashes at the end of the yml files:
=[A-Z]+.*$
Removing the hashes from a yml file aids the imports into Linux where the compiled hash values are different.
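A minimal sketch of applying that regex with Python's `re` module follows. It assumes the pattern is meant to be matched case-insensitively and line by line, since build strings are usually lowercase. The sample yml text and its build strings are made up for illustration.

```python
import re

# The regex from above, applied per line and ignoring case (an assumption).
BUILD_SUFFIX = re.compile(r"=[A-Z]+.*$", re.IGNORECASE | re.MULTILINE)


def strip_build_strings(yml_text):
    """Remove the trailing '=<build>' part from each dependency line."""
    return BUILD_SUFFIX.sub("", yml_text)


# Made-up sample, for illustration only.
sample = """name: example
dependencies:
  - numpy=1.19.5=py39h06a4308_0
  - pillow=8.0.1=h4fa10fc_0
  - pip:
    - tokenizers==0.13
"""

print(strip_build_strings(sample))
```

The pattern only matches build strings that start with a letter, so a build string that starts with a digit would be left in place. As an alternative, `conda env export --no-builds` omits build strings at export time.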
The following packages appear to be Windows-specific:
The following packages appear to be *nix-specific:
Package Conflicts
- torchvision 0.2.2 does not support pillow 7+ due to removal of PILLOW_VERSION. See Github Issue
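A conflict like this can be caught before building an environment with a simple version check. The sketch below is illustrative only: the function names are hypothetical, and the parser assumes purely numeric dotted versions such as 6.2.1.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '6.2.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def has_known_conflict(torchvision_version, pillow_version):
    """True for the torchvision 0.2.2 / pillow 7+ combination noted above."""
    return (
        parse_version(torchvision_version) == (0, 2, 2)
        and parse_version(pillow_version) >= (7,)
    )


print(has_known_conflict("0.2.2", "7.0.0"))  # True
print(has_known_conflict("0.2.2", "6.2.1"))  # False
```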
Environments
MinkowskiEngine
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"
E044
Environment designed to support machine learning research, including data exploration, pre-processing, model development, training, and deployment. It includes PyTorch, MLflow (for experiment tracking and reproducibility), NumPy, pandas, Dask, and Matplotlib (for visualisation), among others.
Channels:
- pytorch
- nvidia
- huggingface
- conda-forge
- anaconda
- defaults
To provide:
- python
- pytorch 2.3.0
- mlflow
- numpy
- pillow
- pandas
- dask
- pyarrow
- fastparquet
- pandas-profiling
- xlrd
- sqlite
- matplotlib
- jupyterlab
- jupyter_contrib_nbextensions
- ipywidgets
- widgetsnbextension
- nodejs
- graphviz
- accelerate
- kornia
- wandb
- tqdm
- webdataset
- munch
- onnxruntime
- einops
- onnx2torch
E043
Copy of E041 with updated packages and HuggingFace transformers added, for use with gpt-neo.
To provide:
- pytorch=1.11
- python=3.9
- cuda=11.5
- cudnn=8.3.2
- pip:
  - tokenizers==0.13
  - transformers==4.23
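To show how a list like this maps onto a conda environment file, here is a small sketch that renders the text of such a file. The `render_environment` function is hypothetical. The channel list is also an assumption, taken from E041 because E043 is described as a copy of it. The published yml file remains the authoritative specification.

```python
def render_environment(name, channels, dependencies, pip_packages=()):
    """Render a minimal conda environment file as text."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in dependencies]
    if pip_packages:
        # pip packages are nested under a 'pip' entry in the dependencies
        lines.append("  - pip:")
        lines += [f"    - {pkg}" for pkg in pip_packages]
    return "\n".join(lines) + "\n"


text = render_environment(
    name="E043",
    channels=["pytorch", "conda-forge", "anaconda"],  # assumed from E041
    dependencies=["pytorch=1.11", "python=3.9", "cuda=11.5", "cudnn=8.3.2"],
    pip_packages=["tokenizers==0.13", "transformers==4.23"],
)
print(text)
```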
E042
Created from scratch using channels:
- conda-forge
- anaconda
To provide:
- mamba
- python
- openjdk
- ipykernel
- newspaper3k
- numpy
- alpha_vantage
- yfinance
- pandas-datareader
- pandas
- jupyterlab
- matplotlib
- seaborn
- fastparquet
- pandas-profiling
- graphviz
- dask
- nodejs
- sqlite
- plotly
- quandl
- scipy
- xlrd
- h5py
- scikit-image
- scikit-learn
- pillow
- requests
- youtube-dl
- mlflow
- pyarrow
- beautifulsoup4
- indexed_gzip
- urllib3
- pytrends
- pyautogui
- black
- pyspark=3.2
E041
Created from scratch using channels:
- pytorch
- conda-forge
- anaconda
To provide:
- pytorch
- torchaudio
- torchvision
- python
- opencv
- mlflow
- ax-platform
- botorch
- gpytorch
- pillow
- tensorboard
- tensorboardx
- databricks-cli
- pycocotools
- indexed_gzip
- numpy
- pandas
- dask
- pyarrow
- fastparquet
- h5py
- pandas-profiling
- xlrd
- sqlite
- matplotlib
- jupyterlab
- scipy
- scikit-learn
- scikit-image
- nodejs
- graphviz
- seaborn
- jupyter_contrib_nbextensions
- ipywidgets
- widgetsnbextension
- openblas-devel
E040
Created from scratch using channels:
- pytorch
- conda-forge
- anaconda
- acellera
To provide:
- pytorch=1.10
- torchaudio
- torchvision=0.11
- python=3.9
- mlflow=1.20
- ax-platform
- botorch
- gpytorch
- pillow
- tensorboard
- tensorboardx
- databricks-cli
- pycocotools
- indexed_gzip
- numpy
- pandas
- dask
- pyarrow
- fastparquet
- h5py
- pandas-profiling
- xlrd
- sqlite
- matplotlib
- jupyterlab
- scipy
- scikit-learn
- scikit-image
- nodejs
- graphviz
- seaborn
- jupyter_contrib_nbextensions
- ipywidgets
- widgetsnbextension
E039
Created from scratch using channels:
- pytorch
- conda-forge
- anaconda
To provide:
- pytorch=1.10
- torchaudio
- torchvision=0.11
- python=3.9
- mlflow=1.20
- pillow
- tensorboard
- tensorboardx
- databricks-cli
- pycocotools
- indexed_gzip
- numpy
- pandas
- dask
- pyarrow
- fastparquet
- h5py
- pandas-profiling
- xlrd
- sqlite
- matplotlib
- jupyterlab
- scipy
- scikit-learn
- scikit-image
- nodejs
- graphviz
- seaborn
- jupyter_contrib_nbextensions
- ipywidgets
- widgetsnbextension
E038
Created from scratch to provide fastai 2.4 and fastbook
Using channels:
- pytorch
- conda-forge
- anaconda
Packages:
- mamba
- python=3.9
- mlflow=1.18
- pillow
- tensorboardx
- databricks-cli
- indexed_gzip
- dask
- numpy
- pandas
- matplotlib
- jupyterlab
- scikit-learn
- sqlite
- xlrd
- plotly
- nodejs
- graphviz
- seaborn
E037
Created from scratch to provide pytorch 1.8.1
Using channels:
- pytorch
- conda-forge
- anaconda
Packages:
- mamba
- python=3.9
- pytorch=1.8.1
- torchvision=0.9.1
- mlflow=1.16
- pillow
- tensorboard
- tensorboardx
- databricks-cli
- pycocotools
- indexed_gzip
- dask
- pandas-profiling
- pyarrow
- numpy
- pandas
- matplotlib
- jupyterlab
- scikit-learn
- scikit-image
- sqlite
- xlrd
- plotly
- nodejs
- graphviz
- fastparquet
- seaborn
- scipy
- h5py
- jupyter_contrib_nbextensions
E036
Created from scratch to create QR codes
conda-forge:
- segno
- qrcode-artistic
- pillow
- jupyterlab
- nodejs
E035
Created from scratch as a data handler:
- mamba (conda-forge)
- newspaper3k (conda-forge)
- pyautogui (conda-forge)
- numpy (conda-forge)
- pandas (conda-forge)
- jupyterlab (conda-forge)
- matplotlib (conda-forge)
- h5py (conda-forge)
- scikit-image (conda-forge)
- scikit-learn (conda-forge)
- pillow (conda-forge)
- requests (conda-forge)
- youtube-dl (conda-forge)
- mlflow=1.14 (conda-forge)
- pyarrow (conda-forge)
- beautifulsoup4 (conda-forge)
- indexed_gzip (conda-forge)
- xlrd (conda-forge)
- quandl (conda-forge)
- urllib3 (conda-forge)
- scipy (conda-forge)
- pytrends (conda-forge)
- plotly (conda-forge)
- sqlite (conda-forge)
- databricks-cli (conda-forge)
- nodejs (conda-forge)
- pandas-profiling (conda-forge)
- graphviz (anaconda)
- dask (anaconda)
- fastparquet (anaconda)
- seaborn
E034
Created from scratch using mamba to get TF 2.4.1 for GPU.
- mamba
- tensorflow-gpu=2.4 (anaconda) (available for linux only as of 2021-03-10)
- tensorboard=2.4.1 (conda-forge)
- mlflow=1.14 (conda-forge)
- pyarrow (conda-forge)
- indexed_gzip (conda-forge)
- xlrd (conda-forge)
- scipy (conda-forge)
- plotly (conda-forge)
- sqlite (conda-forge)
- databricks-cli (conda-forge)
- nodejs (anaconda)
- numpy=1.19.5 (anaconda) (version needed by TF)
- pandas (anaconda)
- jupyterlab (anaconda)
- matplotlib (anaconda)
- h5py (anaconda)
- scikit-image (anaconda)
- scikit-learn (anaconda)
- pillow (anaconda)
E033
New env from scratch to provide data wrangling and collection tools
- graphviz (anaconda)
- dask (anaconda)
- pandas-profiling (anaconda)
- fastparquet (anaconda)
- seaborn (anaconda)
- conda-forge (conda-forge)
- nodejs (conda-forge)
- newspaper3k (conda-forge)
- numpy (conda-forge)
- pandas (conda-forge)
- jupyterlab (conda-forge)
- matplotlib (conda-forge)
- h5py (conda-forge)
- scikit-image (conda-forge)
- scikit-learn (conda-forge)
- pillow (conda-forge)
- requests (conda-forge)
- youtube-dl (conda-forge)
- mlflow=1.13 (conda-forge)
- pyarrow (conda-forge)
- beautifulsoup4 (conda-forge)
- indexed_gzip (conda-forge)
- xlrd (conda-forge)
- quandl (conda-forge)
- urllib3 (conda-forge)
- scipy (conda-forge)
- pytrends (conda-forge)
- plotly (conda-forge)
- sqlite (conda-forge)
- databricks-cli (conda-forge)
- alpha-vantage (pip)
- tinysegmenter (pip)
E032
Built from scratch (similar to E028) to provide:
- pytorch=1.7
- torchvision=0.8
- pytorch-lightning
- mlflow=1.13
- pillow=6.2.1
- pandas-profiling
- dask
- pyarrow
- numpy
- pandas
- matplotlib
- jupyterlab
- tensorboardx
- scikit-learn
- scikit-image
- scipy
- h5py
- sqlite
- databricks-cli
- pycocotools (as of 2021-01-15 only available for Linux; can be used for computing the IoU evaluation metrics)
E031
Built from scratch for use with labelme
E030
Built from scratch to provide:
- tensorflow-gpu=2.3 (as of 2021-01-07, tensorflow-gpu=2.3 is only available for Windows)
- tensorboard=2.3
- keras=2.4.3
- scikit-image
- scikit-learn
- scipy
- jupyterlab
- h5py=2.10
- dask=2.3
- pillow=8
- pandas-profiling
- pyarrow
- numpy
- pandas
- mlflow=1.13
- Databricks-cli=0.9
- matplotlib
E029
Clone of E026 with xlrd added
E028
Built from scratch to provide:
- pytorch=1.7
- torchvision=0.8
- pytorch-lightning
- mlflow=1.12
- pillow=6.2.1
- pandas-profiling
- dask
- pyarrow
- numpy
- pandas
- matplotlib
- jupyterlab
- tensorboardx
- scikit-learn
- scikit-image
- scipy
- h5py
- sqlite
- databricks-cli
E027
Intended for using fastai and MLflow together
- fastai=1.0
- pytorch
- mlflow=1.12
- pillow=8
- pandas-profiling
- dask
- pyarrow
- numpy
- pandas
- matplotlib
- jupyterlab
- tensorboardx
- scikit-learn
- scikit-image
- scipy
- h5py
- databricks-cli
- sqlite
E026
Built from scratch as data handling env to work with Delta Lake and Apache Spark (Pyspark)
- Pyarrow
- Pandas
- Numpy
- Jupyterlab
- Matplotlib
- H5py
- Scikit-image
- Scikit-learn
- MLFlow=1.12
- Pillow=8.0
- youtube-dl
- pandas-profiling
- dask
- beautifulsoup4
- indexed_gzip
- urllib3
- pytrends
- alpha-vantage
- quandl
- plotly
- scipy
- seaborn
- newspaper3k
E025
Built from scratch to provide:
- Tensorflow CPU=2.3 (tensorflow-gpu 2.2 is highest version available on anaconda channel at time of env creation)
- Tensorboard=2.3
- MLFlow=1.11
- Databricks-cli=0.9
- Dask=2.3
- Pillow=8
- Pandas Profiling
- Pyarrow
- Numpy
- Pandas
- Jupyterlab
- Matplotlib
- H5py
- Scikit-image
- Scikit-learn
- Scipy
E024
Built from scratch to provide:
- FastAI=2.1
- Pytorch=1.6
- Torchvision
- TensorboardX=2.1
- MLFlow
- Databricks-cli
- Dask
- Pillow
- Pandas Profiling
- Pyarrow
- Numpy
- Pandas
- Jupyterlab
- Matplotlib
- H5py
- Scikit-image
- Scikit-learn
- Scipy
E023
Built from scratch to provide:
- Tensorflow GPU 2.1
- Tensorboard 2.2
- Keras GPU 2.3
- MLFlow 1.11
- Databricks-cli
- Jupyterlab 2.2
- Numpy 1.19
- Pandas 1.1
- Matplotlib 3.3
- Pillow 8.0
E022
Built from scratch to provide:
- Pytorch GPU 1.6
- MLFlow 1.11
- TensorboardX 2.1
- Jupyterlab 2.2
- Databricks-cli
- Numpy 1.19
- Pandas 1.1
- Torchvision 0.7
- Matplotlib 3.3
- Pillow 8.0
E021
Rebuilt based on E020 to provide:
- Tensorflow GPU 2.1
- Pytorch GPU 1.3
- MLFlow
E020
Based on a clone of E019 with the alpha-vantage and quandl packages added, plus tensorflow 2.1.0 GPU and pytorch CPU.
E019
Clone of E018 with wandb added via pip
E018
Clone of E017 with update to all packages, and pytorch forced to version 1.4 resulting in CUDA tools to 10.1
E017
Clone of E015 with packages to aid data acquisition and handling:
E016
Freshly created environment to allow TensorFlow 2 GPU to be used.
It supplies:
- TensorFlow 2
- Pillow
- pyarrow
- pandas-profiling
- dask
- scikit-image
- scikit-learn
- scipy
E015
Brand new environment (not cloned from another) with unconstrained versions. It expands on E014 and is currently working with Deoldify.
- Use pip install ffmpeg-python as the conda version is not working at time of writing
- jupyterlab
- pytorch
- matplotlib
- xlrd
- seaborn
- plotly
- numpy
- pandas-profiling
- pandas
- scikit-image
- scikit-learn
- scipy
- pyarrow
- pillow
- dask
- fastai
- pydotplus
- py-xgboost
- cufflinks-py
- keras-gpu
- nvidia-ml-py3
- ffmpeg
- opencv
- youtube-dl
E014
Brand new environment (not cloned from another) with unconstrained versions
- jupyter
- jupyterlab
- pytorch
- matplotlib
- xlrd
- seaborn
- plotly
- numpy
- pandas-profiling
- pandas
- scikit-image
- scikit-learn
- scipy
- pyarrow
- pillow
- dask
- fastai
- pydotplus
- py-xgboost
- cufflinks-py
- keras-gpu
E013
- Clone from E012
- pyarrow
- cufflinks-py
E012
- Clone from E010
- pydotplus
- py-xgboost
- fastai
- update:
E011
Intended as a generic environment for basic Analytics and Data Science
- plotly
- numpy
- pandas
- seaborn
- matplotlib
- jupyterlab
- xlrd
- scikit-image
- scikit-learn
- pillow
- scipy
- dask
- tensorflow (cpu)
- pandas-profiling
E010
- Clone from E009
- pandas-profiling 2.3
- update:
- tensorflow 1.13.1 --> 1.14.0
- bokeh 1.2.0 --> 1.3.4
- cudnn 7.3.1 --> 7.6.0
- dask 2.0 --> 2.2
- jupyterlab 0.35.6 --> 1.0.5
- numpy 1.16.3 --> 1.16.4
- pandas 0.24.2 --> 0.25.0
- scikit-learn 0.21.2 --> 0.21.3
- scipy 1.2.1 --> 1.3.0
- pillow 6.0.0 --> 6.1.0
E009
- All from E008
- plotly
- jupyterlab
E008
E007
- All from E002
- xlrd
- scikit-image
E006
- Python 3.6
- Seaborn
- xlrd
- Jupyter
- tensorflow=1.6
- pip imports are not working, so import E006 and then run pip install opencv-contrib-python
E005
- All from E002
- xlrd
- pip imports are not working, so import E005 and then run pip install opencv-contrib-python
E004
E003
- All from E002
- pyro-ppl
- pip imports are not working, so import E003 and then run pip install pyro-ppl
E002
- All from E001
- Pytorch
- Seaborn
E001