mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Improving package setup to pyproject.toml #803

Open IFFranciscoME opened 1 week ago

IFFranciscoME commented 1 week ago

Description

Improve the packaging setup by migrating from setup.py to a pyproject.toml-based approach.

Context

A Python project is commonly installed in one of two ways: directly (typically by cloning the repository) or in a compacted form (as a built package and/or a container image). In either case, the same version of the software should provide the same functionality on any system, regardless of how it was installed.

Problem

When the user takes the package/container route to install the software, dependency issues arise that are not straightforward to resolve in some cases (more remain to be mapped). A non-exhaustive list of these problems:

Elements for the solution (Draft)

In general, it might be a good opportunity to update the packaging from a setup.py-oriented approach to a pyproject.toml approach.

PEP 518 – Specifying Minimum Build System Requirements for Python Projects

This PEP specifies how Python software packages should specify what build dependencies they have in order to execute their chosen build system. As part of this specification, a new configuration file is introduced for software packages to use to specify their build dependencies (with the expectation that the same configuration file will be used for future configuration details).
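A minimal sketch of what a PEP 518-style build-system declaration could look like for this project (package name, version, and dependencies below are illustrative placeholders, not taken from the repository):

```toml
# Hypothetical pyproject.toml sketch -- all names and versions are placeholders.
[build-system]
requires = ["setuptools>=64", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "algorithmic-efficiency"   # placeholder; the actual distribution name may differ
version = "0.1.0"                 # placeholder
requires-python = ">=3.8"
dependencies = [
  "absl-py",                      # illustrative; the real list lives in setup.py today
]
```

With this in place, `pip install .` can build the project through the declared backend without relying on an implicit setuptools installation.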

Externally Managed Environments

This allows a Python installation to indicate to Python-specific tools, such as pip, that they should neither install nor remove packages in the interpreter's default installation environment.
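The practical consequence of PEP 668 can be sketched as follows (a hypothetical session; the exact error text varies by distribution). On a system whose Python is marked externally managed, `pip install <pkg>` against the system interpreter is refused with `error: externally-managed-environment`, and the standard remedy is a virtual environment:

```shell
# Sketch: working around an externally managed system Python (PEP 668).
# Create an isolated environment that pip is allowed to modify:
python3 -m venv .venv
# Use the interpreter (and pip) inside the venv, unaffected by the marker:
./.venv/bin/python -m pip --version
```

Documenting this in the install instructions would help users on recent Debian/Ubuntu/Fedora systems, where the system interpreter ships with the PEP 668 marker file.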

Some other details might be useful:

IFFranciscoME commented 5 days ago

Ok, @priyakasimbeg, here is my proposal of more actionable items to start the first phase of refactoring. It actually concerns the dataset installation/downloading process, as a step prior to the overall project.

Problems:

Improvement opportunities:

  1. Move from a monolithic dataset config to a per-dataset config.
  2. Move from the dataset_setup.py config to pyproject.toml-based logic.
  3. Extend/update with the following:
    1. Keep the local ~/data/ and ~/temp/data folder creation.
    2. Define a pyproject.toml file covering all datasets.
    3. Within pyproject.toml, specify a dependency list for each dataset.
    4. Create a sub-folder per dataset:
      1. Create/relocate/expand the downloading, pre-processing, and EDA scripts.
      2. Add a README.md for each dataset with some of the following:
        1. Official name, creator, license.
        2. Data and file structure.
        3. Exact full size (decompressed, worst-case scenario).
        4. Other extra details.
  4. Environment and execution complements:
    1. (Good to have) Instructions to avoid terminal locking (a new tab, capped max resources, tmux).
    2. (Good to have) Progress indication in the terminal.
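Item 3.3 above (a per-dataset dependency list in pyproject.toml) could be expressed with optional-dependency extras, so users install only what the datasets they need require. A sketch, where the dataset names and packages are purely illustrative:

```toml
# Hypothetical sketch: one extra per dataset, e.g.
#   pip install "algorithmic-efficiency[librispeech]"
[project.optional-dependencies]
wmt = ["tensorflow-datasets", "sentencepiece"]       # illustrative packages
librispeech = ["pydub"]                              # illustrative packages
full = ["algorithmic-efficiency[wmt,librispeech]"]   # aggregate extra pulling in all datasets
```

This keeps the base install light while making each dataset's requirements explicit and independently installable, which directly supports the per-dataset sub-folder structure proposed in item 3.4.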