mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Improving package setup to pyproject.toml #803

Open IFFranciscoME opened 1 week ago

IFFranciscoME commented 1 week ago

Description

Improve the packaging setup by migrating from setup.py to a pyproject.toml-based approach.

Context

A Python project is commonly installed in one of two ways: directly (typically by cloning the repository) or in a compacted form (as a built package and/or a container image). In either case, the same version of the software should provide the same functionality on any system, regardless of how it was installed.

Problem

When the user takes the package/container route to install the software, dependency issues arise that are not straightforward to resolve in some cases (more remain to be mapped). A non-exhaustive list of these problems:

Elements for the solution (Draft)

In general, it might be a good opportunity to update the packaging from a setup.py-oriented approach to a pyproject.toml approach.

PEP 518 – Specifying Minimum Build System Requirements for Python Projects

This PEP specifies how Python software packages should specify what build dependencies they have in order to execute their chosen build system. As part of this specification, a new configuration file is introduced for software packages to use to specify their build dependencies (with the expectation that the same configuration file will be used for future configuration details).
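A minimal sketch of what a PEP 518-style build-system declaration could look like for this project (package name, version, and dependencies below are illustrative placeholders, not taken from the repository):

```toml
# Hypothetical pyproject.toml sketch -- all names and versions are placeholders.
[build-system]
requires = ["setuptools>=64", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "algorithmic-efficiency"   # placeholder; the actual distribution name may differ
version = "0.1.0"                 # placeholder
requires-python = ">=3.8"
dependencies = [
  "absl-py",                      # illustrative; the real list lives in setup.py today
]
```

With this in place, `pip install .` can build the project through the declared backend without relying on an implicit setuptools installation.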

Externally Managed Environments

This allows a Python installation to indicate to Python-specific tools, such as pip, that they should neither install nor remove packages in the interpreter's default installation environment.
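The practical consequence of PEP 668 can be sketched as follows (a hypothetical session; the exact error text varies by distribution). On a system whose Python is marked externally managed, `pip install <pkg>` against the system interpreter is refused with `error: externally-managed-environment`, and the standard remedy is a virtual environment:

```shell
# Sketch: working around an externally managed system Python (PEP 668).
# Create an isolated environment that pip is allowed to modify:
python3 -m venv .venv
# Use the interpreter (and pip) inside the venv, unaffected by the marker:
./.venv/bin/python -m pip --version
```

Documenting this in the install instructions would help users on recent Debian/Ubuntu/Fedora systems, where the system interpreter ships with the PEP 668 marker file.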

Some other details might be useful:

IFFranciscoME commented 5 days ago

Ok, @priyakasimbeg, here is my proposal of more actionable items to start the first phase of refactoring. It actually concerns the dataset installation/downloading process, as a step prior to the overall project.

Problems:

Improvement opportunities:

  1. Move from a monolithic dataset config to a per-dataset config.
  2. Move from the dataset_setup.py config to pyproject.toml-based logic.
  3. Extend/update with the following:
    1. Keep the local ~/data/ and ~/temp/data folder creation.
    2. Define a pyproject.toml file covering all datasets.
    3. Within pyproject.toml, specify a dependency list for each dataset.
    4. Create a sub-folder per dataset:
      1. Create/relocate/expand the downloading, pre-processing, and EDA scripts.
      2. Add a README.md for each dataset with some of the following:
        1. Official name, creator, license.
        2. Data and file structure.
        3. Exact full size (decompressed, worst-case scenario).
        4. Other extra details.
  4. Environment and execution complements:
    1. (Good to have) Instructions to avoid terminal locking (a new tab, capped max resources, tmux).
    2. (Good to have) Progress indication in the terminal.
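Item 3.3 above (a per-dataset dependency list in pyproject.toml) could be expressed with optional-dependency extras, so users install only what the datasets they need require. A sketch, where the dataset names and packages are purely illustrative:

```toml
# Hypothetical sketch: one extra per dataset, e.g.
#   pip install "algorithmic-efficiency[librispeech]"
[project.optional-dependencies]
wmt = ["tensorflow-datasets", "sentencepiece"]       # illustrative packages
librispeech = ["pydub"]                              # illustrative packages
full = ["algorithmic-efficiency[wmt,librispeech]"]   # aggregate extra pulling in all datasets
```

This keeps the base install light while making each dataset's requirements explicit and independently installable, which directly supports the per-dataset sub-folder structure proposed in item 3.4.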