populse / populse_mia

Multiparametric Image Analysis

[general need] Move data in a separate github repository #260

Closed denisri closed 1 year ago

denisri commented 2 years ago

The populse_mia project on GitHub contains a few MB of code and about 1 GB of test / demo data. The data more than doubles in size on disk when the repository is cloned, so a populse_mia clone is about 2.5 GB. That is too much, and it slows down the tests that involve cloning a fresh copy of the repository. I think we should move the data into a separate repository, maybe "mia_data", "populse_mia_data", or "mia_test_data"... We could use git lfs for that, and we would have to clean the git history of populse_mia to actually reclaim the space taken by the data.
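Deleting files in a new commit does not shrink a clone, because the old blobs stay reachable from the history. To see which files weigh the most across all commits, something along these lines could be used. This is only a sketch: largest_blobs is a hypothetical helper name, not something that exists in populse_mia, and it relies only on standard git plumbing commands.

```shell
#!/bin/sh
# largest_blobs REPO [N]  (hypothetical helper, for illustration only)
# Prints "SIZE_IN_BYTES PATH" for the N largest blobs reachable from any ref.
largest_blobs() {
    repo=$1
    n=${2:-10}
    # List every object reachable from any ref, together with its path.
    git -C "$repo" rev-list --objects --all |
        # Ask git for the type and size of each object, keeping the path.
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        # Keep only file contents (blobs), drop the type column.
        grep '^blob ' |
        cut -d' ' -f2- |
        # Largest first.
        sort -rn |
        head -n "$n"
}

# Example: largest_blobs populse_mia 20
```

Files already tracked with git lfs show up here only as small pointer files, so this mostly reveals large files committed directly to git.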

denisri commented 2 years ago

I tried git-filter-repo (https://github.com/newren/git-filter-repo/), which can filter a full repository and write the result to a new one. For instance, from a new directory:

git clone https://github.com/populse/populse_mia.git
git init populse_mia_filtered
git filter-repo --source populse_mia --target populse_mia_filtered --path data_tests/ --path python/populse_mia/data_tests --path-glob '*MRIFile*' --path-glob '*.jar' --path resources --path python/populse_mia/ressources --invert-paths
git init populse_data
git filter-repo --source populse_mia --target populse_data --path data_tests/ --path python/populse_mia/data_tests --path resources --path python/populse_mia/ressources

This creates two new repositories in the local directories populse_mia_filtered and populse_data. The first one contains the code, with the files and history of data_tests, resources and MRIFileManager removed (MRIFileManager had already been deleted but remained in the history). git lfs was already used for large data, and the conversion seems to keep it that way. This is only a trial: I have split the repository in 2, but we may use 3 if we want a relatively small one for tests (I think resources is used for unit tests, not data_tests, is that right?)
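One way to check that the filtering really removed a path from the whole history, and not just from the latest commit, is to count the commits that still touch it. A minimal sketch, where commits_touching is a hypothetical helper name and the directory names are the ones used in the commands above:

```shell
#!/bin/sh
# commits_touching REPO PATH  (hypothetical helper, for illustration only)
# Prints the number of commits, on any ref, that touch PATH.
# 0 means the path is absent from the whole history.
commits_touching() {
    repo=$1
    path=$2
    # --full-history disables history simplification, so commits on
    # merged side branches are not silently skipped.
    git -C "$repo" log --all --full-history --format=%H -- "$path" |
        wc -l | tr -d ' '
}

# Example, after the filtering described above:
#   commits_touching populse_mia_filtered data_tests/   # expected to print 0
#   commits_touching populse_data data_tests/           # expected to be > 0
```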

I end up with a populse_mia_filtered repo taking 213 MB of disk space (essentially due to docs and their history I think, which could probably also be filtered out), and populse_data taking 518 MB after conversion, and 2.5 GB after git lfs fetch / git lfs checkout.
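For anyone who wants to reproduce this kind of measurement, the sketch below shows two ways to size a local clone. The helper names git_dir_kb and packed_kb are made up for this example, and it is an assumption that the figures above were obtained in a comparable way.

```shell
#!/bin/sh
# Both helpers are hypothetical, for illustration only.

# git_dir_kb REPO: disk usage of REPO/.git in KiB. This counts everything
# stored there, including objects downloaded by "git lfs fetch"
# (kept under .git/lfs).
git_dir_kb() {
    du -sk "$1/.git" | cut -f1
}

# packed_kb REPO: size in KiB of the ordinary packed git objects only,
# as reported by "git count-objects -v" (the size-pack field).
packed_kb() {
    git -C "$1" count-objects -v | awk '/^size-pack:/ {print $2}'
}

# Example:
#   git_dir_kb populse_mia_filtered
#   packed_kb populse_data
```

Neither number includes the checked-out working tree, so a full "du -sk" on the clone directory will be larger.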

If this is OK, we will at some point need to completely replace the existing repository with the filtered one. This means that everyone will have to push their changes, stop working on the repository during the conversion, and then clone it again from scratch.
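Before discarding an old clone, each contributor could check that nothing is left behind locally. A possible sketch, using two hypothetical helpers (unpushed_commits and uncommitted_files are names invented here):

```shell
#!/bin/sh
# Both helpers are hypothetical, for illustration only.

# unpushed_commits REPO: number of commits reachable from a local branch
# but from no remote-tracking branch.
unpushed_commits() {
    git -C "$1" log --branches --not --remotes --format=%H |
        wc -l | tr -d ' '
}

# uncommitted_files REPO: number of modified, staged or untracked files.
uncommitted_files() {
    git -C "$1" status --porcelain | wc -l | tr -d ' '
}

# Example: both should print 0 before the old clone is deleted.
#   unpushed_commits populse_mia
#   uncommitted_files populse_mia
```

This does not look at stashes, which would need a separate check with "git stash list".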

First we have to decide:

  1. Do we split the repository or not ?
  2. If yes:
    • 2.1. using 2 repos
      • 2.1.1. one for the code, one for all data (data_tests and resources)
      • 2.1.2. one for the code and unit tests data (resources), one for data_tests
    • 2.2 using 3 repos.

I lean towards solution 2.2, in order to have a "minimal" code repo and a "small enough" test data repo (one that can be cloned in AppVeyor tests, for instance), but the other options are OK for me (except 1.).

servoz commented 2 years ago

Sorry, I didn't have time to look seriously at this ticket before...

"I think resources is used for unit tests, not data_tests, is that right ?"

Exactly. data_tests is only used to provide data to the users. resources is used for UTs. In resources we could also remove resources/spm12 which logically should be transferred to mia_processes (I've been thinking about it for a long time but never did it...).

So if we want to do things right, I agree with you (solution 2.2): we should have a light code repo (without resources or data_tests), populse_mia, and two other data repos. Maybe this would be the occasion to rename populse_mia to just mia, since the historical reasons for naming it populse_mia have completely disappeared.

So we would have 3 new repos:

The names mia_data4UTs and mia_data4users are obviously debatable.

servoz commented 2 years ago

There is already a mia package on PyPI... Maybe it's not a good idea to rename populse_mia to mia?

servoz commented 1 year ago

The size of populse_mia is now 150 MB. Data for users and for unit tests are now saved outside GitHub.