Closed denisri closed 1 year ago
I tried git-filter-repo (https://github.com/newren/git-filter-repo/), which can filter a full repository and write the result to a new one. For instance, from a new directory, I tried:
git clone https://github.com/populse/populse_mia.git
git init populse_mia_filtered
git filter-repo --source populse_mia --target populse_mia_filtered --path data_tests/ --path python/populse_mia/data_tests --path-glob '*MRIFile*' --path-glob '*.jar' --path resources --path python/populse_mia/ressources --invert-paths
git init populse_data
git filter-repo --source populse_mia --target populse_data --path data_tests/ --path python/populse_mia/data_tests --path resources --path python/populse_mia/ressources
This creates two new repositories in the local directories populse_mia_filtered and populse_data. The first one contains the code, with files and history from data_tests, resources and MRIFileManager removed (MRIFileManager was removed but remained in the history).
git lfs was already used for large data, and the conversion seems to keep them this way.
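One way to check the result of the filtering is to count the commits that still touch the removed paths in each new repository. This is only a minimal sketch: count_commits_touching is a hypothetical helper name (not part of git or git-filter-repo), and the directory names are the ones created by the commands above.

```shell
# Hypothetical helper: print how many commits, on any ref of the repository
# REPO, touch at least one of the given paths.
# Usage: count_commits_touching REPO PATH...
count_commits_touching() {
    repo=$1
    shift
    # "--" separates revisions from paths, so paths that no longer exist
    # in the working tree are still accepted.
    git -C "$repo" log --all --oneline -- "$@" | wc -l | tr -d ' '
}

# Example calls (assumed directory names from the commands above):
#   count_commits_touching populse_mia_filtered data_tests resources
#   count_commits_touching populse_data data_tests resources
```

If the filtering worked as intended, the first call should report no commits and the second one should report some.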
This is only a trial: I have split the repository in two, but we could use three if we want a relatively small one for tests (I think resources is used for unit tests, not data_tests, is that right?).
I end up with a populse_mia_filtered repo taking 213 MB of disk space (essentially due to docs and their history I think, which could probably also be filtered out), and populse_data taking 518 MB after conversion, and 2.5 GB after git lfs fetch / git lfs checkout.
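For anyone who wants to reproduce this kind of measurement, here is a small sketch that prints the disk usage of each directory. report_size_mb is a hypothetical helper name, and the directory names in the example call are assumed to be the ones created above.

```shell
# Hypothetical helper: print the on-disk size of each directory in MB
# (integer division, so the value is rounded down).
# Usage: report_size_mb DIR...
report_size_mb() {
    for dir in "$@"; do
        # du -sk prints "<size in KB><tab><path>"; keep only the size.
        kb=$(du -sk "$dir" | awk '{print $1}')
        printf '%s\t%s MB\n' "$dir" "$((kb / 1024))"
    done
}

# Example call (assumed directory names):
#   report_size_mb populse_mia_filtered populse_data
```

This counts the working tree and the .git directory together. Running git count-objects -vH inside a repository reports the size of the object store alone.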
If this is OK, we will at some point need to completely replace the existing repository with the filtered one. This means that everyone will have to push their changes, stop working on the repository during the conversion, and then clone it again from scratch.
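The replacement step itself could look like the sketch below. It is not a tested migration procedure: replace_remote_history is a hypothetical helper name, the remote is assumed to be called origin, and the hosting side is assumed to accept force pushes (branch protection may need to be lifted first). Remote branches that do not exist in the filtered repository are left untouched, and uploading git lfs objects is not covered here.

```shell
# Hypothetical helper: overwrite the branches and tags of a remote with the
# history of a locally filtered repository.
# Usage: replace_remote_history FILTERED_DIR REMOTE_URL
replace_remote_history() {
    filtered=$1
    url=$2
    # A repository created with "git init" has no remote yet; drop any
    # existing "origin" so the function can be run more than once.
    git -C "$filtered" remote remove origin 2>/dev/null || true
    git -C "$filtered" remote add origin "$url"
    # Branches and tags are pushed in two steps, both forced because the
    # filtered history shares no commits with the old one.
    git -C "$filtered" push --quiet --force --all origin
    git -C "$filtered" push --quiet --force --tags origin
}

# Example call (the URL is the one cloned earlier in this thread):
#   replace_remote_history populse_mia_filtered https://github.com/populse/populse_mia.git
```

After such a push, existing clones still contain the old history, which is why everyone would have to clone again rather than pull.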
First we have to decide:
[...] data_tests and resources)
[...] resources), one for data_tests
I lean towards solution 2.2, in order to have a "minimal" code repo and a "small enough" test data repo (that can be cloned in AppVeyor tests, for instance), but the other options are OK for me (except 1.).
Sorry, I didn't have time to look seriously at this ticket before...
I think resources is used for unit tests, not data_tests, is that right ?
Exactly. data_tests is only used to provide data to the users. resources is used for UTs. In resources we could also remove resources/spm12 which logically should be transferred to mia_processes (I've been thinking about it for a long time but never did it...).
So if we want to do things right, I agree with you (solution 2.2): we should have a light code repo, populse_mia (without resources or data_tests), and two other data repos. Maybe this would be the occasion to rename populse_mia to just mia, since the historical reasons for naming it populse_mia have completely disappeared.
So we would have 3 new repos:
The names mia_data4UTs and mia_data4users are obviously debatable.
There is already a mia on PyPI... Maybe it's not a good idea to rename populse_mia to mia?
The size of populse_mia is now 150 MB. Data for users and for unit tests are now saved outside GitHub.
The populse_mia project on GitHub contains a few MB of code and about 1 GB of test / demo data, which more than doubles in size when the repository is cloned. A populse_mia clone is thus about 2.5 GB, which is too much and slows down tests that involve cloning a fresh repository. I think we should move the data into a separate repository, maybe "mia_data", "populse_mia_data", or "mia_test_data"... We could use git lfs for that, and we would have to clean the git history of populse_mia to actually free the space taken by the data.
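To see which files actually weigh on the history before deciding what to move, the blobs of a clone can be listed by size. This is a minimal sketch using standard git plumbing; largest_blobs is a hypothetical helper name. Files already tracked by git lfs would only show up as small pointer files here, since their content is stored outside the git object database.

```shell
# Hypothetical helper: list the N largest blobs reachable from any ref,
# one per line, as "<size in bytes> <path>".
# Usage: largest_blobs REPO [N]
largest_blobs() {
    repo=$1
    n=${2:-20}
    # rev-list prints "<object id> <path>"; cat-file looks up the type and
    # size of each object and echoes the path back through %(rest).
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort -rn |
        head -n "$n"
}

# Example call (assumed directory name of a local clone):
#   largest_blobs populse_mia 20
```

The same file can appear several times if several versions of it were committed, because each version is a separate blob.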