opencdms / surface

GNU General Public License v3.0

Migrating SURFACE to GitHub #1

Closed isedwards closed 2 years ago

isedwards commented 3 years ago

The SURFACE development history contains large files that bring the repository size close to 1 GB.

Before migrating to GitHub, the repo can be pruned using a tool like BFG Repo-Cleaner, and decisions can be made on the most appropriate place for large files/binaries.

822M    surface/
 │
 ├─  212M    surface/data
 ├─  413M    surface/.git
 ├─  56K     surface/apache
 ├─  2.6M    surface/scritps
 ├─  86M     surface/notebooks
 ├─  328K    surface/proxy
 └─  109M    surface/api
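
To decide what to prune, it may help to see which files take up the most space across the whole history rather than just the working tree. The following is a minimal sketch using only standard git plumbing; the function name `largest_blobs` is made up for illustration and is not part of SURFACE or BFG.

```shell
# List the N largest blobs reachable from any ref, as "<size in bytes> <path>".
# Usage: largest_blobs <repo-dir> [count]   (hypothetical helper, count defaults to 10)
largest_blobs() {
    repo=$1
    count=${2:-10}
    # rev-list --objects prints "<sha> <path>" for every object in the history;
    # cat-file --batch-check looks up each object's type and size, and %(rest)
    # carries the path through unchanged.
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |   # keep blobs only, drop commits and trees
        sort -rn |               # largest first
        head -n "$count"
}
```

For example, `largest_blobs surface 20` would list the twenty largest files ever committed, including ones that have since been deleted from the working tree.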

Several errors are reported when trying to push the entire repo with its full history to GitHub:

remote: warning: File data/shared/backup/fepagro-surface_db-2020-09-27.sql is 72.16 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: warning: File data/shared/backup/surface_raw_data.tsv is 51.65 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: error: Trace: 13c7e2f960e70e3afb920ba523a3b9e09ce020d61713e56601ef7b6b3901d56d
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File api/fixtures/raw_data.dump is 508.99 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: File api/fixtures/2020-03-03.dump.sql is 476.76 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
To https://github.com/opencdms/surface.git
 ! [remote rejected] develop -> develop (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/opencdms/surface.git'

Since the original repository also contains private information that is specific to the Belize installation, the version with large files removed can be uploaded to the opencdms/surface-demo private repository.

This public repository (opencdms/surface) will be used to create a subset that is released as open-source (with the Belize version migrating to the fully open-source version over time).

JSCesar commented 3 years ago

Hello, I have just pushed a new branch called "cleanup" to the GitLab repository. We removed some large files from the project, and also removed the .env and .key files. The secret key was moved to the .env file and is now only referenced inside settings.py. Let me know if there is anything else to fix.

isedwards commented 3 years ago

Thank you @JSCesar - it looks like you've managed to clean up a lot of the old files.

@fabiosato, the sensitive data is still in the git history, so it's still possible to go back through the history and see the data, e.g.:

git clone -b cleanup --single-branch https://gitlab.com/fabiosato/surface.git
cd surface
# checkout an earlier commit from before the clean up
git checkout 6a54ee2a
cat proxy/cert.key

Instead of cleaning the history using a tool like BFG Repo-Cleaner, should we just start a new git repository, copy the cleaned files into it, and upload that to GitHub without any of the previous history?

JSCesar commented 3 years ago

Hello, @isedwards. Thanks for your reply. @cismoski has cleaned the repository with me and removed a lot of files. Could you inspect the cleanup repository again?

isedwards commented 3 years ago

Hi @JSCesar, it looks like you've modified .gitignore (to ignore those specific files in the future)... but they still exist in the history of changes that were made to the code in the past. If you do git checkout 6a54ee2a to check out a version from before the recent changes, you will see that all of the large and sensitive files are still stored in the repository.

Also, since everything still exists in the git history, anyone downloading the repository still needs 822 MB of disk space - the large files will still cause the errors shown above if we try to upload to GitHub.
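
One way to confirm whether a given file is still reachable in the history, without checking out old commits by hand, is to ask git for every commit that touched its path. This is only a sketch; `history_of_path` is a hypothetical helper name.

```shell
# Print every commit, on any ref, that added, changed or deleted <path>.
# Usage: history_of_path <repo-dir> <path>   (hypothetical helper)
history_of_path() {
    # "--" separates the path from revisions, so this also works for
    # paths that no longer exist in the working tree.
    git -C "$1" log --all --oneline -- "$2"
}
```

For example, if `history_of_path surface proxy/cert.key` prints anything at all, the key can still be recovered from that clone, regardless of what .gitignore says.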

I think it's okay to upload to GitHub as long as we don't include any of the git history (so we would no longer have git log for historic changes, only for future changes).
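
A minimal sketch of that approach, assuming the cleaned files live on a branch such as "cleanup": export only the files at the tip of the branch and commit them into a brand-new repository. The function name, destination directory and commit message are placeholders, not something already agreed in this issue.

```shell
# Create a new repository at <dest-dir> containing only the files at the
# tip of <branch> in <src-repo>, with a single commit and no prior history.
# Usage: fresh_repo_from <src-repo> <branch> <dest-dir>   (hypothetical helper)
fresh_repo_from() {
    src=$1
    branch=$2
    dest=$3
    mkdir -p "$dest"
    # git archive writes the tracked files of one revision, without any .git data
    git -C "$src" archive --format=tar "$branch" | tar -x -C "$dest" -f -
    git -C "$dest" init -q
    git -C "$dest" add -A
    git -C "$dest" commit -q -m "Initial import of cleaned SURFACE sources"
}
```

After something like `fresh_repo_from surface cleanup surface-clean`, the new directory could be given a GitHub remote and pushed. Old commits such as 6a54ee2a would not exist in it, so files removed before the export could not be recovered from the new repository.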

fabiosato commented 3 years ago

@isedwards if you don't think it's an issue let's copy the cleaned files into the new repository.

I will keep the GitLab repo around just in case we need to check the history.

isedwards commented 3 years ago

Is everyone happy for this first issue (migrating the code to GitHub) to be closed now? The next issue, #22, is to make sure NMS of Belize are using the version from GitHub and to move all new development over here.

Let me know if there are any problems, especially relating to how to store the sensitive information that is needed for server deployments - we need to make sure we have a good process in place for when SURFACE is deployed to other organisations and/or the cloud.