mne-tools / mne-lsl

A framework for real-time brain signal streaming with MNE-Python.
https://mne.tools/mne-lsl
BSD 3-Clause "New" or "Revised" License

MAINT: Remove blobs from history #145

Closed larsoner closed 11 months ago

mscheltienne commented 11 months ago

I got the gh-pages branch covered, it's now back to 1 commit, and I removed the large files from the last commit of the main branch. I'm not very confident in editing the git history (I rarely use the git CLI...), could you have a look at simplifying the history? There are 2 points which could help to clean up this repo:

I will also remove the 0.x.x release which will go back to gh fcbg-hnp-meeg/bsl.

larsoner commented 11 months ago

could you have a look at simplifying the history? There are 2 points which could help to clean-up this repo:

Yes but to be safe the way I'd approach the problem is to create a new branch clean-main with a different/blob-less history. Then you can examine the blame on GitHub easily, do a diff between it and main (and see there are no changes), etc. I started working on this and saw the following when I looked for large files in history:

$ git fetch upstream
$ git checkout main
$ git clean -xdf
$ git reset --hard upstream/main
$ git checkout -b clean-main
$ git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
...
d1b37407f3ed   15MiB dev/.doctrees/environment.pickle
ed30123a0406   16MiB dev/_sources/generated/tutorials/20_stream_receiver_filtered_buffer.rst.txt
13ddbac8308b   16MiB dev/generated/tutorials/20_stream_receiver_filtered_buffer.html
2296feabf4e1   21MiB doc/_static/stream_viewer/stream_viewer.gif
1c03d1004ea5   30MiB dev/.doctrees/generated/tutorials/20_stream_receiver_filtered_buffer.doctree
8a5ab8fad80d   38MiB doc/_static/stream_viewer/stream_viewer.mov
af623416d969   40MiB datasets/sample/sample-ant-raw.fif
e3964ca961e4   40MiB datasets/sample/sample-ant-raw.fif
37c3da0c0853   45MiB dev/.doctrees/environment.pickle

Clearly we can get rid of the environment.pickles and the other stuff you mention above, but it would be good to set up some testing infrastructure to get rid of sample-ant-raw.fif as well.

Can we just use MNE-Python's mne-testing-data? It's overkill but sharing the infrastructure for determining version, downloading it, and caching it with GH actions means lower maintenance burden at the mne-lsl end because if stuff breaks the fix is almost always already in MNE-Python itself.
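For readers who prefer Python to the shell pipeline above, the same large-blob listing can be sketched as follows. This is a minimal sketch, not something posted in the thread: `parse_blobs` and `list_blobs` are hypothetical helper names, and it assumes `git` is on the PATH and Python 3.9+. It runs the same two git commands (`rev-list --objects --all` piped into `cat-file --batch-check`) and sorts the blobs by size.

```python
import subprocess


def parse_blobs(batch_check_output: str) -> list[tuple[int, str, str]]:
    """Return (size_in_bytes, short_sha, path) for each blob, smallest first."""
    blobs = []
    for line in batch_check_output.splitlines():
        # Expected layout: "<type> <sha> <size> <path>"; the path may contain spaces.
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees and tags
        path = parts[3] if len(parts) == 4 else ""
        blobs.append((int(parts[2]), parts[1][:12], path))
    return sorted(blobs)


def list_blobs(repo: str = ".") -> list[tuple[int, str, str]]:
    """Run the two git commands from the pipeline and parse their output."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_blobs(checked)
```

Calling `list_blobs()[-10:]` inside a clone would give the ten largest blobs reachable from any ref, which is the tail of the listing shown above.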

mscheltienne commented 11 months ago

Thanks, and thanks for sharing the git commands you are using!

For the datasets, that's already done in #151. Using the MNE-Python infrastructure was a bit overkill. Instead, I used a very lightweight version with a 3-line function to generate a registry (checksum) for the files to download, and another 3-line function to download the files with pooch.
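The two small functions described above might look roughly like this. It is a sketch under stated assumptions and not the code from #151: `make_registry` and `fetch` are hypothetical names, the registry is a plain dict of SHA-256 checksums, and `pooch` is imported lazily so that building the registry needs only the standard library.

```python
import hashlib
from pathlib import Path


def make_registry(folder: str) -> dict[str, str]:
    """Map each file path (relative to folder) to 'sha256:<hexdigest>'."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): "sha256:" + hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def fetch(base_url: str, registry: dict[str, str], name: str, cache: str) -> str:
    """Download one registered file with pooch, verifying its checksum."""
    import pooch  # third-party; imported here so make_registry works without it

    # pooch.retrieve raises if the downloaded file does not match known_hash.
    return pooch.retrieve(url=f"{base_url}/{name}", known_hash=registry[name], path=cache)
```

The appeal of this design is that the checksum doubles as a cache key: an unchanged file is not downloaded again, and a changed file fails loudly.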

mscheltienne commented 11 months ago

.. downloaded from https://github.com/mscheltienne/mne-lsl-datasets.

larsoner commented 11 months ago

... also I'd recommend pushing main as-is before we overwrite it with something like git push origin main:main-bak or whatever you want to call it

mscheltienne commented 11 months ago

I'd recommend pushing main as-is before we overwrite it with something like git push origin main:main-bak

So just making a backup main branch? Done, main-backup.

larsoner commented 11 months ago

Following https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository it looked like https://rtyley.github.io/bfg-repo-cleaner/ would make this trivial, but it needs Java, so instead I went the manual way: I passed the following paths one by one (re-running the file-size command above after each) to git filter-repo --invert-paths --force --path:

Files:

```
dev/.doctrees
datasets
doc/_static/stream_viewer
doc/_static/stream_player
doc/_static/stream_recorder
dev/_static/stream_viewer
stable/_static/stream_viewer/stream_viewer.mov
dev/generated
dev/
stable/
Protocols/cv2.pyd
sample/
doc/_static/icon-with-name/
doc/_static/acronym/
examples/sample/mi_left_right-raw.fif
libLSL/
neurodecode/hci/glass/bgi-cnbi/.gradle/2.2.1/taskArtifacts/fileSnapshots.bin
mne_lsl/lsl/lib/
bsl/lsl/lib/
pycnbi/pylsl/liblsl64.dll
doc/_static/icon-with-acronym
neurodecode/glass/bgi-cnbi/
bsl/externals/pylsl/lib/
docs/sphinx/build/
docs/sphinx/_build/
bsl/triggers/lpt_libs
neurodecode/triggers/lpt/libs/
neurodecode/triggers/libs/
neurodecode/triggers/LptControl_Desktop32.dll
neurodecode/hci/glass/bgi-cnbi/.gradle
pycnbi/glass/bgi-cnbi
neurodecode/hci/glass/bgi-cnbi/.gradle
neurodecode/triggers/LptControl_Desktop64.dll
neurodecode/triggers/inpout32.dll
neurodecode/triggers/inpoutx32.dll
pycnbi/triggers/inpout32.dll
pycnbi/triggers/inpoutx64.dll
Triggers/inpout32.dll
Triggers/inpoutx64.dll
pycnbi/triggers/LptControl_Desktop32.dll
pycnbi/triggers/LptControl_Desktop64.dll
Triggers/LptControl_Desktop32.dll
Triggers/LptControl_Desktop64.dll
Glass/bgi-cnbi/build/intermediates/model_data.bin
```
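As an illustration of how such a path list maps onto a `git filter-repo` call, here is a small sketch. It differs from the manual approach described above in one respect: it batches all paths into a single invocation (`--path` may be repeated) instead of passing them one by one. `filter_repo_command` is a hypothetical helper that only builds the argument list; it does not run anything.

```python
import shlex


def filter_repo_command(paths: list[str]) -> list[str]:
    """Build one git filter-repo call that drops every listed path from history."""
    cmd = ["git", "filter-repo", "--invert-paths", "--force"]
    for path in paths:
        # With --invert-paths, each --path names something to remove, not to keep.
        cmd += ["--path", path]
    return cmd


# Print the command for review instead of executing it.
print(shlex.join(filter_repo_command(["dev/.doctrees", "datasets", "libLSL/"])))
```

Printing the command first is deliberate: the rewrite is destructive, so it seems safer to inspect the exact invocation and run it by hand on a throwaway branch such as clean-main.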

With this the biggest files are now:

6a62a71fe603   56KiB bsl/externals/pylsl/pylsl.py
bb59ec379891   61KiB doc/_static/logging/flowchart-dark.png
c513a72db843   64KiB doc/_static/install/Advanced_system_settings.png
00ef7182b433  120KiB doc/CNBI Arduino Trigger.pdf
38963b805281  136KiB neurodecode/layout/biosemi_128ch.jpg
cad0e5c1a262  145KiB neurodecode/layout/biosemi_064ch.jpg
5e57d4f11eee  181KiB doc/_static/cli/stream_viewer_backend.png
a662e02c5080  188KiB doc/_static/cli/stream_recorder.png
a17d4ab0950a  303KiB doc/_static/cli/stream_player.png
f29b203a3865  594KiB doc/_static/icon/icon.pdf
0067886ece76  600KiB doc/_static/icon/icon.ai
10d18cc67040  694KiB doc/_static/icon/icon.png
450702b383d1  2.1MiB doc/_static/icon/icon.jpg
7e704112c604  4.8MiB doc/_static/icon/icon.eps
6e067c1c7566  6.3MiB doc/_static/icon/icon.psd

Can you look at the file list above and the branch here to see if it makes sense? If it does I can locally do git push upstream --force clean-main:main 🤞

larsoner commented 11 months ago

https://github.com/larsoner/mne-lsl/tree/clean-main

mscheltienne commented 11 months ago

That looks great, thank you! And thanks for the detailed instructions since I'll be doing it again for the icons when I change them on main (and probably a couple of remaining old files in that list).

larsoner commented 11 months ago

Okay, force-pushed clean-main:main

mscheltienne commented 11 months ago

@larsoner Could you have a second look to confirm that it seems in order? I'm struggling a bit.. I removed a couple more files after I changed the icons/logos.

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

The command above gives me 2 different outputs when run in 2 different terminals, in the same repository... and I can't figure out what is going on 😞

Also, the main branch does clone faster; but it seems like the gh-pages branch still holds the blobs in its history and is still taking a long time to retrieve. IIRC, it was created as a new branch from main and not as an empty new branch.

larsoner commented 11 months ago

I should be able to look tomorrow!

mscheltienne commented 11 months ago

Thanks, no hurry :)

larsoner commented 11 months ago

I only see one commit in https://github.com/mne-tools/mne-lsl/tree/gh-pages so I think that's good. I think that the rev-list stuff should ideally be run on a clean clone because it processes all branches at once, and I didn't do that the first time. I just redid it, can you check https://github.com/larsoner/mne-lsl/tree/clean-main and see if it's okay and I'll force-push it to mne-tools:main?

mscheltienne commented 11 months ago

Seems OK, tests are passing. Should be good to go.

I think that the rev-list stuff should ideally be run on a clean clone because it processes all branches at once

I was a bit lost with how it was "sometimes" processing all branches and sometimes not. Thanks for having a second look!

larsoner commented 11 months ago

Okay, force-pushed to main

mscheltienne commented 11 months ago

Thanks! But I'm still lost.. I re-forked the repository to get to a clean state, cloned it (which still required about 110 MiB), and ran again:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

I'm still getting:

a103c0537dd8  4.0MiB doc/_static/stream_recorder/stream_recorder_cli.mov
4c2c92ee6e50  4.0MiB doc/_static/icon-with-name/icon-with-name.eps
a94028888490  4.2MiB doc/_static/stream_player/stream_player_cli.mov
47b68132f888  4.4MiB doc/_static/icon-with-acronym/icon-with-acronym.eps
80b3d1b29247  4.4MiB doc/_static/icon-with-acronym/icon-with-acronym.psd
7e704112c604  4.8MiB doc/_static/icon/bsl-icon.eps
13fd71d6a874  4.8MiB doc/_static/icon-with-name/icon-with-name.psd
364be804568a  5.1MiB libLSL/liblsl32-debug.dll
6e067c1c7566  6.3MiB doc/_static/icon/bsl-icon.psd
443f16464fe2  7.0MiB libLSL/liblsl64-debug.dll
10884859b78b  7.4MiB examples/sample/mi_left_right-raw.fif
a09427b159b4  7.4MiB sample/mi_left_right.fif
a68291046bc1  7.4MiB sample/mi_left_right.fif
2adf8579ef11  7.4MiB sample/mi_left_right-raw.fif
32bba913b292   11MiB dev/.doctrees/environment.pickle
c5bc58c3b2f5   13MiB Protocols/cv2.pyd
2296feabf4e1   21MiB doc/_static/stream_viewer/stream_viewer.gif
8a5ab8fad80d   38MiB doc/_static/stream_viewer/stream_viewer.mov

And many other files above 1 MiB, despite your 2 passes and despite my own with git filter-repo and with bfg-repo-cleaner. Do you see the same on your side?

larsoner commented 11 months ago

From https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository#purging-a-file-from-your-repositorys-history it's probably this:

In order to remove the sensitive file from your tagged releases, you'll also need to force-push against your Git tags

I'll try again!

larsoner commented 11 months ago

Okay I went back through with the removals then did:

larsoner@bunk:~/python/mne-lsl$ git remote add origin git@github.com:/larsoner/mne-lsl.git
larsoner@bunk:~/python/mne-lsl$ git push origin --force --all
Enumerating objects: 10970, done.
...
To github.com:/larsoner/mne-lsl.git
 + d63012c...aebd24c main -> main (forced update)
 * [new branch]      gh-pages -> gh-pages
larsoner@bunk:~/python/mne-lsl$ git push origin --force --tags
Enumerating objects: 10500, done.
...
 * [new tag]         0.6.4 -> 0.6.4
larsoner@bunk:~/python/mne-lsl$ cd ..
larsoner@bunk:~/python$ rm -Rf mne-lsl/
larsoner@bunk:~/python$ git clone git@github.com:/larsoner/mne-lsl.git
Cloning into 'mne-lsl'...
...
Receiving objects: 100% (14850/14850), 4.77 MiB | 24.32 MiB/s, done.
Resolving deltas: 100% (10544/10544), done.
larsoner@bunk:~/python$ cd mne-lsl
larsoner@bunk:~/python/mne-lsl$ du -hs
6.5M    .

So I think it worked. I then went to my gh-pages branch and discarded the 1 commit so that it matched the one here. The size is now ~8MB. Also note:

96a1a32a7959   11MiB dev/.doctrees/environment.pickle

so you could save a tiny bit of git clone by pruning this Sphinx env pickle file when you deploy to gh-pages.

Can you make sure you're happy with the state of larsoner:mne-lsl and if so I'll force push main and tags here? Then things should work 🤞 🤞
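The pruning suggested above could be done with a few lines run on the built site before it is committed to gh-pages. This is a sketch only: `prune_doctrees` is a hypothetical helper, and it assumes the deployed tree contains Sphinx's default `.doctrees` cache directories (where `environment.pickle` lives), which are not needed to serve the HTML.

```python
import shutil
from pathlib import Path


def prune_doctrees(site_dir: str) -> list[str]:
    """Delete every '.doctrees' directory under site_dir; return the removed paths."""
    root = Path(site_dir)
    removed = []
    # sorted() materializes the matches before anything is deleted.
    for doctrees in sorted(root.rglob(".doctrees")):
        if doctrees.is_dir():
            shutil.rmtree(doctrees)
            removed.append(doctrees.relative_to(root).as_posix())
    return removed
```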

mscheltienne commented 11 months ago

Looks good, diff is empty and tests are passing. And it did clone your fork way faster! 🤞 I'll remove that environment file!

Thanks for looking (again) (again) into this!

larsoner commented 11 months ago

Hah!

$ git remote remove origin  # was larsoner
$ git remote add origin git@github.com:/mne-tools/mne-lsl.git
$ git push origin --force --all
Everything up-to-date
$ git push origin --force --tags
Enumerating objects: 10500, done.
...
 + 194c9a2...3588455 0.6.4 -> 0.6.4 (forced update)

The "Everything up-to-date" when I force-pushed the branches suggests it was just the tags that needed to be updated, which is cool. Mystery solved!
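A short sketch of why the tags mattered: `git rev-list --all` walks every ref under `refs/`, so a tag (or a stale remote-tracking branch) that still points at the pre-rewrite history keeps the old blobs reachable. The hypothetical helper below groups the output of `git for-each-ref --format='%(refname)'` by namespace, so it is easy to see which kinds of refs a given clone holds.

```python
from collections import defaultdict


def group_refs(for_each_ref_output: str) -> dict[str, list[str]]:
    """Group full ref names by namespace, e.g. 'heads', 'remotes', 'tags'."""
    groups = defaultdict(list)
    for ref in for_each_ref_output.split():  # ref names never contain whitespace
        parts = ref.split("/", 2)
        # "refs/heads/main" -> namespace "heads", short name "main"
        if len(parts) == 3 and parts[0] == "refs":
            groups[parts[1]].append(parts[2])
    return dict(groups)
```

Comparing the result between two checkouts, for example an old working copy and a fresh clone, is one way to see why the same blob listing can differ between them.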

mscheltienne commented 11 months ago

Yes that looks good! That was too sneaky for me.. 😅