go-gitea / gitea

Git with a cup of tea! Painless self-hosted all-in-one software development service, including Git hosting, code review, team collaboration, package registry and CI/CD
https://gitea.com
MIT License

Gitea dump duplicates repositories on windows #21662

Open eeyrjmr opened 1 year ago

eeyrjmr commented 1 year ago

Description

When comparing the resultant archive produced via "gitea dump" between windows and linux, the windows archive is twice as large.

It appears the bare repositories are duplicated in two locations

gitea-dump-####.zip
    custom
    data
        gitea-repositories
            repo1
            repo2
    repos
        repo1
        repo2

Gitea Version

1.17.2

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

2.34

Operating System

windows, linux

How are you running Gitea?

service

Database

MySQL

lunny commented 1 year ago

So which one is not in the right place?

wxiaoguang commented 1 year ago

Maybe it's related to a long-standing bug: you shouldn't run gitea dump in the gitea directory.

eeyrjmr commented 1 year ago

> Maybe it's related to a long-standing bug: you shouldn't run gitea dump in the gitea directory.

Apologies for the delay... Interesting bug. I have just tried this and the result is the same.

> So which one is not in the right place?

Very good question :) I suspect it is linux but it is likely due to a subtle difference in on-disk file structure. Looking at the restore part of the docs: https://docs.gitea.io/en-us/backup-and-restore/#restore-command-restore

unzip gitea-dump-1610949662.zip
cd gitea-dump-1610949662
mv data/conf/app.ini /etc/gitea/conf/app.ini
mv data/* /var/lib/gitea/data/
mv log/* /var/lib/gitea/log/
mv repos/* /var/lib/gitea/repositories/
chown -R gitea:gitea /etc/gitea/conf/app.ini /var/lib/gitea 

So the repositories are meant to be in the root of the gitea working directory, as this is where the restore sequence instructs the user to put them.

Looking at the dump generated from gitea running in an Alpine VE I see the structure aligns with this

  1. repos directory in the root of the zip containing the repos/orgs
  2. no additional repos stored within the data directory of the zip

Looking at the dump generated from a gitea running in a windows MS I see a subtle difference

  1. repos directory in the root of the zip containing the repos/org
  2. a gitea-repositories directory under the data directory of the zip.

I noticed this oddity some months ago, where the backup zip was larger than the on-disk structure, but I didn't look into it. I recently pushed some older git repos to the instance running on windows and the recent backups are growing:

on-disk = 708Meg
gitea-dump-1668736800.zip = 1,411Meg

The SQL dump (the only thing that should be different) is 1Meg in size. I spent a bit of time looking over the dump code, but I haven't managed to get my head around how it works well enough to understand what it is trying to dump, let alone why it is making this additional directory, and only on windows.

eeyrjmr commented 1 year ago

I do have this in my app.ini

[repository]
ROOT = D:/gitea/data/gitea-repositories

Now thinking about this... could this be related? Looking at: https://docs.gitea.io/en-us/config-cheat-sheet/#repository-repository

ROOT: %(APP_DATA_PATH)s/gitea-repositories: Root path for storing all repository data. A relative path is interpreted as AppWorkPath/%(ROOT)s.

So I set this "just in case", based upon the "windows as a service" docs, to include the full path: https://docs.gitea.io/en-us/windows-service/

So a running gitea is correctly reading this location. Now the backup... the backup code does two things: 1) copies the repositories, 2) backs up ./data

Since I have the repositories in the data subdirectory, they are getting archived twice.
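A minimal sketch of the overlap described above. This is illustrative Python, not Gitea's actual dump code; the helper name `is_inside` is hypothetical, and the data directory `D:/gitea/data` is an assumption based on the paths in this thread. `PureWindowsPath` is used so that the check treats `/` and `\` the same and runs on any OS.

```python
from pathlib import PureWindowsPath


def is_inside(child: str, parent: str) -> bool:
    """Return True if `child` is `parent` itself or lies somewhere below it."""
    c = PureWindowsPath(child)
    p = PureWindowsPath(parent)
    # Comparison is per path component, so "D:/gitea/database"
    # is NOT considered to be inside "D:/gitea/data".
    return c == p or p in c.parents


repo_root = "D:/gitea/data/gitea-repositories"  # ROOT from the app.ini above
data_dir = "D:/gitea/data"                      # assumed data directory

if is_inside(repo_root, data_dir):
    print("repositories live under the data directory: a dump that archives "
          "both locations would contain them twice")
```

With the repositories moved to `D:/gitea/gitea-repositories`, the same check returns False, which matches the expectation that the dump would then be roughly the on-disk size.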

So in theory I should be able to comment out the [repository] section and move D:/gitea/data/gitea-repositories to D:/gitea/gitea-repositories; gitea should keep working, and the gitea dump should be roughly the on-disk size.

lunny commented 1 year ago

So should you move repositories out of data, or should Gitea check if the repositories directory is under ./data?

eeyrjmr commented 1 year ago

> So should you move repositories out of data, or should Gitea check if the repositories directory is under ./data?

Good question :) For consistency I should move the repositories out of data, as this way following the restore-from-backup steps makes sense.

Should gitea check if the repositories are under ./data? Looking at the issue @wxiaoguang linked, there is some commonality, as the migration also put the repositories under ./data. It's extra logic to check and skip.

eeyrjmr commented 1 year ago

OK, it's a bit more involved than that...

I commented out the [repository] entry and ran gitea dump to test:

2022/11/22 08:41:00 ...les/storage/local.go:46:NewLocalStorage() [I] Creating new Local Storage at D:\gitea\data\packages
Failed to include repositories: open D:\gitea\data\gitea-repositories: The system cannot find the file specified.
2022/11/22 08:41:00 cmd/dump.go:241:runDump() [I] Dumping local repositories... D:\gitea\data\gitea-repositories
2022/11/22 08:41:00 cmd/dump.go:159:fatal() [F] Failed to include repositories: open D:\gitea\data\gitea-repositories: The system cannot find the file specified.

That aside, the archive is back to an expected size.

[image]

techknowlogick commented 1 year ago

re-opening as we've received a similar report via chat

Kalyxt commented 1 year ago

I'll post additional info here.

[image: giteasize]

The first line is the zipped gitea folder, which contains all the data; the second line is the dump created by the gitea CLI (1.20.1).

I browsed the dump file and there are duplicated repositories at gitea-dump-1690312222.zip\data\gitea-repositories and gitea-dump-1690312222.zip\repos.
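For anyone who wants to confirm the same duplication in their own dump, here is a rough sketch using Python's standard `zipfile` module. The function name `size_by_prefix` is hypothetical, and the two prefixes are simply the two locations reported in this thread; adjust them if your archive is laid out differently.

```python
import zipfile
from collections import defaultdict


def size_by_prefix(zip_path, prefixes=("repos/", "data/gitea-repositories/")):
    """Sum the uncompressed size (in bytes) of archive entries under each prefix."""
    totals = defaultdict(int)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            # Normalise separators in case entries were stored with backslashes.
            name = info.filename.replace("\\", "/")
            for prefix in prefixes:
                if name.startswith(prefix):
                    totals[prefix] += info.file_size
    return dict(totals)
```

Calling `size_by_prefix("gitea-dump-1690312222.zip")` returns a dict keyed by prefix. If both prefixes appear with roughly the same total, the repositories were most likely archived twice.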

hesseldijk commented 11 months ago

Hi,

Any more information on this? I'm having the same problem (1.20.2)

wxiaoguang commented 5 months ago

When writing #30240, I think I understand more about the problems now (the "dump" code wasn't written by me, so it really takes a lot of time to understand what it is doing...)

The root problem is that some directories overlap. For example: Gitea expects to back up PathA and PathB. But if PathA=C:\git\data and PathB=C:\git\data\sub, then the dumped file contains duplicate files.
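One possible way to avoid that kind of duplication, sketched in Python purely for illustration (this is not what Gitea's dump command does, and `drop_nested` is a hypothetical helper): before archiving, drop every path that is equal to, or nested under, another path in the list.

```python
from pathlib import PureWindowsPath


def drop_nested(paths):
    """Keep only paths that are not equal to, or located under, another kept path."""
    kept = []
    # Visit shorter paths first so a parent is always seen before its children.
    candidates = sorted((PureWindowsPath(p) for p in paths),
                        key=lambda q: len(q.parts))
    for p in candidates:
        if not any(p == k or k in p.parents for k in kept):
            kept.append(p)
    return [str(k) for k in kept]


# PathA and PathB from the example above: only PathA survives,
# because PathB is already covered by archiving PathA.
print(drop_nested([r"C:\git\data", r"C:\git\data\sub"]))
```

A real fix would also need to decide under which archive name the surviving directory is stored, since the restore instructions expect the repositories under repos/.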

At the moment I don't have a clear plan for a complete rewrite, and I can see that the "dump" command has a lot of problems. So a workaround could be to "manually copy the data directory and dump the database", which is more flexible and controllable.