nfdi4plants / ARCitect

Cross-platform app to create and manage ARCs.

[BUG] LFS i/o timeout for large files #189

Closed Hannah-Doerpholz closed 5 days ago

Hannah-Doerpholz commented 1 month ago

OS and framework information (please complete the following information):

Describe the bug

JonasLukasczyk commented 1 month ago

Thank you for posting this issue. This issue seems to be limited to LFS and has also been described here. The proposed solution is to increase the LFS timeout: git config lfs.activitytimeout 60

Please test if this works. If so, we might want to increase this value in ARCitect by default.
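
For reference, `lfs.activitytimeout` is a standard git-lfs setting measured in seconds and applies per repository unless set globally; a minimal sketch for setting and verifying it:

# set the LFS inactivity timeout for this repository, then read it back
git config lfs.activitytimeout 60
git config --get lfs.activitytimeout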

Hannah-Doerpholz commented 1 month ago

I tried it with a higher activitytimeout, but I still can't upload large files. I tested pushing a large file using plain git + git lfs through the CLI with a file of 2.4 GB. That worked just fine. When I tried again with a larger file of 4.9 GB, I ran into the same upload problems. The terminal says "Uploading LFS objects: 0% (0/1), 10 GB | 24 MB/s" (while still uploading). I'm not sure where the 10 GB are supposed to come from; I only added that one 4.9 GB file. The only changes in this commit are the addition of that file and a new line in the .gitattributes, since this file is tracked by LFS.
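
(For context: such a .gitattributes line is what `git lfs track` writes; the pattern below is only an illustrative example, not the one from my ARC.)

# track matching files with LFS; this appends a line to .gitattributes
git lfs track "assays/**/dataset/*"
cat .gitattributes
# assays/**/dataset/* filter=lfs diff=lfs merge=lfs -text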

I actually think this might be an issue with LFS itself. As noted here, it seems that files above (and maybe also close to) 5 GB will be rejected.

Update: It did run through with this output:

Uploading LFS objects: 100% (1/1), 24 GB | 24 MB/s, done.                       
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 8 threads
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 769 bytes | 769.00 KiB/s, done.
Total 6 (delta 2), reused 0 (delta 0), pack-reused 0

I still don't understand how I got to a total of 24 GB though

JonasLukasczyk commented 1 month ago

ARCitect essentially just runs git and LFS commands in the terminal and forwards the terminal output. So it is odd that you see different output in ARCitect than when you run the commands manually in the terminal. We added some additional git features to the latest main branch that might coincidentally fix this issue as well. I will put together a release, and then you could test whether this problem still exists.

Hannah-Doerpholz commented 1 month ago

Maybe I didn't describe it right. I still get the same output in ARCitect as in the terminal; I just have problems pushing really large files. To check whether the issue is with the file or with ARCitect, I first try to push large files from the terminal (for now only the 2.4 GB file and the 4.9 GB file). When I tried pushing a 21 GB file from either ARCitect or the terminal, I got the i/o error. That's why I'm now testing increasing file sizes to find out what the problem might be.

ZimmerD commented 3 weeks ago

One of our cooperation partners also reported a problem uploading larger files on their Windows 10 system within their university network. Whenever they tried to upload a file with a size of ~4.3 GB they received an error. I asked them to send me their data and reproduced the issue in a newly created ARC at our university, using a Windows 11 system and our university (KL) network.

The observed behavior sounds fairly similar to what @Hannah-Doerpholz is describing. Thus, I chose this issue to describe what I have done so far; feel free to move this post to a separate one.

First, I initialized an empty ARC with version 0.0.35. After creating one study and one assay, I pushed this ARC to the datahub. It's now available at: https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3

This is what I have tested so far:

OS: Windows 11, ARCitect version 0.0.35. After initialization I created three files of ~1 GB, ~5 GB and ~10 GB for testing, using the following commands, which can be reproduced for local testing:

fsutil file createnew C:\mypath\1gb.txt 1073741824
fsutil file createnew C:\mypath\5gb.txt 5073741824
fsutil file createnew C:\mypath\10gb.txt 10737418240
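
For anyone reproducing this on Linux or macOS, an equivalent sketch (truncate creates sparse zero-filled files, while fsutil allocates them, but the content, and hence the hashes, are the same):

truncate -s 1073741824 1gb.txt
truncate -s 5073741824 5gb.txt
truncate -s 10737418240 10gb.txt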

Then I added the 1 GB file to the ARCitect, committed, and attempted to sync my changes with the datahub. This completed successfully with the following output:

git branch
* main
git remote set-url origin https://oauth2:4b99a3038700fe158fc200fb05b23acca132f5377426ae56c8c1c3f60d7acb70@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
git push --verbose --atomic --progress origin main
Pushing to https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
Locking support detected on remote "origin". Consider enabling it with:
$ git config lfs.https://oauth2:4b99a3038700fe158fc200fb05b23acca132f5377426ae56c8c1c3f60d7acb70@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git/info/lfs.locksverify true
Uploading LFS objects: 100% (1/1), 1.1 GB | 9.0 MB/s, done.
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 32 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 688 bytes | 688.00 KiB/s, done.
Total 7 (delta 2), reused 0 (delta 0), pack-reused 0
POST git-receive-pack (878 bytes)
To https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
8a2ca2f..69d19ff main -> main
updating local tracking ref 'refs/remotes/origin/main'
git remote set-url origin https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git

Then I added the 5 GB file to the ARCitect, committed, and attempted to sync my changes with the datahub. This produced the following output, including an error:


git branch
* main
git remote set-url origin https://oauth2:4b99a3038700fe158fc200fb05b23acca132f5377426ae56c8c1c3f60d7acb70@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
git push --verbose --atomic --progress origin main
Pushing to https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
Locking support detected on remote "origin". Consider enabling it with:
$ git config lfs.https://oauth2:4b99a3038700fe158fc200fb05b23acca132f5377426ae56c8c1c3f60d7acb70@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git/info/lfs.locksverify true
Uploading LFS objects: 0% (0/1), 41 GB | 24 MB/s, done.
LFS: Put "https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git/gitlab-lfs/objects/b3a4508d730e9916a4487a0bc512894a5f0657cd0c665927a5c7ec736c8fa1c7/5073741824": read tcp 131.246.45.12:60931->132.230.102.154:443: i/o timeout
error: failed to push some refs to 'https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git'
git remote set-url origin https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git

As we can see, this error is similar to what @Hannah-Doerpholz is describing and matches the error our cooperation partners were seeing. The attempted upload is reported as 41 GB in size, whereas the file to be uploaded is only 5 GB.

Afterwards I followed this and other discussions and increased the LFS activity timeout in the ARC by executing the following command:

git config lfs.activitytimeout 3600
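
The same setting can also be applied globally so that it covers every local repository (minimal sketch):

git config --global lfs.activitytimeout 3600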

Afterwards I attempted again to push my commit to the datahub. This time the process was successful and accompanied by this message:


git branch
* main
git remote set-url origin https://oauth2:dcc20b96ec2f11dd23acff53bc61a568557871b5778092410f3d8b18e1d98428@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
git push --verbose --atomic --progress origin main
Pushing to https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
Locking support detected on remote "origin". Consider enabling it with:
$ git config lfs.https://oauth2:dcc20b96ec2f11dd23acff53bc61a568557871b5778092410f3d8b18e1d98428@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git/info/lfs.locksverify true
Uploading LFS objects: 100% (1/1), 5.1 GB | 23 MB/s, done.
Enumerating objects: 12, done.
Counting objects: 100% (12/12), done.
Delta compression using up to 32 threads
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 701 bytes | 701.00 KiB/s, done.
Total 7 (delta 2), reused 0 (delta 0), pack-reused 0
POST git-receive-pack (891 bytes)
To https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
69d19ff..d2d9bed main -> main
updating local tracking ref 'refs/remotes/origin/main'
git remote set-url origin https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git

Note that this could also be a coincidence, as I found that I was sometimes able to push a 5 GB file. Most of the time my attempts were unsuccessful.

However, even with the activity timeout raised, I was never able to successfully push a 10 GB file. The push failed either with this:

git branch
* main
git remote set-url origin https://oauth2:dcc20b96ec2f11dd23acff53bc61a568557871b5778092410f3d8b18e1d98428@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
git push --verbose --atomic --progress origin main
Pushing to https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
Locking support detected on remote "origin". Consider enabling it with:
$ git config lfs.https://oauth2:dcc20b96ec2f11dd23acff53bc61a568557871b5778092410f3d8b18e1d98428@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git/info/lfs.locksverify true
Fatal error: Server error: https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git/gitlab-lfs/objects/732377e7f4a2abdc13ddfa1eb4c9c497fd2a2b294674d056cf51581b47dd586d/10737418240
Uploading LFS objects: 0% (0/1), 0 B | 35 MB/s, done.
error: failed to push some refs to 'https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git'
git remote set-url origin https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git

Or this error message:

git branch
* main
git remote set-url origin https://oauth2:060ca76155de75a6d3ed1bfea83a5b07d7cc1cf8a90665de3973cbbfcd30b792@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
git push --verbose --atomic --progress origin main
Pushing to https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git
warning: current Git remote contains credentials
Locking support detected on remote "origin". Consider enabling it with:
$ git config lfs.https://oauth2:060ca76155de75a6d3ed1bfea83a5b07d7cc1cf8a90665de3973cbbfcd30b792@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git/info/lfs.locksverify true
warning: current Git remote contains credentials
batch response: Authentication required: Authorization error: https://oauth2:060ca76155de75a6d3ed1bfea83a5b07d7cc1cf8a90665de3973cbbfcd30b792@git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git/info/lfs/objects/batch
Check that you have proper access to the repository
Uploading LFS objects: 0% (0/1), 0 B | 18 MB/s, done.
error: failed to push some refs to 'https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git'
git remote set-url origin https://git.nfdi4plants.org/zimmer/LFSLargeFilesUploadTest3.git

At this point I cannot rule out whether the problem is on the client or on the server side. However, I think it was possible to push such large files in the past, so maybe it is a configuration issue. I think it's crucial to address this problem, as it currently hinders the use of ARCitect and the datahub. @JonasLukasczyk, @HLWeil.

HLWeil commented 3 weeks ago

I could replicate this. The long upload times are especially weird. For me, when uploading the 10GB file, my 2h GitLab access token expired.

Contacted @j-bauer, to check for the possibility of server issues.

j-bauer commented 2 weeks ago

I could also reproduce the error. Strangely, I remember that we had this issue a while back but had a fix/workaround for it.

It is related to the way GitLab uploads and hashes LFS files into the S3 storage.

I will try to get to the bottom of the issue and will post updates.

j-bauer commented 2 weeks ago

A potential fix has been created. In order to apply it, I need to restart the DataHUB instance, which I don't want to do during business hours/days. I will restart and deploy the fix tomorrow evening. If that proves successful, I will give more details about what the issue was.

j-bauer commented 2 weeks ago

Deployed the fix and it indeed seemed to fix the problem with a single 10GB file. Feel free to test uploading large/larger files again and please report your findings.

I will post a more in-depth report on Monday, but I thought that large file transfers are something one might want to trigger over the weekend ;-)

ZimmerD commented 2 weeks ago

Thank you! I started some tests this morning:

OS: Windows 11, Software: ARCitect 0.0.36

Push failed with:


git branch
* main
git remote set-url origin https://oauth2:c9e3795edb187c31dd19bedd9899510b3209f5e68878e002c40baf2f35fde596@git.nfdi4plants.org/zimmer/10gbBatchTest.git
git push --verbose --atomic --progress origin main
Pushing to https://git.nfdi4plants.org/zimmer/10gbBatchTest.git
Locking support detected on remote "origin". Consider enabling it with:
$ git config lfs.https://oauth2:c9e3795edb187c31dd19bedd9899510b3209f5e68878e002c40baf2f35fde596@git.nfdi4plants.org/zimmer/10gbBatchTest.git/info/lfs.locksverify true
warning: current Git remote contains credentials
Uploading LFS objects: 0% (0/1), 186 GB | 63 MB/s, done.
batch response: Authentication required: Authorization error: https://oauth2:c9e3795edb187c31dd19bedd9899510b3209f5e68878e002c40baf2f35fde596@git.nfdi4plants.org/zimmer/10gbBatchTest.git/info/lfs/objects/batch
Check that you have proper access to the repository
error: failed to push some refs to 'https://git.nfdi4plants.org/zimmer/10gbBatchTest.git'
git remote set-url origin https://git.nfdi4plants.org/zimmer/10gbBatchTest.git

Hannah-Doerpholz commented 2 weeks ago

That sounds more like your access token expired or doesn't have the correct permissions to write files to that repo. Did you log in correctly to git with it? Or maybe you need to update your credential manager if you're using that.

ZimmerD commented 2 weeks ago

The error message is provided to me by the ARCitect. I assumed that everything concerning git is managed by the ARCitect and did not configure anything beyond what is available through the GUI for this test.

j-bauer commented 2 weeks ago

This is what I could find in the logs:

That seemed to work at 11:53

Started POST "/zimmer/10gbBatchTest.git/info/lfs/objects/batch" for 131.x.x.x at 2024-06-30 11:53:29 +0000
Processing by Repositories::LfsApiController#batch as JSON
  Parameters: {"operation"=>"upload", "objects"=>[{"oid"=>"f0b14a8da7f1c48a0846647a078b97956edd8df451a62fc4b466879aa24d4fd7", "size"=>107374182400}], "transfers"=>["lfs-standalone-file", "basic", "ssh"], "ref"=>{"name"=>"refs/heads/main"}, "hash_algo"=>"sha256", "repository_path"=>"zimmer/10gbBatchTest.git", "lfs_api"=>{"operation"=>"upload", "objects"=>[{"oid"=>"f0b14a8da7f1c48a0846647a078b97956edd8df451a62fc4b466879aa24d4fd7", "size"=>107374182400}], "transfers"=>["lfs-standalone-file", "basic", "ssh"], "ref"=>{"name"=>"refs/heads/main"}, "hash_algo"=>"sha256"}}
Completed 200 OK in 66ms (Views: 0.3ms | ActiveRecord: 14.4ms | Elasticsearch: 0.0ms | Allocations: 12939)

Then at 12:22 it didn't with a 401 (Unauthorized):

Started POST "/zimmer/10gbBatchTest.git/info/lfs/objects/batch" for 131.x.x.x at 2024-06-30 12:22:22 +0000
Processing by Repositories::LfsApiController#batch as JSON
  Parameters: {"operation"=>"upload", "objects"=>[{"oid"=>"f0b14a8da7f1c48a0846647a078b97956edd8df451a62fc4b466879aa24d4fd7", "size"=>107374182400}], "transfers"=>["ssh", "lfs-standalone-file", "basic"], "ref"=>{"name"=>"refs/heads/main"}, "hash_algo"=>"sha256", "repository_path"=>"zimmer/10gbBatchTest.git", "lfs_api"=>{"operation"=>"upload", "objects"=>[{"oid"=>"f0b14a8da7f1c48a0846647a078b97956edd8df451a62fc4b466879aa24d4fd7", "size"=>107374182400}], "transfers"=>["ssh", "lfs-standalone-file", "basic"], "ref"=>{"name"=>"refs/heads/main"}, "hash_algo"=>"sha256"}}
Filter chain halted as :authenticate_user rendered or redirected
Completed 401 Unauthorized in 63ms (Views: 3.8ms | ActiveRecord: 29.3ms | Elasticsearch: 0.0ms | Allocations: 6169)

So it seems that Hannah might be right. I'm not sure how the ARCitect git client is fetching its credentials/access token. If it is using an OAuth token: these have an expiry date and need to be refreshed. There should be a refresh_token alongside the JWT access token, which is used to get a fresh access_token in case it expired. It depends on how ARCitect is internally getting tokens, but that might be the source of the problem.
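
For illustration, a standard GitLab OAuth refresh would look roughly like this (a sketch only; client_id and the stored refresh_token depend on how ARCitect registered its OAuth application):

curl -s -X POST "https://git.nfdi4plants.org/oauth/token" \
  -d "grant_type=refresh_token" \
  -d "refresh_token=<REFRESH_TOKEN>" \
  -d "client_id=<CLIENT_ID>"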

I will try to upload a 100GB file using the git client just to compare and will report back.

ZimmerD commented 2 weeks ago

@JonasLukasczyk What do you think about the last error regarding the upload of the 100 GB file? Do you think this is an inherent problem with the way ARCitect retrieves tokens, or is it related to a different problem?

JonasLukasczyk commented 2 weeks ago

In short, only the current token is used to initialize the LFS transfer, and when it expires during the upload it will get stuck. As a short-term solution I recommend increasing the default token expiration. I'm not sure LFS can handle a token change during a transfer. Here are some relevant references:

* https://forum.gitlab.com/t/git-lfs-authentication-expire-after-30-minutes-big-file-upload-fails-with-401-error/16060

* [Relative expiration times git-lfs/git-lfs#2125](https://github.com/git-lfs/git-lfs/issues/2125)

* [Support the `expires_at` property git-lfs/git-lfs#1345](https://github.com/git-lfs/git-lfs/issues/1345)

ZimmerD commented 2 weeks ago

@HLWeil @j-bauer If I recall correctly, similar problems were also encountered when first uploading large files with the ARC Commander, and increasing the default token expiration was not an option there. Please correct me if I am wrong; maybe I am not remembering this right.

@JonasLukasczyk For the ARC Commander, one of the solutions was to create and store a project access token or a personal access token. Would it be a short-term solution for ARCitect to integrate the possibility to also store such tokens? The tutorials are already there, and I could document this in a video guide.

j-bauer commented 2 weeks ago

> In short, only the current token is used to initialize the LFS transfer, and when it expires during the upload it will get stuck. As a short-term solution I recommend increasing the default token expiration. I'm not sure LFS can handle a token change during a transfer. Here are some relevant references:
>
> * https://forum.gitlab.com/t/git-lfs-authentication-expire-after-30-minutes-big-file-upload-fails-with-401-error/16060
>
> * [Relative expiration times git-lfs/git-lfs#2125](https://github.com/git-lfs/git-lfs/issues/2125)
>
> * [Support the `expires_at` property git-lfs/git-lfs#1345](https://github.com/git-lfs/git-lfs/issues/1345)

We have increased the LFS token lifespan for a while now. The problem is elsewhere.

j-bauer commented 2 weeks ago

> @JonasLukasczyk What do you think about the last error regarding the upload of the 100 GB file? Do you think this is an inherent problem with the way ARCitect retrieves tokens, or is it related to a different problem?

I currently also cannot upload a 100GB file using git directly; however, I don't get a 401 error but a 500. In my case, it is running into a postgres timeout which I'm still tuning.

JonasLukasczyk commented 2 weeks ago

@ZimmerD Currently the oauth token expires in 2h. Did you see any errors before then?

JonasLukasczyk commented 2 weeks ago

To dig deeper you can get more LFS details by setting the following environment variables:

GIT_TRACE=1
GIT_CURL_VERBOSE=1
GIT_TRANSFER_TRACE=1
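
For example, on Windows (cmd), where the tests above were run, they can be set for a single session before pushing (a sketch; the push command mirrors the ARCitect output above):

set GIT_TRACE=1
set GIT_CURL_VERBOSE=1
set GIT_TRANSFER_TRACE=1
git push --verbose --atomic --progress origin main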

It seems the bug is not in ARCitect. It is probably related to the server storage setup and the authentication. So to investigate this, we should execute the LFS commands in the terminal and not through ARCitect.

j-bauer commented 2 weeks ago

So we have 2 problems at hand.

Large uploads in LFS

The core of the issue here is the way GitLab handles the upload of LFS files. The LFS files are stored within the storage backend (independently of whether it's a filesystem or S3) with their hash as filename/object name. GitLab therefore needs to know the hash of the entire file before storing it in its final location.
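
For illustration: the LFS object ID (OID) is simply the SHA-256 of the file contents, which is the long hash visible in the gitlab-lfs/objects/... URLs in the logs above:

# the OID under which an LFS object is stored is the file's SHA-256
sha256sum 5gb.txt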

Prior to v15.x, an LFS file was temporarily stored on the host filesystem (for hashing purposes) before being uploaded to the S3 storage. Once the hash was calculated, it was pushed directly to its final location within S3.

Since some update, they changed that behaviour when an S3 storage is configured for the LFS objects. Now the file is no longer cached locally on the filesystem but is uploaded to a temporary location on the S3 storage. When the client-side upload is complete and GitLab has the full hash, it triggers a move operation to transfer the LFS object from the temporary location to its final location within S3 - also known as a server-side copy.
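
Conceptually, that final move is the same kind of server-side copy the AWS CLI performs between two S3 URIs (bucket and paths here are hypothetical, purely to illustrate that the data moves within S3 rather than through the client):

aws s3 cp s3://datahub-lfs/tmp/uploads/<upload-id> s3://datahub-lfs/lfs-objects/<oid>
aws s3 rm s3://datahub-lfs/tmp/uploads/<upload-id>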

A lot of components (gitlab-rails, gitlab-workhorse) are involved in the task management of such processes and have timeouts for various steps (e.g. the upload process itself, the data copying process). I don't want to go too deep into details, but each upload to S3 (both the temporary one and the final one) is done using multi-part uploads. Multi-part uploads can be tuned in various ways that impact their speed and efficiency: chunk size, threads, etc. We managed to find where this is configured within GitLab and increased these values to speed up the transfer/copy of large files.
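
As a rough illustration of the knobs involved, these are the AWS CLI equivalents of such multi-part settings (not GitLab's internal configuration; the values are examples):

# larger parts and more parallel part uploads speed up multi-part transfers
aws configure set default.s3.multipart_chunksize 512MB
aws configure set default.s3.max_concurrent_requests 16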

However, the timeouts of the processes handling the upload/copy tasks also have an impact on the outcome. These timeouts can be tuned and have been generously increased (some from 2h to 24h), which has solved the I/O timeouts - at least in my tests. I'm still having a problem with a PostgreSQL socket timeout error, which is configurable (which we did), but the setting doesn't seem to take effect. This is currently the only error I see when trying to upload a 100GB file. I'm still investigating, but I'm confident we will solve that too.

As a closing remark on this topic: there will always be some size limits (or rather upload time limits) - this is unavoidable. The question is how high the limit is once we have fixed what we could.

Token expiration

The 401 errors reported by @ZimmerD are definitely caused by some token expiring. When a git client pushes a commit, the initial authentication uses the GitLab-OAuth token received by ARCitect during the login. This one expires after 2h. If there are some LFS files, the git client authenticates against the LFS server and retrieves an LFS access token (which is different from the GitLab-OAuth token). This LFS access token used to expire after 2h, but we increased that to 24h (see: https://github.com/nfdi4plants/DataHUB/blob/main/scripts/17.1.1-ee.0/patches/lfs_token.rb.patch).
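
For reference, that LFS token exchange happens at the batch endpoint visible in the logs above, and it can be exercised directly; per the git-lfs batch API, the upload action in the response carries the token expiry (expires_in/expires_at). A sketch with placeholder values:

curl -s -X POST "https://git.nfdi4plants.org/<user>/<repo>.git/info/lfs/objects/batch" \
  -u "oauth2:<TOKEN>" \
  -H "Accept: application/vnd.git-lfs+json" \
  -H "Content-Type: application/vnd.git-lfs+json" \
  -d '{"operation":"upload","objects":[{"oid":"<sha256>","size":1073741824}],"transfers":["basic"]}'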

Now from the logs, it is not clear whether the 401 is coming from GitLab or from the LFS server.

Note that the 2h for the GitLab-OAuth token is only relevant for non-LFS operations. That was always the case, even in prior versions when the upload of large LFS files worked. If some non-LFS operation is triggered after, for example, a large LFS upload that takes more than 2h, then the GitLab-OAuth token would have expired. I'm fairly sure that using the git client, non-LFS operations are done first and LFS operations are done last (using the now-24h-expiration token for the LFS server), so there are no non-LFS operations after large file transfers. Thus, I'm not sure whether that is a potential troublemaker, but we will try to increase that as well. It largely depends on the way ARCitect/arc-commander interact with GitLab in the end.

I cannot reproduce the above-mentioned issue. I run into the PSQL problems mentioned earlier, but not into 401 errors. That is while using the git client on a Linux machine. I will continue to debug this issue and hope to come up with a solution fast. Once the upload of a 100GB file works reliably using the git client, we can see if the same upload using ARCitect is also fixed or if the problem lies elsewhere.

JonasLukasczyk commented 2 weeks ago

Thank you all for the detailed analysis. @j-bauer, are the new expiration dates already deployed? If yes, @ZimmerD, please rerun your tests. Today I successfully pushed files of up to 30 GB to the DataHub, but they all went through in under 2h. Now I'm uploading a 100 GB file to go past the 2h GitLab token limit.

j-bauer commented 2 weeks ago

After a quick talk with @JonasLukasczyk it turns out that the 2h OAuth token is indeed the problem. In my tests I wasn't using an OAuth token but a personal access token (PAT), which explains the difference. This 2h limit cannot be changed either:

https://docs.gitlab.com/ee/integration/oauth_provider.html#access-token-expiration

The idea is now to use the OAuth token to get a PAT and use that for the transfer. That should work.
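
For manual testing, a PAT can already be used today the same way ARCitect embeds the OAuth token in the remote URL (compare the set-url lines in the logs above; <PAT>, <user> and <repo> are placeholders):

git remote set-url origin "https://oauth2:<PAT>@git.nfdi4plants.org/<user>/<repo>.git"
git push --verbose --atomic --progress origin main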

JonasLukasczyk commented 2 weeks ago

Just for the record, I want to note here that one of the two issues is indeed the token expiration. I tested this by starting an LFS upload and revoking the token via the datahub API shortly after. It seems that as soon as LFS wants to upload the next chunk, it runs into an authentication issue. As reported by @j-bauer, we cannot increase the 2h token limit, so the plan is now to first retrieve the 2h token as usual, and then use this token to automatically create or fetch an existing unlimited personal access token. The new token will then be used during the LFS transfer.
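
For anyone reproducing the revocation test, GitLab exposes a standard OAuth revocation endpoint (a sketch; the credentials are placeholders):

curl -s -X POST "https://git.nfdi4plants.org/oauth/revoke" \
  -d "client_id=<CLIENT_ID>" \
  -d "client_secret=<CLIENT_SECRET>" \
  -d "token=<TOKEN>"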

j-bauer commented 1 week ago

Ok, good news about the 100GB upload test: I finally found out how to set the PostgreSQL timeout that was being triggered while the temporary LFS file on S3 was copied to its final location, and I just successfully completed a 100GB upload.

Note: That was with a personal access token using the git tools directly though.

HLWeil commented 1 week ago

> Ok, good news about the 100GB upload test: I finally found out how to set the PostgreSQL timeout that was being triggered while the temporary LFS file on S3 was copied to its final location, and I just successfully completed a 100GB upload.
>
> Note: That was with a personal access token using the git tools directly though.

Indeed good news!

So with the ARCitect being able to handle/store Personal Access Tokens, the limit might be pushed quite a bit.

One naive question, @j-bauer: Given the fixes you implemented and me using a Personal Access Token, are there still any hard, currently unavoidable size limits? If so, maybe we could document them somewhere and include notifications for users in our tools when they add files larger than this limit.

j-bauer commented 1 week ago

> > Ok, good news about the 100GB upload test: I finally found out how to set the PostgreSQL timeout that was being triggered while the temporary LFS file on S3 was copied to its final location, and I just successfully completed a 100GB upload. Note: That was with a personal access token using the git tools directly though.
>
> Indeed good news!
>
> So with the ARCitect being able to handle/store Personal Access Tokens, the limit might be pushed quite a bit.
>
> One naive question, @j-bauer: Given the fixes you implemented and me using a Personal Access Token, are there still any hard, currently unavoidable size limits? If so, maybe we could document them somewhere and include notifications for users in our tools when they add files larger than this limit.

As said, the limits are based on various timeouts, most of which are now very high. The actual size limit depends on many other factors, like upload speeds, a single large file vs. many large files in one commit, etc. We just need to test more to see what the current size limits look like.

JonasLukasczyk commented 1 week ago

136157d934d760adc5f5d850fd31a8a6f115df15 solves the access token part of this issue. This commit adds the following functionality to the git remote management:

  1. A tooltip that informs the user that, if LFS is enabled, long up- and download times require an access token. [screenshot]

  2. The add-remote dialog now has an optional token field. On hovering over the question-mark button, the user sees the same tooltip as described above, and pressing the button opens the help docs in the browser that describe the creation of access tokens. Pressing the key button immediately opens the token generation page of the corresponding datahub. We should update the docs to recommend choosing appropriate expiration times and to note that only the read/write repository scopes are required. [screenshot]

  3. If a remote has an access token, this is indicated in the list of remotes. [screenshot]

ZimmerD commented 1 week ago

@JonasLukasczyk @j-bauer This is great, thanks for the detailed analysis and adaptations! I will test the new functionality and provide feedback.

j-bauer commented 1 week ago

The DataHUB was redeployed for the update. A couple of configuration options changed, so I needed to adapt the fixes. This is done; however, I will test a 100GB upload again to make sure the fixes still work.

j-bauer commented 1 week ago

> The DataHUB was redeployed for the update. A couple of configuration options changed, so I needed to adapt the fixes. This is done; however, I will test a 100GB upload again to make sure the fixes still work.

The fixes haven't taken effect yet; I need to restart the instance for them to apply. I will do that this evening.

j-bauer commented 5 days ago

The restart seems to have helped - I just successfully uploaded a 101GB file. FYI: I needed to use a file with a different hash, hence the slight increase in size, as GitLab knows that a file with the previous hash already exists in S3 and would just skip the entire upload process, afaik. Please test and report your results.
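
Side note for reproducing this: fsutil/truncate produce zero-filled files, so two test files of the same size hash to the same OID; varying the size is the easiest way to force a fresh OID (Linux sketch):

# same-size zero-filled files share one SHA-256, so change the size
truncate -s 101G 101gb.bin
sha256sum 101gb.bin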

JonasLukasczyk commented 5 days ago

Seems like this at least works now. Next we can improve the performance if the upload is still slow, but we will track that in another issue.