Text file metadata for dehydrated files do not match the hydrated metadata

bionicbeagle commented 1 year ago

I noticed that the file size before and after hydration differs. This causes unexpected things to happen in some cases. For example:

 Volume in drive D is Data
 Volume Serial Number is F24C-78F7

 Directory of d:\FN2\Engine\Source

10/03/2023  09:30               421 UnrealClient.Target.cs
10/03/2023  09:30               448 UnrealEditor.Target.cs
10/03/2023  09:30               463 UnrealGame.Target.cs
10/03/2023  09:30             (455) UnrealServer.Target.cs
               4 File(s)          1,787 bytes

d:\FN2\Engine\Source>more UnrealServer.Target.cs
// Copyright Epic Games, Inc. All Rights Reserved.

using UnrealBuildTool;
using System.Collections.Generic;

[SupportedPlatforms(UnrealPlatformClass.Server)]
public class UnrealServerTarget : TargetRules
{
    public UnrealServerTarget(TargetInfo Target) : base(Target)
        {
                Type = TargetType.Server;
                IncludeOrderVersion = EngineIncludeOrderVersion.Latest;
                BuildEnvironment = TargetBuildEnvironment.Shared;
                ExtraModuleNames.Add("Unreal

d:\FN2\Engine\Source>more UnrealServer.Target.cs
// Copyright Epic Games, Inc. All Rights Reserved.

using UnrealBuildTool;
using System.Collections.Generic;

[SupportedPlatforms(UnrealPlatformClass.Server)]
public class UnrealServerTarget : TargetRules
{
    public UnrealServerTarget(TargetInfo Target) : base(Target)
        {
                Type = TargetType.Server;
                IncludeOrderVersion = EngineIncludeOrderVersion.Latest;
                BuildEnvironment = TargetBuildEnvironment.Shared;
                ExtraModuleNames.Add("UnrealGame");
        }
}

d:\FN2\Engine\Source>dir Unreal*
 Volume in drive D is Data
 Volume Serial Number is F24C-78F7

 Directory of d:\FN2\Engine\Source

10/03/2023  09:30               421 UnrealClient.Target.cs
10/03/2023  09:30               448 UnrealEditor.Target.cs
10/03/2023  09:30               463 UnrealGame.Target.cs
10/03/2023  09:30               471 UnrealServer.Target.cs
               4 File(s)          1,803 bytes
               0 Dir(s)  3,100,296,794,112 bytes free

As you can see, the unhydrated file size is reported as 455 and the hydrated file size is 471. The difference is presumably because p4 stores text files in a normalized form on the server and transcodes it according to client settings as part of the syncing process.

The end result is that in this case more will only do a partial read of the file the first time, but the second time it reads the entire file. Many other processes still function as expected, which I assume means they just keep reading until EOF instead of checking the file size up front.

santagada commented 1 year ago

When looking at this driver I was very curious how Microsoft solved this problem... so apparently there's no fix for it yet.

When we needed something like that we created a post submit trigger called crlfsizer.py that counted the actual size of files by counting end of lines by p4printing all files in the cl and then adding the crlf size as an attribute to all changelists/files.

A bit messy but I think its the only possible solution for this.

jspelletier commented 1 year ago

In our tools we often get the size of the file and then read the file in memory to process it. Those tools are failing when the file is virtualized as the files are truncated due to the file size being smaller.

jessk-msft commented 1 year ago

Unfortunately this is expected behaviour for non-binary file types. The size of hydrated file on the PC may differ from the filesize from the perforce file metadata, depending on the workspace encoding options. This is caused by the file BOM and line end character selection. A proper solution would likely require support on the Perforce server. At Microsoft we've worked around this by:

If the tool needs the exact file size, it must first open file for read.

jspelletier commented 1 year ago

I see two solutions

could do some p4 print to get the file sizes and store this locally
Could change the p4 sync -k for some p4 sync without -k that would discard the received file but use the received file to compute the final size and store this locally.(this is similar to the p4 print idea but doing this at the same time as the sync is executed)

Could also store these in a server somewhere optionally to avoid multiple users computing this.

Note: We have a similar tool at ubisoft and we are doing p4 print in background to compute the file sizes.

jessk-msft commented 1 year ago

That's a really nice solution jspelletier. I'm not sure how it could be done quickly and atomically during the P4VFS sync process. We have also considered having a perforce trigger do something similar for text files in a change-commit trigger, and write the size/revision info to another database. This would allow us to get the correct size after a quick sync of a newly submitted file.

jspelletier commented 1 year ago

The problem with triggers is what do you do if trying to use p4vfs on a already existing Perforce branch? Whatever the solution I think it needs to be able to work with any existing perforce depot.

BTW, another thing we are doing in our internal tool is storing files outside Perforce in some very fast http server as a cache as Perforce is not particularly fast at sending files using p4 print.

santagada commented 1 year ago

To explain again what we did:

Create a trigger that runs on change-content does p4 print, counts crlf's and store the windows size in a perforce attribute, the same database that stores all other file metadata (no thirdparty database needed, no split sources of truth).
On p4 sync we store the windows size on the stubs created, when asking perforce with p4 stat you can ask for extra parameters, its fast and always correct
as a fallback during syncing if the file doesn't have the windows size, you have to fallback and sync the file (p4 printing it to count the size and then throwing out the results is not very useful).
for existing branches just run the same script once and populate the attribute, admin users can update attributes of all files in all revisions.

ps: about p4 prints being slow it is true.... a caching system is really interesting, specially one that does lz4/snappy or some proprietary compression that has many tentacles ;), but is not necessary even on multiple terabyte repos.

Dragnalith commented 1 year ago

At GDC 2023 (last week) when Perforce presented their virtual filesystem solution they explained they improved "p4 sizes" capability to be able to return the file size according to workspace options.

When released, I will become easy to fix this issue I guess.

My 2 cents,

jessk-msft commented 1 year ago

That is certainly a feature we'll add support for right away! I really hope that the workspace fileSize could possibly be returned in the sync -n. If not, then we'll have to request the workspace fileSize in subsequent sizes command.

santagada commented 1 year ago

I wonder if perforce now stores both sizes or somehow fallback to calculate the size on the server when you ask for stats. Also I wonder if they are now also keeping a MD5 for the windows version of the file, would be nice to be able to use a normal md5hash function instead of having to curate it for text files.

PS: Is the perforce presentation public? I'll open a support request to see it but their implementation already starts skewed by not presenting it to all perforce users.

belkiss commented 1 year ago

Is the perforce presentation public?

Asked them for more info on twitter, they said they have a webinar coming at the end of April :)

billfreist commented 1 year ago

It was yesterday and they specifically mentioned the new sizes -C version of the command which respects your workspace settings, lazily calculating the local sizes.

Edit: I should also mention the new version is expected out mid May (2023.1). Their client side solution will also be in preview in that release.

jessk-msft commented 1 year ago

This will be fixed in the upcoming 1.27.0.0 release of P4VFS on with Helix Core server 2023.1/2442900 or later, using the new sizes -C support. Work is being done in branch dev/filesizes.

jessk-msft commented 10 months ago

Now supported in version 1.27.0.0 with p4vfs.exe sync -c on servers 2023.1 and later.

microsoft / p4vfs

Text file metadata for dehydrated files do not match the hydrated metadata #3