Closed bionicbeagle closed 10 months ago
When looking at this driver I was very curious how Microsoft solved this problem... so apparently there's no fix for it yet.
When we needed something like that we created a post submit trigger called crlfsizer.py that counted the actual size of files by counting end of lines by p4printing all files in the cl and then adding the crlf size as an attribute to all changelists/files.
A bit messy but I think its the only possible solution for this.
In our tools we often get the size of the file and then read the file in memory to process it. Those tools are failing when the file is virtualized as the files are truncated due to the file size being smaller.
Unfortunately this is expected behaviour for non-binary file types. The size of hydrated file on the PC may differ from the filesize from the perforce file metadata, depending on the workspace encoding options. This is caused by the file BOM and line end character selection. A proper solution would likely require support on the Perforce server. At Microsoft we've worked around this by:
I see two solutions
Could also store these in a server somewhere optionally to avoid multiple users computing this.
Note: We have a similar tool at ubisoft and we are doing p4 print in background to compute the file sizes.
That's a really nice solution jspelletier. I'm not sure how it could be done quickly and atomically during the P4VFS sync process. We have also considered having a perforce trigger do something similar for text files in a change-commit trigger, and write the size/revision info to another database. This would allow us to get the correct size after a quick sync of a newly submitted file.
The problem with triggers is what do you do if trying to use p4vfs on a already existing Perforce branch? Whatever the solution I think it needs to be able to work with any existing perforce depot.
BTW, another thing we are doing in our internal tool is storing files outside Perforce in some very fast http server as a cache as Perforce is not particularly fast at sending files using p4 print.
To explain again what we did:
Create a trigger that runs on change-content
does p4 print, counts crlf's and store the windows size in a perforce attribute, the same database that stores all other file metadata (no thirdparty database needed, no split sources of truth).
On p4 sync we store the windows size on the stubs created, when asking perforce with p4 stat you can ask for extra parameters, its fast and always correct
as a fallback during syncing if the file doesn't have the windows size, you have to fallback and sync the file (p4 printing it to count the size and then throwing out the results is not very useful).
for existing branches just run the same script once and populate the attribute, admin users can update attributes of all files in all revisions.
ps: about p4 prints being slow it is true.... a caching system is really interesting, specially one that does lz4/snappy or some proprietary compression that has many tentacles ;), but is not necessary even on multiple terabyte repos.
At GDC 2023 (last week) when Perforce presented their virtual filesystem solution they explained they improved "p4 sizes" capability to be able to return the file size according to workspace options.
When released, I will become easy to fix this issue I guess.
My 2 cents,
That is certainly a feature we'll add support for right away! I really hope that the workspace fileSize could possibly be returned in the sync -n
. If not, then we'll have to request the workspace fileSize in subsequent sizes
command.
I wonder if perforce now stores both sizes or somehow fallback to calculate the size on the server when you ask for stats. Also I wonder if they are now also keeping a MD5 for the windows version of the file, would be nice to be able to use a normal md5hash function instead of having to curate it for text files.
PS: Is the perforce presentation public? I'll open a support request to see it but their implementation already starts skewed by not presenting it to all perforce users.
Is the perforce presentation public?
Asked them for more info on twitter, they said they have a webinar coming at the end of April :)
It was yesterday and they specifically mentioned the new sizes -C
version of the command which respects your workspace settings, lazily calculating the local sizes.
Edit: I should also mention the new version is expected out mid May (2023.1). Their client side solution will also be in preview in that release.
This will be fixed in the upcoming 1.27.0.0 release of P4VFS on with Helix Core server 2023.1/2442900 or later, using the new sizes -C
support. Work is being done in branch dev/filesizes.
Now supported in version 1.27.0.0 with p4vfs.exe sync -c
on servers 2023.1 and later.
I noticed that the file size before and after hydration differs. This causes unexpected things to happen in some cases. For example:
As you can see, the unhydrated file size is reported as 455 and the hydrated file size is 471. The difference is presumably because p4 stores text files in a normalized form on the server and transcodes it according to client settings as part of the syncing process.
The end result is that in this case
more
will only do a partial read of the file the first time, but the second time it reads the entire file. Many other processes still function as expected, which I assume means they just keep reading until EOF instead of checking the file size up front.