godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License
88.36k stars 20.01k forks source link

On Windows at least building with a -j > 1 scons parameter fails with a file not found even though the file exists #5042

Closed corekase closed 6 years ago

corekase commented 8 years ago

Operating system or device - Godot version:

Windows 10 Pro 64-bit, Godot source 64-bit build (tested with master branch)

Visual Studio 2015 Community, Windows 10 SDK Kit 10586.

Python 2.7.11, Scons 2.5.0, and Pywin32 220 all these tools are 32-bit.

Issue description (what happened, and what was expected):

When I issue the command:

scons platform=windows target=release_debug -jx

Where "x" is a number > 1 then the build process will eventually give a file not found error. Going to where it says the file is not found it is present. Specifying -j1 on the build line works fine.

Steps to reproduce:

scons platform=windows target=release_debug -j2

bojidar-bg commented 8 years ago

Can you give the build log, or at least the final part of it?

corekase commented 8 years ago

I didn't see an obvious log file anywhere in the directory so I copied the last bit of the command-prompt output:

http://pastebin.com/7wCqdF5B

Build line was:

scons platform=windows target=release_debug -j4

One thing to note: it never happens with the same file, always a different file so it might be some kind of race condition?

SuperUserNameMan commented 8 years ago

@headkase : it might be a stupid question, but does it change something if you add an space between -j and 4 ?

corekase commented 8 years ago

Nope, just did this from a clean clone:

scons -j 4 platform=windows target=release_debug

Still errored on a random file.

volzhs commented 8 years ago

I compile godot with this command on cmd as administrator permission.

"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" amd64 && call C:\Python27\Scripts\scons.bat -j 4 platform=windows target=release_debug

how about this?

corekase commented 8 years ago

Nope, that errored right away too.

corekase commented 8 years ago

And I re-ran it with Adminstrator privileges, still errored with a file not found.

corekase commented 8 years ago

@volzhs What versions of Python, SCons, and Pywin32 do you have installed? If different than mine I'll try those.

Calinou commented 8 years ago

I can reproduce this with Python 2.7.11, SCons 2.5.0 and PyWin32 build 220, all in 64-bit, when compiling with VS 2015.

volzhs commented 8 years ago

@headkase Python 2.7.11, SCons 2.4.1 and PyWin32 build 220

corekase commented 8 years ago

@volzhs @Calinou

I downgraded SCons to 2.4.1 and compiled from a clean clone each of two times. They built fine using -j4 as a parameter. So, SCons 2.5.0 looks to be the culprit.

Griefchief commented 8 years ago

@headkase No, it happens on 2.4.0 for me when compiling MS VisualC++ sometimes (very rarely though).

It's a race condition most probably, some other compiler process is accessing/locking the same file that cannot be accessed... I have not looked more into it though, not an expert on multithreaded race conditions.

I remmeber there being a setting (or scons using by default) to randomize build order so that race conditions do not happen where multiple jobs are trying to build the same file...

This random building is probably the reason why sometimes we hit the race condition, and sometimes we do not.

corekase commented 8 years ago

Well, with 2.5.0 I had 100% failure with more than one job. With 2.4.1 I had it complete successfully two times in a row.

Griefchief commented 8 years ago

@headkase Random luck/unluck or 2.5.0 being really bad at randomizing build order...

corekase commented 8 years ago

As the issue has been demonstrated to be with SCons, not Godot itself, if there is nothing more to add then it may be closed. Thank you for the help everyone.

volzhs commented 8 years ago

@Griefchief @headkase I have compiled hundreds times with this environment for months, and I never got failure compiling.

Griefchief commented 8 years ago

Ah, OK, i'm wrong, SCons is not randomizing dependencies on build order...

@headkase can you add --random to your scons builds in 2.5.0 and test if it will error? Also without?

http://scons.org/doc/HTML/scons-user.html#idp1392274268

corekase commented 8 years ago

@Griefchief I have a working system now. Can you upgrade to 2.5.0 and try a multi-job compile with the --random option? Then add your results here.

Griefchief commented 8 years ago

@headkase Hm, OK, I can probably do that, will be in a few hours, or a day or two though, got some work currently. Thanks.

corekase commented 8 years ago

Additional note: Rémi Verschelde on Facebook indicated that it should still be considered a bug in the build system as SCons 2.5.0 works correctly under Linux.

https://www.facebook.com/groups/godotengine/permalink/785635728239690/

So I recall the request to close this issue at this time.

akien-mga commented 8 years ago

Well since the error is tools\editor\editor_fonts.cpp(33): fatal error C1083: Cannot open include file: 'builtin_fonts.h': No such file or directory, it's a buildsystem bug. This file is one of the first ones that the buildsystem should generate from the TTF files. It should not start building the editor before this one was done.

corekase commented 8 years ago

@akien-mga If there is a concrete test then I will gladly re-upgrade to 2.5.0 and build the test - that's the least I can do for the Godot project.

Griefchief commented 8 years ago

It seems that MSVC has it's own multithreaded build... maybe paralel builds arent thread safe without? maybe we can use this to avoid the race condition with -j by redirecting -j into the compiler itself (if -j can be overriden)...

I'm just pasting this link here for reference, i have no idea if this will aleviate the race condition... or help with speed at all... (I guess it might, a bit)

https://msdn.microsoft.com/en-us/library/bb385193.aspx

gau-veldt commented 8 years ago

Investigation needs to be done to determine if scons uses process forking (creates separate copied environments similar to using & in sh in linux) or threading (runs in same process thus sharing resources and all classic threading issues apply) to implement -j.

My hunch is we are getting collisions on the algorithm or system call generating tempfile names because in Windows command lines need to be passed via files due to the system's limit on the length of command lines. However if scons uses threads then the potential error space expands substantially.

However with processes (distinct environments) the error space is much reduced and a weak tempfile naming algorithm that generates a high number of collisions in tempfile names would generate the very race situation where a later job overwrites an (shared due to opening same file on account of naming collision) argument file with different include specs before a prior job using the same file finishes.

I suspect this reason due to past experience a couple years back where the problem with building (in mingw at the time usually during links where the argument size is quite large) was due to the build process not using temporary argument files and this functionality had to be added to the project.

Griefchief commented 8 years ago

Agents, I have a hunch that maybe none of us have installed pywin32 properly into python? My SCons complains about python even though i think i installed it earlier (maybe into the wrong version)...

Is somebody not receiving a SCons warning related to pywin32 immediately when running a build (just after SCons stops processing SConstruct and starts to build targets)and getting these missing .h files?

1>scons : warning : you do not seem to have the pywin32 extensions installed;
1>      parallel (-j) builds may not work reliably with open Python files.
1>  File "c:\Python27\Scripts\scons.py", line 199, in <module>
1>  scons: Building targets ...
corekase commented 8 years ago

@Griefchief I am not receiving such an error at all. I would suggest you correct your Pywin32 installation.

Edit: Also, I have successfully built parallel builds about 4 more times with SCons 2.4.1.

Griefchief commented 8 years ago

@headkase Thank you, this rules out Pywin32 probably... One fine folk from freenode # SCONS says that pywin32 is mostly about python file objects (generated by open() and file())... I guess it's probably not pywin32 related, we'll probably need investigate what gau-veldt mentioned.

Griefchief commented 8 years ago

gau-veldt I see 5-6 cl.exe processes in my Windows Task Manager when i run the build so... i am guessing processes, i remember something about WTM only showing processes...

Your hunch is most probably correct, though i have no idea how to test it. Any testing advice? Some tool that checks what uses what temp file or something?

Griefchief commented 8 years ago

I am on this.

This seems to be a generated .h dependency file issue in at least one of the cases (SCons needs to be told manually about the dependency probably, i talked to kind folks at SCONS freenode)...

Specifically, we seem to be generating .h files in:

drivers\gles2\shaders

and some of them are erroring...

Is anybody else getting errors that are not related to glsl materials .h files in their builds? If so, please post here?

Thank you.

corekase commented 8 years ago

@Griefchief Please keep in mind that there is a perfectly workable solution to the issue at the moment: use SCons 2.4.1. I'm sure that with that in mind this bug report is a low priority. Every reply in this thread also emails the participants so trying to keep that to a minimum - distraction wise - would be best. If you have something very concrete to suggest then please do post however.

Griefchief commented 8 years ago

The concrete is that i am on this bug. I assigned it to myself so that other people don't duplicate work. SCons folks are also very kind in helping me with this one.

2.4.1 is not a solution, it errors for me every time. Getting reliable multijob builds is important because of developers time and because dailes for daily build providers (people that produce dailies) take 6 hours to build for all platforms. If i can help them build a bit faster at least for win, that would probably be awesome (shave half an hour or more).

corekase commented 8 years ago

@Griefchief

Hmm. Well, open your own issue then and reference to this one.

Closing, if a Godot project member feels necessary to reopen then do.

vnen commented 8 years ago

The issue is not solved and @Griefchief is looking into it, so let's reopen.

akien-mga commented 8 years ago

Using SCons 2.4.1 is not a solution. As I mentioned several times, there is an obvious bug in our buildsystem that needs to be fixed, and was probably made prominent by the addition of editor fonts a few days ago that need to be parsed into a header before compiling the rest of the editor.

Such racing conditions are pretty common as soon as build order matters, and I'm pretty confident @Griefchief will find a fix. At any rate, this issue would be a blocker for 2.1, as we won't tell any Windows user to downgrade their SCons version or switch to Linux :)

gau-veldt commented 8 years ago

@Griefchief A tempfile collision issue could be rendered null if the tempfile names were prepended with the job slot number since there's no reason two processes or threads should be running at the same time for a given job slot since while a compile is in process that slot is busy and unavailable to run another until it finishes the one it's on. Caveat: Unless each job runs its own loop deciding what to build next (in this case each job is generating new tempfiles). In that case we aren't just going to have headaches, they are going to be full-on migraines. In that case see PSS since we now have race potential on the build state (dependency) database.

PS: to verify tempname uniqueness would it be possible at add build debug code that writes lines like: job N entering cl.exe (using temp#####.file) and job N cl.exe exited (using temp#####.file) (these need a suitably unique pattern that can be scanned for easily in a filter among the rest of the wall of output [TM])? Scanning the output of a build filtered for these two messages should pair 1-to-1 (a given temp#####.file should appear in an entering message once-and-only-once before the corresponding exited message mentioning the same temp#####.file). If it enters for the same temp#####.file more than once (different jobs) before a closing message I believe that could be the basis for a big problem. Two concurrent jobs using the same temp#####.file is absolutely bad (it will be wiping and overwriting the file being used by a build in progress and the races have begun).

PSS: Were this a build order issue I'd expect it to be happening even if the build is not interleaved on multiple jobs unless the processing for multiple jobs is missing a check for dependency graph fulfillment that is performed when built via the single lane execution path. AFAIK scons itself only runs on a single controlling process and spawns the jobs as needed thus it should be doing the dependency tree management in a suitably serialized manner unless my AFAIK assertion is actually false (meaning in that case scons using multiple processes/threads for itself or the jobs themselves are loops and decide what to build next ie: potentially very racy dependency queries and writing of multiple tempfiles within the same job). This possibility needs further investigation if no tempfile naming collisions are observed across concurrent spawns.

gau-veldt commented 8 years ago

@headkase I use a line like the following to make a build log: scons platform=windows -j8 >foofoofoo 2>&1 The last bit makes sure the output and error streams both occur in the log.

corekase commented 8 years ago

Well, obviously I underestimated the importance of this issue.

@Griefchief I owe you an apology so here it is: I'm sorry if I came across as a jerk, I didn't mean to and I would hope that I haven't driven you away from this issue.

ghost commented 8 years ago

On OSX -j 2 works fine for me.

akien-mga commented 8 years ago

Not critical for the upcoming 2.1 as there is a known workaround (though it's far from optimal), so moving to the next milestone.

Griefchief commented 8 years ago

Just wanted to note that a known (propper) fix is present in this pull request here, graciously helped by the maintainer of SCons:

https://github.com/Griefchief/godot/pull/2

I need to add the same thing to compressed.h and maybe one more header and i believe i can do this for 1.1 so i'm hoping do it before 1.1 stable :) ...

But whatever milestone you wish to put, no problem :) I'm just informing folks here.

gau-veldt commented 8 years ago

@Griefchief Is there a way to fix this that doesn't require a special case every time a new automagically generated header is added? I can see such a solution blowing up quickly as the project needs more and more automagically generated headers. Escpecially with the plans for 3.0, visualscripting, multiplayer networking, etc. (I see a lot of these features requiring metacode techniques thus autogenerating lots of headers).

We need to try and find a single point of failure since, as mentioned, linux doesn't have this problem in SCons 2.5.

Griefchief commented 8 years ago

@gau-veldt Good idea, I need to check what stuff is required by the code that builds those header files, if i can, i'll put it all into a single function that does all the building depending on what you want... I'll keep it in mind.

vnen commented 7 years ago

I learned a few things about SCons, so I'll finally update mine and try to debug this.

The changes proposed in Griefchief/godot#2 seem more like a hack to me, working around the problem instead of solving it. There's even a statement in the PR saying it fails with -j6.

Griefchief commented 7 years ago

Yeah, it fails because the line was not added two other places where it needs to be added.

I'm relatively certain this is probably the correct fix cause bdbaddog has seen this issue before (he's the maintainer of SCons).

I also fixed it slightly differently.... and more hackishly... my original fix was to just place a call to the python sleep function into a "windows_sleep()" function and have it sleep for 1 to 2 seconds... and call the "windows_sleep()" function in every builder that builds header files that fail... I don't believe it is in my repo however... but easily reproducible... just put the builder function to sleep :)

Essentially, what seems to happen is that windows/python has a bug that releases the handles to the generated _h files a bit slow for the latest versions of SCons (one of the changes in the latest version of scons is that it's faster...). It also seems that the time to release seems to be related to the size of the generated file... there is a certs file i think that has several megabytes, and that one takes more then the smaller shader generated _h files... about 1 to 2 seconds... the shader files all release handles in less then 0.3 sec or so...

But in general, SCons folks reccomend putting complicated builders into their own separate processes like in the mentioned godot 2 example...

Griefchief commented 7 years ago

By the way... there is almost no way to debug handles properly and easily in windows... the process explorer (that i believe i used) takes 15 seconds or more to try to find handles to a file... in essence, unusable for the 0.3 of a second this happens in...

The problem is somewhere deep inside pythong and or/windows... And the process thing is a workaround in the sense that (I guess/believe) that processes are guaranteed to finish after handles are released... whether it's pythons GC that releases the handles or it's a deeper win issue i woudn't know...

You are welcome to try to debug it, but i've messed for more then a few days with this (more like a week) and those are my findings :)

It's not a godot issue...

Griefchief commented 7 years ago

Here is my quick and "dirty" fix for this from 3 months ago... and this one does work (on my pc... but i set the wait times to be a lot longer so it should work on probably every pc)...

https://github.com/Griefchief/godot/commit/4c3b9c2a71b69ec5922bdd47a5bde4acfe09a477#diff-45b856a77c4dd4017d6ebe6299a75cb4R1463

it has a lot of unecessary f.close() (file close) attempts that don't actually do anything but the windows_sleep() function does work...

To reeiterate, the issue seems to be that file closing in python (within the same process?) seems to be "slow"... so sleeping the thread fixes it... gives enough time for even compressed.h to close... even though that certs file has several mb...

Basically, this fix is quick and dirty if people need it now :)

While I or somebody does a propper multi process thing for builders based on Griefchief/godot#2

vnen commented 7 years ago

Oh, sorry then I misunderstood the cause. I was thinking this was problem with dependency resolution.

Since the problem is the file handles not being released in time, then I agree that spawning a new Python process to run the generator is cleaner. Doesn't seem a hard thing to do, basing on the work on Griefchief/godot#2.

For now the easiest work around is just start SCons again without cleaning. Since the files are already generated, it won't be a problem in the second time. I had to restart the build a few times but it works for me.

akien-mga commented 7 years ago

As far as I know it's still valid, but we're still not closer to having an idea about fixing it. If some SCons wizard passes by, feel free to pitch in ;)

Zireael07 commented 7 years ago

Yeah, it's still valid. And the reason I was against specifying j>1 as default recently.

vnen commented 7 years ago

Well, the one and only proposed solution is Griefchief/godot#2 (made by a co-manager of the SCons project). I'm not really satisfied with it, since it needs to change the logic for all generated files. Still, if anyone want to do it, I guess it wouldn't be so bad.