microsoft / BuildXL

Microsoft Build Accelerator
MIT License

How to write to Partial sealed directory #1295

Open vladimir-cheverdyuk-altium opened 2 years ago

vladimir-cheverdyuk-altium commented 2 years ago

Hi

I'm trying to run a tool from BuildXL that consumes files from some directory and writes new files to the same directory. From what I read, it looks like I have to use a partial seal directory, but I can't figure out how to specify that the output of the tool should go into that partial seal directory.

For example:

const sealedDir = Transformer.sealPartialDirectory(dir1, globR(dir1));

I cannot specify sealedDir in outputs because the types are not compatible. If I try to use sealedDir.root there, BuildXL hits an internal error. If I try to use dir1 in outputs, I get an error that dir1 coincides with the sealed directory.

What is the correct way to use it?

Thank you, Vlad

smera commented 2 years ago

Hi Vladimir,

For specifying inputs, you can either declare input files individually or use sealed directories. There are different flavors of sealed directories: for inputs you can use either a full/partial seal or a source seal directory. At some level all these artifacts (including individual files) are roughly equivalent; going finer-grained, the differences show up in how strict the declarations are (e.g. allowing outputs to be present in the same cone, letting any file be part of a tool's sources as long as it is placed under a given cone, etc.). The other aspect of sealed directories vs. individual files is that sealed directories let you treat all their members as a single entity (the seal directory) that you can reference and pass around.

For outputs, you have a similar choice. You either declare the outputs individually (choosing whether they are mandatory or optional) or use (sealed) output opaque directories. In your case, if sources and outputs are all intermixed, you can either declare the outputs explicitly or use shared opaque directories, which allow sources to be present in the same cone as well as multiple producers. The main difference between these approaches are
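A minimal DScript sketch of the two sides Serge describes, with a partial seal as input and a shared opaque as output over the same root (the tool definition `myTool` and the paths are placeholders, not code from this thread):

```typescript
// Sketch (assumed names): sources and outputs intermixed under one root.
const srcRoot = d`src`;

// Partial seal directory: globbed sources passed around as a single entity.
const sources = Transformer.sealPartialDirectory(srcRoot, globR(srcRoot, "*.cpp"));

const result = Transformer.execute({
    tool: myTool,                                // placeholder tool definition
    arguments: [Cmd.argument("build")],
    workingDirectory: srcRoot,
    dependencies: [sources],                     // the seal directory as one input
    outputs: [
        {directory: srcRoot, kind: "shared"},    // shared opaque: sources may coexist
    ],
});
```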

Thanks, Serge

vladimir-cheverdyuk-altium commented 2 years ago

Hi Serge

Thank you for such a detailed explanation.

May I ask a slightly different question? I have a solution in some directory, and it contains multiple C++ projects located in sub-directories. During compilation, each project directory gets an x64 directory created with intermediate output. I don't really care about that output, because the actual results are produced in a different directory.

Structure looks like this:

Root
  Prj1 source code
  Prj1\x64 - .lib, .obj etc
  Prj2 source code
  Prj2\x64 - .lib, .obj etc
  ...
  Prj100 source code
  Prj100\x64 - .lib, .obj etc
  root.sln
  root.vsprojx

What is best to use in this case? I tried using source seals for Prj1 ... Prj100 and regular output directories for Prj1\x64 to Prj100\x64, but it is quite painful, especially considering that in reality each project has a unique name.

I was planning to declare Root as a partial seal directory, because the x64 directories are cleaned after checkout from VCS, but I don't know how to specify the intermediate output.

Thank you, Vlad


smera commented 2 years ago

If you have multiple projects generating intermediate outputs under each project root, an easy way to deal with that, without specifying each project root or individual outputs, is to declare a catch-all shared opaque directory at the root of your source tree. E.g. if you have projects src/prj1, src/prj2, etc., you can declare a shared opaque at 'src'. That allows any output produced under that folder (and recursively below).

This shared opaque can be declared for each pip (tool execution), which means each pip will have an output directory that can be consumed by downstream tools. Each output directory (even though they all share the common 'src' root) will contain only the outputs generated by the corresponding pip. So if you have a downstream tool that creates a final deployment (presumably outside of the 'src' structure), that tool needs to take a dependency on all the output directories representing the intermediate outputs.

Hope this helps; please feel free to ping me if you have additional questions.
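A rough DScript sketch of this layout (the tool definitions, project paths, and the `getOutputDirectory` accessor are assumptions based on the public Transformer API, not code from this thread):

```typescript
// Sketch: each project pip declares a catch-all shared opaque at the common
// 'src' root; each resulting output directory contains only that pip's files.
const src = d`src`;

const prj1 = Transformer.execute({
    tool: compilerTool,  // placeholder
    arguments: [Cmd.argument(Artifact.input(f`src/prj1/prj1.vcxproj`))],
    outputs: [{directory: src, kind: "shared"}],
});

const prj2 = Transformer.execute({
    tool: compilerTool,  // placeholder
    arguments: [Cmd.argument(Artifact.input(f`src/prj2/prj2.vcxproj`))],
    outputs: [{directory: src, kind: "shared"}],
});

// The downstream deployment pip depends on all intermediate output directories.
const deploy = Transformer.execute({
    tool: deployTool,    // placeholder
    arguments: [],
    dependencies: [prj1.getOutputDirectory(src), prj2.getOutputDirectory(src)],
    outputs: [{directory: d`out/deployment`, kind: "shared"}],
});
```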

Thanks, Serge.

vladimir-cheverdyuk-altium commented 2 years ago

How should I declare the src directory? I tried a source seal and a partial seal, and every time I got an error.

Vlad

smera commented 2 years ago

The problem with a source seal directory is that it doesn't allow outputs to occur under it. A better candidate for your scenario is probably a partial seal directory, where you can glob for all sources, and which can be declared as an input for your pips.

Thanks, Serge.

vladimir-cheverdyuk-altium commented 2 years ago

It does not work; I get this error message:

C:\Projects\BuildXL\Test2\Hello.World.Project.dsc(49,33): error DX14109: Invalid graph since 'C:\Src', produced by 'Pip10B5B892974AE966, MSBuild.exe, HelloWorld, compileApiPip, {}', coincides with the sealed directory 'C:\Src', produced by 'Pip5A7E2A94930175B5, <SEALDIRECTORY>, HelloWorld, apiDir, {}, => Src(2765 entries) || 'Src' [2765 files - 0 output directories]'.

Code is here:

const apiDir = Transformer.sealPartialDirectory(apiRoot, globR(apiRoot));
const apiDirOut: Directory = apiRoot;

export const compileApiPip = Transformer.execute({
    tool: msBuild16Tool,
    arguments: [
        Cmd.argument(Artifact.input(slnFile)),
        Cmd.argument("/target:Rebuild"),
        Cmd.argument("/p:Configuration=Production"),
        Cmd.argument("/p:Platform=x64")
    ],
    allowedSurvivingChildProcessNames: [
        "conhost.exe",
        "mspdbsrv.exe",
        "MSBuild.exe"
    ],
    dependencies: [
        prjFile,

        apiDir,
    ],
    implicitOutputs: [apiDirOut],
    outputs: [
        outDir
...

Perhaps I'm doing something wrong?

smera commented 2 years ago

Would you mind sharing the full code? I suspect 'outDir' in this case is an exclusive opaque directory, which wouldn't allow any inputs underneath. In order to specify that output as a shared opaque directory, you can follow the examples here.

vladimir-cheverdyuk-altium commented 2 years ago

outDir is a mount pointing to a different directory. I removed as much as possible from the project and attached it. Create an empty C:\Code\Src\Api and an empty C:\Code\Installation, then run the project.

I get the following error:

C:\Projects\BuildXL\Test3\Hello.World.Project.dsc(34,34): error DX14109: Invalid graph since 'C:\Code\Src\Api', produced by 'Pip58AFDC910DE2DC8E, MSBuild.exe, HelloWorld, compileApiPip, {}', coincides with the sealed directory 'C:\Code\Src\Api', produced by 'Pip4E08E5F54D725457, <SEALDIRECTORY>, HelloWorld, apiDir, {}, => Api(0 entries) || 'C:\Code\Src\Api' [0 files - 0 output directories]'.

Test3.zip

Thank you, Vlad

smera commented 2 years ago

Test3-commented.zip Tweaked some things and also added some general comments across the board. Please take a look and feel free to ask any questions you may have.

Thanks, Serge.

vladimir-cheverdyuk-altium commented 2 years ago

Hi Serge

Thank you for your help. I found that I had accidentally used the wrong type of directory; it looks like I clicked the wrong link in the wiki. I appreciate the comments as well. I took config.dsc and module.config.dsc from the HelloWorld example and didn't change them.

I agree with the comments about untrackedDirectories. I just need something working first and will get to the details later. The same applies to the qualifiers.

I have 2 more questions:

  1. I see that the content of outDir is wiped. Usually the Installation directory contains some static files that will be included in the installation, and the pip just adds dynamically generated files to it. Is it possible to prevent this directory from being wiped?
  2. My understanding is that whatever a pip produces is put into the cache, keyed by a strong fingerprint of its inputs. On the next run, if the strong fingerprint hasn't changed, BuildXL just takes the data from the cache. Is it possible to host this cache in some central location so it can be used from different computers?

Vlad

smera commented 2 years ago

If 'outDir' will contain some files upfront, besides the ones produced by the tool, I suggest you turn that directory into a shared opaque as well. Exclusive opaque directories are wiped because they are not allowed to contain sources or outputs from other tools (that's the 'exclusive' part), so wiping them guarantees build determinism. Making it a shared opaque allows sources and other outputs to be there.
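In DScript terms, the suggestion amounts to something like the following fragment (a sketch reusing `outDir` from the earlier snippets in this thread):

```typescript
// Sketch: declaring outDir as a shared opaque instead of the default
// exclusive opaque, so pre-existing static files under it are not wiped.
outputs: [
    {directory: outDir, kind: "shared"},
],
```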

Regarding a shared cache across machines: yes, that's an existing cache feature. Unfortunately, we don't have public documentation for it yet. In essence it's a service that has to be configured and has some tricky deployment steps. I opened an internal work item to write some minimal documentation around this scenario and will let you know when we have that available.

Thanks, Serge.

vladimir-cheverdyuk-altium commented 2 years ago

Thank you, Serge. It works great with outDir. I will do more testing, but it looks like I have everything I need.

I will wait for documentation.

Thank you again for your help.

vladimir-cheverdyuk-altium commented 2 years ago

Hi Serge

I have another question about Test3-commented.zip:

        outputs: [
            // [Serge] You may want to turn this one into a share opaque as well if you plan to have other tools producing outputs under the same root
            outDir,
            // [Serge] changed this one to be a shared opaque. By default a directory being added as an output represents an exclusive opaque
            // An exclusive opaque directory does not allow sources or other opaque directories under its root. So the error you were getting
            // was about having this apiDirOut declared as an exclusive one, plus apiDir declaring sources under the same root
            {directory: apiDirOut, kind: "shared"}

Every build of the solution produces 63 MB of files into outDir. But it also produces 11 GB of intermediate content (lib, obj, pch, and other files) that nobody cares about. All of it goes into apiDirOut, and it looks like it is placed into the cache. As a result, on the second run BuildXL complains there is no space in the cache. I can obviously increase the cache size, but perhaps there is a way to discard this content? That would save the time spent registering it in the cache and obviously save a lot of space too.

Vlad

smera commented 2 years ago

Hi Vladimir, As a general consideration, my recommendation is to increase the cache size. You can pass a cache configuration file with /cacheConfigFilePath:. Some extra points though:
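For reference, a local cache configuration file is a small JSON document. The sketch below is hedged: the adapter assembly/type and field names follow BuildXL's default local cache setup as I understand it, but treat every value as an assumption to verify against your BuildXL version's documentation:

```json
{
  "Assembly": "BuildXL.Cache.MemoizationStoreAdapter",
  "Type": "BuildXL.Cache.MemoizationStoreAdapter.MemoizationStoreCacheFactory",
  "CacheId": "LocalCache",
  "CacheLogPath": "[BuildXLSelectedLogPath]",
  "CacheRootPath": "[BuildXLSelectedRootPath]",
  "MaxCacheSizeInMB": 20480
}
```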

Thanks, Serge.

vladimir-cheverdyuk-altium commented 2 years ago

Hi Serge. But if the cache holds such huge data and the cache is on a different computer, that means transferring 11 GB to and from the cache. That was my main concern with increasing the cache size.

Thank you, Vlad

smera commented 2 years ago

If you build with a filter pointing at outDir (see path-based filters), then as long as you get cache hits you won't need to transfer data from the cache. E.g. if you have a chain of projects like A -> B -> C (the arrow meaning a dependency) and both A and B are cache hits while C is a miss, only the outputs produced by B will be brought from the cache for C to consume (assuming A does not produce files that go into outDir). This is called lazy materialization, and it is the default behavior of BuildXL.

For untracking directories, that's something that can be specified at the pip level when calling Transformer.execute(...); you can take a look at that here. The option is unsafe for the reasons I mentioned before: even if the outputs are intermediate (meaning they are not part of outDir), you'd have to be sure they are not consumed elsewhere in the build (e.g. a project produces an obj in some temp directory that is later consumed by some other project to produce the final executable). If that were the case, you'd need those files in the cache so they can be replayed properly.
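Concretely, pip-level untracking goes through the `unsafe` block of `Transformer.execute`. A sketch reusing names from the earlier snippets in this thread; the exact x64 paths are placeholders:

```typescript
export const compileApiPip = Transformer.execute({
    tool: msBuild16Tool,
    arguments: [Cmd.argument(Artifact.input(slnFile))],
    dependencies: [apiDir],
    outputs: [{directory: outDir, kind: "shared"}],
    unsafe: {
        // Accesses under these cones are neither tracked nor cached.
        // Only safe if nothing downstream consumes these intermediates.
        untrackedScopes: [d`src/prj1/x64`, d`src/prj2/x64`],  // placeholders
    },
});
```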

Thanks, Serge.

vladimir-cheverdyuk-altium commented 2 years ago

Hi Serge.

None of these intermediate files are consumed later, so I can untrack them. But as I mentioned in the 3rd post, there are quite a few projects there, and anybody can add a new project (it has already happened twice), whose intermediate data will then go into the cache.

Just in case, I will explain the full picture. We have a build process that works fine right now, but it rebuilds everything every single time. I would like to improve it by using BuildXL for one of the big solutions. For now I will have only one pip, which builds that solution. If no inputs change, BuildXL will restore the outputs from the cache and speed up the build process.

So, if I understand correctly, because it is a single pip, BuildXL has to materialize all outputs, including intermediate files. In my case, if somebody adds Prj101, then at the end of the build BuildXL will put everything from Prj101\x64 into the cache. That is already bad, because it costs extra time and traffic.

But on the next run, even if nothing changed, BuildXL has to materialize everything into Prj101\x64. That is necessary because the build cleans everything at the start. The real output of the project is quite small, but there are a lot of intermediate files.

As a result, there is extra time spent storing intermediate data whenever anything changes, and extra time materializing unnecessary data when nothing changes.

I could write code that scans the solution and all its projects to find where they output intermediate files, and then generates the BuildXL script on the fly, but that is not an easy task and would probably have issues of its own.

That is why I asked whether we can ignore that intermediate directory. If we could, it would take only a little code, and everything would remain reliable as the projects change in the future.

Thank you, Vlad

smera commented 2 years ago

Hey Vladimir, Do you have a shared cache in place? (That is, a cache shared by multiple machines.) I'm asking because without one, and with your build being a single pip, the benefits of using bxl are probably drastically reduced: you'll only get a cache hit when nothing changed since the last local build. That is a possible scenario, but not a very probable one. It would be different if your build were made of multiple pips, where source changes may affect only a subset of them. If your build is an MSBuild one, have you tried the bxl MSBuild resolver? That should automatically partition your build into one pip per project.

Coming back to the untracked-files question. Today this is statically provided data, in the form of individual files or directory cones. This means you need to know the location of the artifacts you want to untrack upfront. I understand that the dynamic nature of projects being added and removed can make this hard to keep in sync, considering you want to untrack all intermediates of each project. If you could make all projects put their intermediates under a common folder (e.g. /out/obj), that would be easier to maintain.

And one last point: bxl uses hardlinks by default, so there is actually no extra space taken by files in the cache; the outputs that you see (final or intermediate) are actually hardlinks from the local cache. There is of course some natural overhead when tracking/materializing more files, but I'd be curious to see the real impact of untracking all intermediates vs. not.

Thanks, Serge.

vladimir-cheverdyuk-altium commented 2 years ago

Hi Serge

I'm not using a shared cache yet. You said there will eventually be instructions on how to set it up, and as you said, without it there is not much benefit. My project is in the research phase right now, and I'm trying to see what I can do as a first step to improve build time.

I tried the msbuild resolver for some time, got a bunch of internal errors, and gave up.

I can change the projects to output all intermediate files into one common directory, but we have quite a few such projects, and I was trying to figure out whether it is possible to avoid doing that, because it is a lot of work.

As for the last paragraph: yes, I saw that BuildXL uses hard links. But if the data is not in the local cache, it has to be materialized over the network, which means transferring a lot of data.

Thank you, Vlad

hayhurst-zz commented 1 year ago

Hi, I was wondering if there was ever any documentation or information provided about setting up a shared cache?