gslab-econ / gslab_python

Python tools for GSLab
MIT License
13 stars 11 forks source link

Propose solution to scons target out-of-date issue #61

Closed lboxell closed 7 years ago

lboxell commented 7 years ago

See https://github.com/gslab-econ/congress_text/issues/172#issuecomment-293354138.

lboxell commented 7 years ago

Adding the following to template fixes the issue but makes things run more slowly because we have to parse the .sconsign.dblite many times and calculate the content signature. However, there should be ways to speed things up and this gives a general outline of what to do to fix the situation. I tested it on the template by manually editing build/data/data.txt and re-running scons, which caused the build_r(["build/data/data.txt"], ["source/data/build.R"]) line to be run, but then nothing else (using the default decider by scons would cause everything downstream of build/data/data.txt to be re-run, but not build/data/data.txt itself).

def parse_sconsign_for_csig(tgt):
    sconsign = os.popen('sconsign --csig .sconsign.dblite').readlines() # http://stackoverflow.com/questions/3503879/assign-output-of-os-system-to-a-variable-and-prevent-it-from-being-displayed-on
    tgt_dir = '/'.join(tgt.split('/')[0:(len(tgt.split('/')) - 1)])
    tgt_file = tgt.split('/')[len(tgt.split('/')) - 1]
    dir_flag = 0
    for line in sconsign:
        if dir_flag == 1:
            if re.search('^' + tgt_file, line):
                csig = line.split(' ')[1]
                return csig
        else:
            if re.search(tgt_dir, line):
                dir_flag = 1

def new_decider(dependency, target, prev_ni):
    tgt = str(target)
    tgt_csig = parse_sconsign_for_csig(tgt).strip()
    if tgt_csig != target.get_csig(): 
        return True
    else:
        return dependency.changed_timestamp_then_content(target, prev_ni)

env.Decider(new_decider) 
lboxell commented 7 years ago

@gentzkow @jmshapir,

In https://github.com/gslab-econ/congress_text/issues/172#issuecomment-293354138, we discovered a pretty fundamental flaw in how we've been using the scons framework with version control. Consider the following example.

Assume the directory has the following tree:

source/source1 --> build/data1 --
                                  \
                                    --> build/output1
                                 /
              source/source2 --

(1) Person A and Person B both have a complete data build. (2) Person B updates the data build by modifying source1 and commits source + the .sconsign.dblite file to github. (3) Person A pulls the changes, which will update his .sconsign.dblite file and source code, but not change his build folder. (4) Person A runs scons. Then scons will view the build/data1 file as having been changed, but will not consider source/source1 to have been changed because it is consistent with the .sconsign.dblite that he just pulled from github. This will cause everything downstream of build/data1 to be re-run (i.e., build/output1), but it will not re-build build/data1. Thus, the build becomes out-of-sync, because we'd actually want build/data1 to be re-run too.

It seems like we want any target to be re-built if it is out-of-date. That is, we want all targets to be their own source (which scons doesn't allow by default because it doesn't allow loops in the tree).

My comment above provides a hack around this. Essentially, it creates a new decider function that first checks whether the content signature of the current target is the same as the content signature of the target with the same name in the .sconsign.dblite and re-builds if not. If they are the same, it then follows scons' standard procedure for deciding whether to rebuild based on whether the sources have changed.

I wanted to: (a) notify you of this issue, as it may be relevant to work in other, non-GSLab places. (b) see whether you think my hack is the right way to proceed in order to address this issue.

lboxell commented 7 years ago

See template branch here with edited code.

gentzkow commented 7 years ago

Thanks @lboxell.

I worry that adding hacks like this is going to move us toward a progressively more buggy and less stable system. Whenever possible we want to be using SCons the way it was designed to be used. If that runs into problems, it probably means we're doing something wrong.

I guess our original sin here (i.e., deviation from the way SCons would be used in a software company) is versioning the .sconsign.dblite files. The original goal of that was to prevent people from having to rebuild an entire repository from scratch the first time they check it out. Can you summarize from the team's perspective what would be the downsides of moving to a model where we don't version .sconsign.dblite?

lboxell commented 7 years ago

Email from @mirajshah at Brown.

(1) A post from the SCons mailing list as of Jan 2017 shows that resolving this issue is an area of active development. There are some other posts in the mailing list from several years ago in which the SCons maintainers mention specifically that they do not want to account for targets being modified by an "outside force", as the extra runtime cost of checking the targets on every build would not be worth the benefit of absolute integrity of the dependency tree in such (apparently rare) cases. However, as is evident from the more recent linked thread, they do now want to give users the option. (2) Similar to your instinct, we believed that we needed to solve this problem immediately, as we will very often trade off runtime to gain correctness in a research context. Using the commit in the linked thread above as a reference, I wrote a simple decider function which additionally rebuilds if a target has changed but a source has not. It uses the native properties of the SCons Node objects, and it runs quite smoothly with a seemingly minor additional time cost. I realize you have a working implementation, however if you are interested, please try it out and let me know your thoughts/suggestions.

lboxell commented 7 years ago

@gentzkow,

FYI, see comment above. It looks like the Hastings-Shapiro lab implemented a similar solution (I'm less inclined to call it a "hack" now) to the issue and it has at least been discussed as a possible feature to be added to SCons.

I'll try and talk to the team tomorrow to see what we think about the pros-cons of both options.

lboxell commented 7 years ago

@gentzkow,

Overall summary of team's (@M-R-Sullivan, @arosenbe, and myself) perceptions: We think its worthwhile at least trying out the new decider function to see how it works in practice.

More details:

Pros of not versioning .sconsign.dblite:

Cons:

gentzkow commented 7 years ago

Thanks.

Did you guys look at the repository method http://scons.org/doc/2.1.0/HTML/scons-user/c3932.html? Does it look like this could be a substitute for cache? --

Matthew Gentzkow Professor of Economics Stanford University

lboxell commented 7 years ago

@gentzkow,

I recant on my previous statements. While looking into the repository method, I realized my understanding of cache was wrong. It looks like scons stores things based on their build signature in the cache and that it can calculate the build signature of a target without having access to the .sconsign.dblite. The below shows this:

siepr-mg-ra-imac-1:template leviboxell$ rm -rf .sconsign.dblite 
siepr-mg-ra-imac-1:template leviboxell$ rm -rf build
siepr-mg-ra-imac-1:template leviboxell$ rm -rf release
siepr-mg-ra-imac-1:template leviboxell$ scons mode=cache
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
Retrieved `build/data/data.txt' from cache
Retrieved `build/analysis/plot.eps' from cache
Retrieved `build/analysis/table.txt' from cache
Retrieved `build/paper/ondeck.pdf' from cache
Retrieved `build/paper/online_appendix.pdf' from cache
Retrieved `build/tables/table.lyx' from cache
Retrieved `build/paper/paper.pdf' from cache
Retrieved `build/paper/text.pdf' from cache
Retrieved `build/talk/slides.pdf' from cache
Install directory: "build/paper" as "release/paper"
Install directory: "build/talk" as "release/talk"
state_of_repo(["state_of_repo.log"], [])
scons: done building targets.

Thus, it looks like if we don't version the .sconsign.dblite, then we can still use the cache when switching branches or a fresh clone. The key thing would be to make sure we delete the .sconsign.dblite whenever we switch branches, because otherwise, we still have the out-of-date target issue.

So, conditional on there being a way to force (or strongly encourage) people to delete their .sconsign.dblite when they switch branches, I think we can just stop versioning .sconsign.dblite and just rely on the cache.

There is still an issue of potential manual edits to a file that screw things up downstream, but I think that issue is smaller. Interestingly, if you remove the .sconsign.dblite on a complete build and re-run scons (but not in cache mode), then its rebuilds everything. So its only in cache mode that it checks the build signature first and compares to items in the cache.

gentzkow commented 7 years ago

OK. That's good news.

Can you remind me what the build signature is? This is some combination of the hashes of all the input files so it allows SCons to tell that a file stores in the cache is the correct one to retrieve?

As far as switching branches: Can you remind me what the out of date target issue would look like in this case? I would have thought this would be different than the Person A / Person B example above, because if I had a complete build and then switched branches I would have changed the source code but not .sconsign.dblite and so SCons would rebuild the target. But I haven't thought hard about it so I may be missing the problem.

What does seem true is that working with branches a lot would require cache mode to be efficient because otherwise you'd be rebuilding every time you switch branches.

In any case, once we're not versioning .sconsign.dblite, we should be using SCons exactly the way it's used in a standard software development context, so I would have expected that getting it to play nice with Github branches would be a solved problem.

lboxell commented 7 years ago

@gentzkow,

Your statement regarding the build signature matches my understanding ("some combination of the hashes of all the input files so it allows SCons to tell that a file stores in the cache is the correct one to retrieve").

For the out-of-date target issue, I think you're right again. I agree now that I don't think there should be an issue with switching branches because the .sconsign.dblite doesn't change, so, as you say, scons perceives a source file change and re-builds (or pulls the correct file from cache based on the build signature).

So, I think we should just stop versioning the .sconsign.dblite and the issue is solved.

gentzkow commented 7 years ago

OK. We're all good then.

Out of curiosity: Did you get a sense of how the repository option works? Any reason you'd see we ought to be using that? --

Matthew Gentzkow Professor of Economics Stanford University

lboxell commented 7 years ago

@gentzkow,

The repository and cache are somewhat similar, but from my understanding there to two main differences: (1) The repository can store source files, not just targets. (2) If you use a repository, then scons essentially looks at the union of the repository and your local directory to create the build and treats objects in the repository as if they were sitting in your local directory. Files in the repository are stored by their filename (not build signature). So if you want to use the repository method to also replace derived files, then you need to make sure the appropriate .sconsign file is stored in the repository to track these build signatures.

It seems like the repository method would be useful for storing items that we don't expect to really change across branches or will be relatively stable throughout the project. So it could be useful for raw data (i.e., we store the raw data for a project in Dropbox, which gets synced to everyones local machines and we just specify this as our "Repository" so scons grabs the data from their instead of having a separate download raw data step.).

For derived files that we expect to change frequently in a project, the cache seems the appropriate choice as my understanding is that the repository method can only store one "version" of a derived file.

gentzkow commented 7 years ago

Ok. Sounds like the repository method is worth keeping in mind as an option. --

Matthew Gentzkow Professor of Economics Stanford University

lboxell commented 7 years ago

Yup. Added a pointer in the admin outline to my comment above.

Closing this task.

Summary: