gitignoringGlob exceeding call stack - add explicit package globs to workspace?

roryc89 commented 5 months ago

On a very large monorepo project we find that Glob.gitignoringGlob exceeds the maximum callstack.

We can work around this by increasing the maximum stack size on node but we would rather not have to do this.

Another possible solution would be to allow explicit setting of globs to find packages and to not examine .gitignore files.

This would also allow generated, gitignored packages to be part of a workspace.

f-f commented 5 months ago

Let's unpack this a little:

On a very large monorepo project we find that Glob.gitignoringGlob exceeds the maximum callstack.

I'd consider this a bug. Can we rewrite Glob.gitignoringGlob to be stack-safe?

This would also allow generated, gitignored packages to be part of a workspace.

Can I ask you what's the usecase for generating packages?

In any case, you should be able to include them in the build as extra packages (i.e. not autodetected), by pointing at their path, e.g.

workspace:
  extraPackages:
    some-package:
      path: "generated-packages/some-package"

cakekindel commented 5 months ago

I recently touched the gitignoringglob implementation and have a hunch that it's a traverse that i added, i'll look now and add a stack-safe PR

cakekindel commented 5 months ago

@roryc89 can you try the commit in #1234 and let me know if that works on your repo? (also make a wish! 1234!)

cakekindel commented 5 months ago

Here's the fork with the bundle (./bin/bundle.js) if you don't want to clone and build it yourself: spago-4251576.zip

roryc89 commented 5 months ago

@f-f

I'd consider this a bug. Can we rewrite Glob.gitignoringGlob to be stack-safe?

If it's possible to do this and keep it reasonably performant on a large project, sounds good to me.

Can I ask you what's the usecase for generating packages?

They are for graphql schema .purs files that we generate from schema introspection.

Your example using extraPackages would work fine for us so it wouldn't be needed for us. Having to add 100s of extra packages will be a bit noisy but I understand our use case is probably not common.

roryc89 commented 5 months ago

@cakekindel Sorry, this was due to a mistake on my part.

We had our .gitignore in the parent directory of our purs workspace. So this meant that we were traversing all of the dirs outside of version control.

Your changes did prevent the call stack being exceeded but it took too long to run and after a few minutes of waiting I cancelled the compilation.

@f-f I still think that allowing explicit package globs would be useful in our case as we prefer not to have multiple .gitignores in our repo but also this is not a big deal and I'd be happy to close this issue if it's not something that is relevant to most users.

f-f commented 5 months ago

@roryc89 I am interested in learning about your usecase because often it's possible to arrange the build in a way that can be addressed by existing features. That's preferrable to adding features to precisely address any given build, since in the long term it leads to a situation where there are many switches that address overlapping concerns.

For example, would it be possible for you to generate all the sources in a single project instead of creating 100s of different projects? In that way you could only have one extra package to add to the workspace file.

Is there a reason for avoiding multiple gitignores? I would consider these few lines of gitignore worthwhile if they would avoid implementing an entirely new feature in spago. We used to ask git if a file was gitignored, but that was not very performant. I wonder if there's a way to ask git about the ignored state of an entire tree. Or I wonder if we could climb up the tree and find the root .git folder, and start reading .gitignore files from there.

roryc89 commented 5 months ago

@f-f sure thing.

For example, would it be possible for you to generate all the sources in a single project instead of creating 100s of different projects?

We have many apps that have quite fine grained permissions on what parts of the Graphql APIs they can access. We prefer to not to have unnecessary dependencies (for faster packages builds) so we have split out the generated schema files into quite a few packages. Most of the packages are shared dependencies of the various schema packages and we typically install a few schema packages per app.

Is there a reason for avoiding multiple gitignores?

We originally found it easier to manage all our gitignore paths in a single file. With multiple .gitignoress duplication tends to creep in. But, to be honest, we have ended up with a some extra ones anyway (in others parts of the repo), so it's not something we have been strict on and I really don't see it as a big deal.

I think that allowing spago workspaces to be decoupled from git could be useful for other users but I also understand that it may not worth building/maintaining the extra config option if it's only useful to a few users.

f-f commented 5 months ago

We have many apps that have quite fine grained permissions on what parts of the Graphql APIs they can access. We prefer to not to have unnecessary dependencies (for faster packages builds) so we have split out the generated schema files into quite a few packages. Most of the packages are shared dependencies of the various schema packages and we typically install a few schema packages per app.

Ah, this makes a lot of sense - thanks for the context!

I think that allowing spago workspaces to be decoupled from git could be useful for other users but I also understand that it may not worth building/maintaining the extra config option if it's only useful to a few users.

Indeed - I've been trying to keep this part of Spago as hands-free for the user as possible, but the line between "it's magic so I don't have to worry about it" and "it's magic so I don't understand how to bend it to do my thing" is very fine sometimes. I think a part of this comes down to convention-vs-configuration, and I'd still like to lean on the side of least-configuration for as long as it's comfortable.

Re git coupling: I think "parsing the gitignore" is the sweet spot here; we used to call git to ask if a file was gitignored, but that was slow and coupled to git. Now we just parse the files, so even if you're not using git you can still configure the ignored folders in the .gitignore.

purescript / spago

gitignoringGlob exceeding call stack - add explicit package globs to workspace? #1231