Closed nh2 closed 2 years ago
Also from the docs:
The reason for providing only a very limited form of wildcard is to concisely express the common case of a large number of related files of the same file type without making it too easy to accidentally include unwanted files.
This rationale makes sense to motivat not adding super sophisticated globbing features, there's no point artificially restricting trivial use cases.
Agreed that this restriction isn't pulling its weight. The implementation would actually have been simpler if this restriction had never existed, but we don't live in that world.
The only snare is that we'll want to have the less-restricted globs only if the cabal-version
is sufficiently high and the previous behaviour if not, so there's no reduction in complexity. But defaults do matter and we should make the easy cases easy; the details about the history of the syntax allowed can be shunted off into an appendix or something, and I think we can test it well enough to give confidence in the correctness of the implementation.
This is a big pain for Idris in getting it working with new-install
. Every test for the compiler has files named run
and expected
, so we can't use wildcards as they are.
Iirc the extension is there to prevent sdisting tmp files, e.g. .dataFile.txt.swp or dataFile.txt~ or whatever hidden there.
And we don’t want maintain the list of all possible tmp file variants either.
On 13 Apr 2019, at 17.51, Niklas Larsson notifications@github.com wrote:
This is a big pain for Idris in getting it working with new-install. Every test for the compiler has files named run and expected, so we can't use wildcards as they are.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
How about at least letting '.' be considered a filename character? Cabal is currently telling me that
foo/*.template
isn't matching foo/bar.yml.template
.
Iirc the extension is there to prevent sdisting tmp files, e.g. .dataFile.txt.swp or dataFile.txt~ or whatever hidden there.
And we don’t want maintain the list of all possible tmp file variants either.
@phadej Then let's try to find another solution to exclude those, such as allowing the user to specify patterns to exclude.
Enabling a convenience feature (avoiding sdisting tempfiles) by making a normal use case completely impossible (globbing normal extensioneless files) is a severe restriction and makes cabal inconvenient to use.
The rest of the unix world won't add extensions to their files so that cabal can glob them.
@nh2, please tell when you find one.
"Allow user to specify patterns to exclude" is waiting for mistakes to happen: Including unnecessary files in sdist tarballs. And a mistake which is difficult to catch on CI (missing file is obvious).
So I'm simply :-1: on this, before some good and concrete plan forward is written down.
@chreekat use cabal-version: 2.4
https://cabal.readthedocs.io/en/latest/developing-packages.html#pkg-field-data-files
Prior to Cabal 2.4, when matching a wildcard plus extension, a file’s full extension must match exactly, so .gz matches foo.gz but not foo.tar.gz. This restriction has been lifted when cabal-version: 2.4 or greater so that .gz does match foo.tar.gz
@phadej one solution would be to do what hpack does: ignore hidden files (starting with '.') and otherwise just use Unix globs. I believe this would cover both cases you mentioned: .dataFile.txt.swp would be ignored, and dataFile.txt~ would not match *.txt
.
It's true that there is no perfect solution to this, but from a Linux user's perspective, some things are more surprising than others. Cabal's behavior is pretty far out there.
Oh I just saw you pinged me as well. Thanks for the tip!
ignore hidden files (starting with '.') and otherwise just use Unix globs. I believe this would cover both cases you mentioned: .dataFile.txt.swp would be ignored, and dataFile.txt~ would not match
*.txt
.
This sounds like a good approach to me: Ignoring dotfiles is a common behaviour, and for the case that one wants them, one could give data-files: .*
explicitly.
@phadej Why can't I use the form foo/**/Makefile
as a wildcard? There is no risk for any confusion there.
+1 for removing the restrictions from globs when specifying data-files.
I have a directory data
where I have a lot of template files which I need packaged with the binary. There are dozens of these files.
So far I was using Stack and I had following config:
data-dir: data/
data-files:
- Generator/templates/**/*
- Cli/bash-completion
- Cli/templates/**/*
That was great!
This is what hpack generated, in .cabal file:
data-files:
Generator/templates/db/schema.prisma
Generator/templates/Dockerfile
Generator/templates/dockerignore
Generator/templates/react-app/gitignore
Generator/templates/react-app/netlify.toml
Generator/templates/react-app/package.json
Generator/templates/react-app/public/favicon.ico
Generator/templates/react-app/public/index.html
Generator/templates/react-app/public/manifest.json
Generator/templates/react-app/README.md
Generator/templates/react-app/src/actions/_action.js
Generator/templates/react-app/src/api.js
Generator/templates/react-app/src/auth/forms/Login.js
Generator/templates/react-app/src/auth/forms/Signup.js
Generator/templates/react-app/src/auth/login.js
Generator/templates/react-app/src/auth/logout.js
Generator/templates/react-app/src/auth/pages/createAuthRequiredPage.js
Generator/templates/react-app/src/auth/signup.js
Generator/templates/react-app/src/auth/useAuth.js
Generator/templates/react-app/src/config.js
Generator/templates/react-app/src/index.css
Generator/templates/react-app/src/index.js
Generator/templates/react-app/src/logo.png
Generator/templates/react-app/src/operations/index.js
Generator/templates/react-app/src/operations/resources.js
Generator/templates/react-app/src/queries/_query.js
Generator/templates/react-app/src/queries/index.js
Generator/templates/react-app/src/queryCache.js
Generator/templates/react-app/src/router.js
Generator/templates/react-app/src/serviceWorker.js
Generator/templates/react-app/src/utils.js
Generator/templates/server/gitignore
Generator/templates/server/npmrc
Generator/templates/server/nvmrc
Generator/templates/server/package.json
Generator/templates/server/README.md
Generator/templates/server/src/actions/_action.js
Generator/templates/server/src/app.js
Generator/templates/server/src/config.js
Generator/templates/server/src/core/auth.js
Generator/templates/server/src/core/auth/prismaMiddleware.js
Generator/templates/server/src/core/AuthError.js
Generator/templates/server/src/core/HttpError.js
Generator/templates/server/src/dbClient.js
Generator/templates/server/src/queries/_query.js
Generator/templates/server/src/routes/auth/index.js
Generator/templates/server/src/routes/auth/login.js
Generator/templates/server/src/routes/auth/me.js
Generator/templates/server/src/routes/auth/signup.js
Generator/templates/server/src/routes/index.js
Generator/templates/server/src/routes/operations/_action.js
Generator/templates/server/src/routes/operations/_query.js
Generator/templates/server/src/routes/operations/index.js
Generator/templates/server/src/server.js
Generator/templates/server/src/utils.js
Cli/bash-completion
Cli/templates/new/ext/Main.css
Cli/templates/new/ext/MainPage.js
Cli/templates/new/ext/waspLogo.png
However, I am migrating to Cabal now, and the restrictions above pose following problems:
For example, I now converted the above list of data files to:
data-files:
Generator/templates/Dockerfile
Generator/templates/dockerignore
Generator/templates/react-app/gitignore
Generator/templates/server/gitignore
Generator/templates/server/npmrc
Generator/templates/server/nvmrc
Generator/templates/**/*.prisma
Generator/templates/**/*.toml
Generator/templates/**/*.json
Generator/templates/**/*.ico
Generator/templates/**/*.html
Generator/templates/**/*.md
Generator/templates/**/*.js
Generator/templates/**/*.css
Generator/templates/**/*.png
Cli/bash-completion
Cli/templates/**/*.css
Cli/templates/**/*.js
Cli/templates/**/*.png
But I just know, that at some point, some team member will add a file to Generator/templates/
that has extension which we didn't cover here, and we will all spend some time figuring out why is that file not there -> even worse, we might not catch it with tests and ship it.
I would advocate for more relaxed rules, where *
for the whole file name including extension is allowed, and if possible trash or tmp files are a problem, let user care about that, or give them the ability to exclude such files with additional rules.
I would much rather by accident include a .swp file than by accident not include a file I need and risk a runtime error.
Please tell how to avoid cabal sdist
picking up .DS_Store
, style.css.swp
or style.css~
etc without hardcoding all possible temporory files into Cabal
.
ignored-files:
.DS_Store
*.swp
*~
Or we could bless git and have an option for including a .gitignore file.
That won't work. Package author/maintainer is not the only person who sdists. (e.g. source-repository-package
is one case).
OTOH, if people want to be able to shoot themselves into the foot, i guess there is no way to convince them it's a bad idea.
IMHO, globs were a bad idea, but then there weren't hpack
, cabal-fmt
like tools. .cabal
format shouldn't sacrifice explicitness for convenience. That is a bad trade-off.
Can you explain how it's a bad idea? I guess I have no concept of the impacted workflow you are using as a counterpoint.
IMHO, globs were a bad idea, but then there weren't
hpack
,cabal-fmt
like tools..cabal
format shouldn't sacrifice explicitness for convenience. That is a bad trade-off.
I'm very much in agreement on this. Tools like hpack
and hpack-dhall
do glob expansions well and can be a nice way to maintain an up to date and well formatted cabal file. How did we accept globs? They're hardly pure!
So should we update the docs? ".cabal format is read-only. Use a different tool [which?] to generate it."
My view is I believe we should allow specific extensions to glob expressivity which do not make it easy for users to accidentally add temp files. We should not add extensions which have a high chance of picking up temp and "junk" files.
Even in the example just given, I could very well imagine a user editing the templates in some editor they don't usually use, leaving around a tilde file, and that tilde file getting picked up in the next sdist. The case of having your directories of files you actually edit littered with all manner of temp files is not uncommon.
So while this request may look appealing at first, I expect it will cause problems for even the people who think "No, I'm too smart and careful, I wouldn't do that!" Sure, but can you say that for everyone else working on your project? Can you say that for yourself doing a last-minute sdist at 2am? Etc.
That said, I'm also very much open to ergonomic features that can improve the UX of working with this. Perhaps there could be some way to indicate to sdist that it should report all files in a directory that are not picked up by the regexps, and it can warn on check to force you to ensure that your patterns are comprehensive.
Perhaps there could be some way to indicate to sdist that it should report all files in a directory that are not picked up by the regexps, and it can warn on check to force you to ensure that your patterns are comprehensive.
Tests in CI. That's why e.g. haskell-ci
starts from making a sdist of a package, not with a checkout. That way if something is missing, CI catches it. (That's also why I don't like "read what files are there and run the procedure" -tests, as these "work" even there are no example tests in sdist. Somewhere the files should be listed explicitly!)
I was thinking e.g. desired-data-files:
which cabal check could diff against the actually enumerated files. 🤔
I have to admit, in the use case I described above, which is an everyday use case for me and my team, I have never thought of the possibility of edit/tmp/swp files finding their way in there and littering the list of data files. It just never occurred to me. So you are right that some people are not aware of it, and that it could happen if one is not careful. In practice, this could be solved by people configuring their editors to not produce such files all around the disk but in a central place, however there will be situations where that is not so.
On the other hand, not being able to specify relaxed enough globs, which results in likeliness of new files not being included when added by a team member that is not aware of this, is also not good. It can be prevented by having thorough testing that will catch the fact that file was not included in the runtime, but in practice many projects will not test that well enough, especially with IO testing not being a simple thing in Haskell, so there will be situations where this will result with runtime errors possibly.
So this is obviously a trade-off with no clear win -> one solution may result with errors in runtime, another might result with redundant files being packed with the project, both avoidable if people stick with best practices, but in practice they will happen. I would personally prefer redundant files over runtime errors, but that is subjective. If decision here was easy, we wouldn't have so many people in this thread holding different sides to this argument.
What I personally like the best as a possible approach, based on what others wrote above:
Relax the globs and ignore the files from .gitignore. (See EDIT below, I changed my mind while writing this) For me, this would be perfect. If there is a swp or tmp file that is not captured by .gitignore, I will see it anyway when I do git status
and take care of it then -> either decide it should go and I will commit it, or I will add the rule to gitignore, or delete it. Good thing with relaxed globs is that if somebody still wants to be more careful and use only globs with extensions, they can. But if I want to be more permissive in order to ensure everything from that directory subtree gets in my project, because I value having that insurance more than a possibility of redundant file (which is also greatly reduced if we do the .gitignore approach) I can do that. Even if not using the .gitignore approach from above, it seems to me that relaxing the globs might be a good way to go since it allows people to choose their approach -> no redundant files vs runtime errors.
Btw keep in mind that I am mostly coming from maintainer of a Haskell app, which is not getting published to Hackage and will never use sdist. I am also an author of Haskell package that is published on Hackage, so I do have some of that perspective, but there I never had to use data-files, on the other hand for this Haskell app we are using it intensively.
It seems to me like a lot of arguments here are coming from the perspective of Hackage package maintainer, but that is not the only use case.
p.s. What is the impact of accidentally having swp/tmp/redundant files included in package?
EDIT: The more I think about it, the more I realize that runtime errors are not so directly correlated with this. They could also happen if somebody deletes the file while globs are too relaxed since cabal will not complain. And if somebody just added file, even if they don't have tests, they will very likely at least manually test it, so it is unlikely they will not realize that it is not being included. So it mostly comes down to ergonomics, and this kind of negates my argument from before to extent. With this in mind, I am actually even leaning more on the side of restricted globs (as they are now)! I didn't expect this is how my piece here will end, but there it is.
What is the impact of accidentally having swp/tmp/redundant files included in package?
The worst case I can imagine is leaking sensitive data to the public, but runtime errors are possible too, for example if the program reads all data files assuming they are a certain format.
Idris2 doesn't use Haskell so we don't have this problem anymore, but I still would like the ability to use a glob that specifies a full file name at the end like tests/**/Makefile
in cabal, even if it has no extension. There is no risk for confusion with undesirable files there, and it's useful as there are a bunch of file types that traditionally doesn't use extensions.
I can't see a good reason those were blocked before. Added a PR that changes this (though of course extensions to glob syntax will require people use a sufficiently new cabal file version.)
Currently on https://www.haskell.org/cabal/users-guide/developing-packages.html#pkg-field-data-files
On Unix, there is no concept of "file extension". That is a Windows-only concept.
I have a directory of files I'd like to include in my cabal package:
These files have no extension.
They are part of a submodule of another project, so I can't easily change them.
I can't even use the well-fitting
syscalls-table/tables/syscalls-*
, also not allowed.I think we should remove above restriction from Cabal, allowing
data/*
.Users can still do
data/*.txt
and so on where applicable, but this restriction right now doens't help anybody and increases that things are packaged incorrectly due to extra code duplication to include all relevant files manually.