Remove the concept of "file extension" from data-files/extra-source-files wildcards

nh2 commented 5 years ago

Currently on https://www.haskell.org/cabal/users-guide/developing-packages.html#pkg-field-data-files

If a wildcard is used, it must be used with an extension, so data-files: data/* is not allowed.

On Unix, there is no concept of "file extension". That is a Windows-only concept.

I have a directory of files I'd like to include in my cabal package:

% ls -1U syscalls-table/tables/
syscalls-tile
syscalls-xtensa
syscalls-x86_64
...

These files have no extension.

They are part of a submodule of another project, so I can't easily change them.

I can't even use the well-fitting syscalls-table/tables/syscalls-*, also not allowed.

I think we should remove above restriction from Cabal, allowing data/*.

Users can still do data/*.txt and so on where applicable, but this restriction right now doens't help anybody and increases that things are packaged incorrectly due to extra code duplication to include all relevant files manually.

nh2 commented 5 years ago

Also from the docs:

The reason for providing only a very limited form of wildcard is to concisely express the common case of a large number of related files of the same file type without making it too easy to accidentally include unwanted files.

This rationale makes sense to motivat not adding super sophisticated globbing features, there's no point artificially restricting trivial use cases.

quasicomputational commented 5 years ago

Agreed that this restriction isn't pulling its weight. The implementation would actually have been simpler if this restriction had never existed, but we don't live in that world.

The only snare is that we'll want to have the less-restricted globs only if the cabal-version is sufficiently high and the previous behaviour if not, so there's no reduction in complexity. But defaults do matter and we should make the easy cases easy; the details about the history of the syntax allowed can be shunted off into an appendix or something, and I think we can test it well enough to give confidence in the correctness of the implementation.

melted commented 5 years ago

This is a big pain for Idris in getting it working with new-install. Every test for the compiler has files named run and expected, so we can't use wildcards as they are.

phadej commented 5 years ago

Iirc the extension is there to prevent sdisting tmp files, e.g. .dataFile.txt.swp or dataFile.txt~ or whatever hidden there.

And we don’t want maintain the list of all possible tmp file variants either.

On 13 Apr 2019, at 17.51, Niklas Larsson notifications@github.com wrote:

This is a big pain for Idris in getting it working with new-install. Every test for the compiler has files named run and expected, so we can't use wildcards as they are.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

chreekat commented 5 years ago

How about at least letting '.' be considered a filename character? Cabal is currently telling me that foo/*.template isn't matching foo/bar.yml.template.

nh2 commented 5 years ago

Iirc the extension is there to prevent sdisting tmp files, e.g. .dataFile.txt.swp or dataFile.txt~ or whatever hidden there.

And we don’t want maintain the list of all possible tmp file variants either.

@phadej Then let's try to find another solution to exclude those, such as allowing the user to specify patterns to exclude.

Enabling a convenience feature (avoiding sdisting tempfiles) by making a normal use case completely impossible (globbing normal extensioneless files) is a severe restriction and makes cabal inconvenient to use.

The rest of the unix world won't add extensions to their files so that cabal can glob them.

phadej commented 5 years ago

@nh2, please tell when you find one.

"Allow user to specify patterns to exclude" is waiting for mistakes to happen: Including unnecessary files in sdist tarballs. And a mistake which is difficult to catch on CI (missing file is obvious).

So I'm simply :-1: on this, before some good and concrete plan forward is written down.

@chreekat use cabal-version: 2.4 https://cabal.readthedocs.io/en/latest/developing-packages.html#pkg-field-data-files

Prior to Cabal 2.4, when matching a wildcard plus extension, a file’s full extension must match exactly, so .gz matches foo.gz but not foo.tar.gz. This restriction has been lifted when cabal-version: 2.4 or greater so that .gz does match foo.tar.gz

chreekat commented 5 years ago

@phadej one solution would be to do what hpack does: ignore hidden files (starting with '.') and otherwise just use Unix globs. I believe this would cover both cases you mentioned: .dataFile.txt.swp would be ignored, and dataFile.txt~ would not match *.txt.

It's true that there is no perfect solution to this, but from a Linux user's perspective, some things are more surprising than others. Cabal's behavior is pretty far out there.

chreekat commented 5 years ago

Oh I just saw you pinged me as well. Thanks for the tip!

nh2 commented 5 years ago

ignore hidden files (starting with '.') and otherwise just use Unix globs. I believe this would cover both cases you mentioned: .dataFile.txt.swp would be ignored, and dataFile.txt~ would not match *.txt.

This sounds like a good approach to me: Ignoring dotfiles is a common behaviour, and for the case that one wants them, one could give data-files: .* explicitly.

melted commented 5 years ago

@phadej Why can't I use the form foo/**/Makefile as a wildcard? There is no risk for any confusion there.

Martinsos commented 2 years ago

+1 for removing the restrictions from globs when specifying data-files.

I have a directory data where I have a lot of template files which I need packaged with the binary. There are dozens of these files.

So far I was using Stack and I had following config:

data-dir: data/
data-files:
  - Generator/templates/**/*
  - Cli/bash-completion
  - Cli/templates/**/*

That was great!

This is what hpack generated, in .cabal file:

data-files:
    Generator/templates/db/schema.prisma
    Generator/templates/Dockerfile
    Generator/templates/dockerignore
    Generator/templates/react-app/gitignore
    Generator/templates/react-app/netlify.toml
    Generator/templates/react-app/package.json
    Generator/templates/react-app/public/favicon.ico
    Generator/templates/react-app/public/index.html
    Generator/templates/react-app/public/manifest.json
    Generator/templates/react-app/README.md
    Generator/templates/react-app/src/actions/_action.js
    Generator/templates/react-app/src/api.js
    Generator/templates/react-app/src/auth/forms/Login.js
    Generator/templates/react-app/src/auth/forms/Signup.js
    Generator/templates/react-app/src/auth/login.js
    Generator/templates/react-app/src/auth/logout.js
    Generator/templates/react-app/src/auth/pages/createAuthRequiredPage.js
    Generator/templates/react-app/src/auth/signup.js
    Generator/templates/react-app/src/auth/useAuth.js
    Generator/templates/react-app/src/config.js
    Generator/templates/react-app/src/index.css
    Generator/templates/react-app/src/index.js
    Generator/templates/react-app/src/logo.png
    Generator/templates/react-app/src/operations/index.js
    Generator/templates/react-app/src/operations/resources.js
    Generator/templates/react-app/src/queries/_query.js
    Generator/templates/react-app/src/queries/index.js
    Generator/templates/react-app/src/queryCache.js
    Generator/templates/react-app/src/router.js
    Generator/templates/react-app/src/serviceWorker.js
    Generator/templates/react-app/src/utils.js
    Generator/templates/server/gitignore
    Generator/templates/server/npmrc
    Generator/templates/server/nvmrc
    Generator/templates/server/package.json
    Generator/templates/server/README.md
    Generator/templates/server/src/actions/_action.js
    Generator/templates/server/src/app.js
    Generator/templates/server/src/config.js
    Generator/templates/server/src/core/auth.js
    Generator/templates/server/src/core/auth/prismaMiddleware.js
    Generator/templates/server/src/core/AuthError.js
    Generator/templates/server/src/core/HttpError.js
    Generator/templates/server/src/dbClient.js
    Generator/templates/server/src/queries/_query.js
    Generator/templates/server/src/routes/auth/index.js
    Generator/templates/server/src/routes/auth/login.js
    Generator/templates/server/src/routes/auth/me.js
    Generator/templates/server/src/routes/auth/signup.js
    Generator/templates/server/src/routes/index.js
    Generator/templates/server/src/routes/operations/_action.js
    Generator/templates/server/src/routes/operations/_query.js
    Generator/templates/server/src/routes/operations/index.js
    Generator/templates/server/src/server.js
    Generator/templates/server/src/utils.js
    Cli/bash-completion
    Cli/templates/new/ext/Main.css
    Cli/templates/new/ext/MainPage.js
    Cli/templates/new/ext/waspLogo.png

However, I am migrating to Cabal now, and the restrictions above pose following problems:

Some of the files don't have extensions, so I have to specify the full path to them manually. e.g. Dockerfile, bash-completion, dockerignore, gitignore, npmrc, nvmrc.
If somebody adds a new data file, and it has no extension or it doesn't have extension captured by current globs, it won't be packaged with the binary, likely resulting with the runtime error at some point. With Stack globs chance of that was relatively low since globs were pretty relaxed, but with these restricted globs this chance is much higher.

For example, I now converted the above list of data files to:

data-files:
  Generator/templates/Dockerfile
  Generator/templates/dockerignore
  Generator/templates/react-app/gitignore
  Generator/templates/server/gitignore
  Generator/templates/server/npmrc
  Generator/templates/server/nvmrc
  Generator/templates/**/*.prisma
  Generator/templates/**/*.toml
  Generator/templates/**/*.json
  Generator/templates/**/*.ico
  Generator/templates/**/*.html
  Generator/templates/**/*.md
  Generator/templates/**/*.js
  Generator/templates/**/*.css
  Generator/templates/**/*.png
  Cli/bash-completion
  Cli/templates/**/*.css
  Cli/templates/**/*.js
  Cli/templates/**/*.png

But I just know, that at some point, some team member will add a file to Generator/templates/ that has extension which we didn't cover here, and we will all spend some time figuring out why is that file not there -> even worse, we might not catch it with tests and ship it.

I would advocate for more relaxed rules, where * for the whole file name including extension is allowed, and if possible trash or tmp files are a problem, let user care about that, or give them the ability to exclude such files with additional rules. I would much rather by accident include a .swp file than by accident not include a file I need and risk a runtime error.

phadej commented 2 years ago

Please tell how to avoid cabal sdist picking up .DS_Store, style.css.swp or style.css~ etc without hardcoding all possible temporory files into Cabal.

chreekat commented 2 years ago

ignored-files:
    .DS_Store
    *.swp
    *~

Or we could bless git and have an option for including a .gitignore file.

phadej commented 2 years ago

That won't work. Package author/maintainer is not the only person who sdists. (e.g. source-repository-package is one case).

OTOH, if people want to be able to shoot themselves into the foot, i guess there is no way to convince them it's a bad idea.

IMHO, globs were a bad idea, but then there weren't hpack, cabal-fmt like tools. .cabal format shouldn't sacrifice explicitness for convenience. That is a bad trade-off.

chreekat commented 2 years ago

Can you explain how it's a bad idea? I guess I have no concept of the impacted workflow you are using as a counterpoint.

philderbeast commented 2 years ago

IMHO, globs were a bad idea, but then there weren't hpack, cabal-fmt like tools. .cabal format shouldn't sacrifice explicitness for convenience. That is a bad trade-off.

I'm very much in agreement on this. Tools like hpack and hpack-dhall do glob expansions well and can be a nice way to maintain an up to date and well formatted cabal file. How did we accept globs? They're hardly pure!

chreekat commented 2 years ago

So should we update the docs? ".cabal format is read-only. Use a different tool [which?] to generate it."

gbaz commented 2 years ago

My view is I believe we should allow specific extensions to glob expressivity which do not make it easy for users to accidentally add temp files. We should not add extensions which have a high chance of picking up temp and "junk" files.

Even in the example just given, I could very well imagine a user editing the templates in some editor they don't usually use, leaving around a tilde file, and that tilde file getting picked up in the next sdist. The case of having your directories of files you actually edit littered with all manner of temp files is not uncommon.

So while this request may look appealing at first, I expect it will cause problems for even the people who think "No, I'm too smart and careful, I wouldn't do that!" Sure, but can you say that for everyone else working on your project? Can you say that for yourself doing a last-minute sdist at 2am? Etc.

That said, I'm also very much open to ergonomic features that can improve the UX of working with this. Perhaps there could be some way to indicate to sdist that it should report all files in a directory that are not picked up by the regexps, and it can warn on check to force you to ensure that your patterns are comprehensive.

phadej commented 2 years ago

Perhaps there could be some way to indicate to sdist that it should report all files in a directory that are not picked up by the regexps, and it can warn on check to force you to ensure that your patterns are comprehensive.

Tests in CI. That's why e.g. haskell-ci starts from making a sdist of a package, not with a checkout. That way if something is missing, CI catches it. (That's also why I don't like "read what files are there and run the procedure" -tests, as these "work" even there are no example tests in sdist. Somewhere the files should be listed explicitly!)

gbaz commented 2 years ago

I was thinking e.g. desired-data-files: which cabal check could diff against the actually enumerated files. 🤔

Martinsos commented 2 years ago

I have to admit, in the use case I described above, which is an everyday use case for me and my team, I have never thought of the possibility of edit/tmp/swp files finding their way in there and littering the list of data files. It just never occurred to me. So you are right that some people are not aware of it, and that it could happen if one is not careful. In practice, this could be solved by people configuring their editors to not produce such files all around the disk but in a central place, however there will be situations where that is not so.

On the other hand, not being able to specify relaxed enough globs, which results in likeliness of new files not being included when added by a team member that is not aware of this, is also not good. It can be prevented by having thorough testing that will catch the fact that file was not included in the runtime, but in practice many projects will not test that well enough, especially with IO testing not being a simple thing in Haskell, so there will be situations where this will result with runtime errors possibly.

So this is obviously a trade-off with no clear win -> one solution may result with errors in runtime, another might result with redundant files being packed with the project, both avoidable if people stick with best practices, but in practice they will happen. I would personally prefer redundant files over runtime errors, but that is subjective. If decision here was easy, we wouldn't have so many people in this thread holding different sides to this argument.

What I personally like the best as a possible approach, based on what others wrote above: Relax the globs and ignore the files from .gitignore. (See EDIT below, I changed my mind while writing this) For me, this would be perfect. If there is a swp or tmp file that is not captured by .gitignore, I will see it anyway when I do git status and take care of it then -> either decide it should go and I will commit it, or I will add the rule to gitignore, or delete it. Good thing with relaxed globs is that if somebody still wants to be more careful and use only globs with extensions, they can. But if I want to be more permissive in order to ensure everything from that directory subtree gets in my project, because I value having that insurance more than a possibility of redundant file (which is also greatly reduced if we do the .gitignore approach) I can do that. Even if not using the .gitignore approach from above, it seems to me that relaxing the globs might be a good way to go since it allows people to choose their approach -> no redundant files vs runtime errors.

Btw keep in mind that I am mostly coming from maintainer of a Haskell app, which is not getting published to Hackage and will never use sdist. I am also an author of Haskell package that is published on Hackage, so I do have some of that perspective, but there I never had to use data-files, on the other hand for this Haskell app we are using it intensively.

It seems to me like a lot of arguments here are coming from the perspective of Hackage package maintainer, but that is not the only use case.

p.s. What is the impact of accidentally having swp/tmp/redundant files included in package?

EDIT: The more I think about it, the more I realize that runtime errors are not so directly correlated with this. They could also happen if somebody deletes the file while globs are too relaxed since cabal will not complain. And if somebody just added file, even if they don't have tests, they will very likely at least manually test it, so it is unlikely they will not realize that it is not being included. So it mostly comes down to ergonomics, and this kind of negates my argument from before to extent. With this in mind, I am actually even leaning more on the side of restricted globs (as they are now)! I didn't expect this is how my piece here will end, but there it is.

fgaz commented 2 years ago

What is the impact of accidentally having swp/tmp/redundant files included in package?

The worst case I can imagine is leaking sensitive data to the public, but runtime errors are possible too, for example if the program reads all data files assuming they are a certain format.

melted commented 2 years ago

Idris2 doesn't use Haskell so we don't have this problem anymore, but I still would like the ability to use a glob that specifies a full file name at the end like tests/**/Makefile in cabal, even if it has no extension. There is no risk for confusion with undesirable files there, and it's useful as there are a bunch of file types that traditionally doesn't use extensions.

gbaz commented 2 years ago

I can't see a good reason those were blocked before. Added a PR that changes this (though of course extensions to glob syntax will require people use a sufficiently new cabal file version.)

haskell / cabal

Remove the concept of "file extension" from data-files/extra-source-files wildcards #5883