pypa / hatch

Modern, extensible Python project management
https://hatch.pypa.io/latest/
MIT License
6.11k stars 309 forks source link

Prevent matching bogus parent gitignores #1643

Closed jameshilliard closed 2 weeks ago

jameshilliard commented 4 months ago

If our project root directory is matched by an exclude pattern we should assume the pattern is invalid as our project root is likely in an ignored build directory of another project.

Should fix: #1641

This is essentially a port of the same technique I used to fix a similar poetry core bug.

jameshilliard commented 2 months ago

@ofek Think you could take a look at this?

jameshilliard commented 2 months ago

FYI tests on master are currently broken same as here so I don't think this is introducing any regressions.

mv1005 commented 2 months ago

Hi there,

I just spent some time on trying to fix my project not to produce empty wheels until I stumbled over this issue.

Why I am writing here is because it was surprising me that such simple condition (at least as appearing to me) like "do not cross project root when scanning for certain files" really is a concern here.

My first thought was: Wouldn't it be enough to simply stop scanning upwards the file tree when we encounter a pyproject.toml or a hatch.toml? As I understand, those files mark the project root. My suggestion obviously is too strict if there is any valid use case for having (relevant) .gitignore-files outside the project root. Are there any?

Markus

jameshilliard commented 2 months ago

Wouldn't it be enough to simply stop scanning upwards the file tree when we encounter a pyproject.toml or a hatch.toml?

Technically that would also prevent the issue I was seeing...however that would differ from git's normal gitignore behavior so it may not be desirable.

My suggestion obviously is too strict if there is any valid use case for having (relevant) .gitignore-files outside the project root. Are there any?

If the python project root is in a subdirectory of a git project root directory it's possible that someone may want to use the gitignore above the python project root subdirectory but within the git project root directory to ignore specific files, so there may be a use case here for wanting to try and replicate git's behavior more closely.

The logic of this check as I implemented it should only filter out the use of gitignore in cases where the python project root directory is itself entirely ignored(a case where it is effectively guaranteed that the python project is not tracked by git at all), while still allowing for files within the project root directory to be ignored if needed using gitignores above the project root.

mv1005 commented 2 months ago

I see that there are valid use cases for having a relevant .gitignore outside the project root.

In my case, the situation was a bit special, since the python project I was working on was completely unrelated to any VCS. It was just a random project sitting in my user's home.

Ages ago, I decided (and almost forgot about) to put some of the config files in my home under version control with git (literally making my whole home a git repo). Now, to not have random stuff under version control, I used kind of inverse git-ignore logic with a .gitignore like this:

*
!.gitignore
!.bashrc
!.profile
!.bash_aliases
!.bash_profile
!.gitconfig
!.gitpwd.sh
!.Xmodmap
!.vimrc
!.pinerc
!.Xdefaults
[...]

I absolutely admit that this is a very rare case, and I also know that there are better/other ways today to manage user configuration. I just wanted to point out that there might be use cases that involve .gitignore files that are completely unrelated to the python project in question. It is not obvious and at least to me really unexpected that hatch even looks at files outside the project root. That was the reason for me having hard times to figure out what's going on here, and still it feels somewhat strange that "some entity" like hatch crawls up the whole file system and considers any .gitignore it finds.

What if in some completely unrelated .gitignore is a pattern like this?

*.py

I assume this still would break things.

What would have helped me a lot to spot the issue in a straighter fashion:

jameshilliard commented 2 months ago

In my case, the situation was a bit special, since the python project I was working on was completely unrelated to any VCS. It was just a random project sitting in my user's home.

This is designed to handle cases like that, it's designed for really any case where the project root is itself ignored. Does it not work for you?

I just wanted to point out that there might be use cases that involve .gitignore files that are completely unrelated to the python project in question.

I mean, that's effectively what this was designed for, in my case I have a fully ignored python project root build directory inside a meta build system.

It is not obvious and at least to me really unexpected that hatch even looks at files outside the project root.

Yeah, it's not very obvious, so much so that this sort of behavior of matching bogus parent gitignore files was even independently re-implemented in virtually every newer pep517 build backend in some form or another. I managed to fix most of them already but hatch and flit(but not flit-core so there's much less real world issues for downstream integrators) AFAIU are the only ones remaining with fixes pending for this type of bug.

What if in some completely unrelated .gitignore is a pattern like this?

*.py

I assume this still would break things.

If the project root is ignored then this change should disable the use of .gitignore files entirely so having additional more specific ignores like *.py shouldn't be an issue, if you have a parent gitignore which does not ignore the project root then it's assumed that the project is actually tracked by git(which should replicate git's behavior) and thus the ignores will be used. I think this results in the least surprising and expected behavior.

mv1005 commented 2 months ago

This is designed to handle cases like that, it's designed for really any case where the project root is itself ignored. Does it not work for you?

Up to know, I just assumed that it would work :slightly_smiling_face:. But to confirm that, I ran a quick test by installing your version directly from git inside a virtualenv.

$ pip freeze | grep hatch
hatch @ git+https://github.com/jameshilliard/hatch.git@10d23087d03c7ba51a5564402701d01de0f6f6f6
hatchling==1.25.0
$ hatch --version
Hatch, version 1.12.1.dev18
$ hatch build -t wheel
dist/photo_importer-0.0.1-py3-none-any.whl
$ 7za l dist/photo_importer-0.0.1-py3-none-any.whl 
Listing archive: dist/photo_importer-0.0.1-py3-none-any.whl
[...]

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2020-02-02 00:00:00 .....         1559          593  photo_importer-0.0.1.dist-info/METADATA
2020-02-02 00:00:00 .....           87           86  photo_importer-0.0.1.dist-info/WHEEL
2020-02-02 00:00:00 .....           48           49  photo_importer-0.0.1.dist-info/entry_points.txt
2020-02-02 00:00:00 .....         1097          642  photo_importer-0.0.1.dist-info/licenses/LICENSE.txt
2020-02-02 00:00:00 .....          437          267  photo_importer-0.0.1.dist-info/RECORD
------------------- ----- ------------ ------------  ------------------------
2020-02-02 00:00:00               3228         1637  5 files

As you can see, there are no package files contained in the wheel (there should be a package named photo_importer).

If I either remove my .gitignore file from my home or set the option

[tool.hatch.build]
ignore-vcs = true

in pyproject.toml I get a wheel with the expected contents:

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2020-02-02 00:00:00 .....          127          121  photo_importer/__about__.py
2020-02-02 00:00:00 .....          105           99  photo_importer/__init__.py
2020-02-02 00:00:00 .....         8090         2592  photo_importer/phi.py
2020-02-02 00:00:00 .....         1559          593  photo_importer-0.0.1.dist-info/METADATA
2020-02-02 00:00:00 .....           87           86  photo_importer-0.0.1.dist-info/WHEEL
2020-02-02 00:00:00 .....           48           49  photo_importer-0.0.1.dist-info/entry_points.txt
2020-02-02 00:00:00 .....         1097          642  photo_importer-0.0.1.dist-info/licenses/LICENSE.txt
2020-02-02 00:00:00 .....          680          405  photo_importer-0.0.1.dist-info/RECORD
------------------- ----- ------------ ------------  ------------------------
2020-02-02 00:00:00              11793         4587  8 files

It looks like your fix indeed does not work for my case. Did I miss something?

jameshilliard commented 2 months ago

It looks like your fix indeed does not work for my case. Did I miss something?

Hmm, try with print(patterns) above this line and see what it's showing.

jameshilliard commented 2 months ago

$ pip freeze | grep hatch hatch @ git+https://github.com/jameshilliard/hatch.git@10d23087d03c7ba51a5564402701d01de0f6f6f6 hatchling==1.25.0

I think this is your issue, you forgot to actually install my change from this PR, the change is part of the hatchling package not the hatch package, this git repo contains both projects but I only made changes to hatchling.

I tried replicating your environment and it seems that was the issue, also I'm not sure which hatchling package the hatch build -t wheel command uses for the build backend but it seems to not be picking up the one installed in the active virtualenv for some reason, I testing wheel builds with the standard build frontend which seems to work as expected. Presumably if the pypi hatchling version contains this change it would also work.

Currently doesn't work:

hatch build -t wheel

Works:

python -m build --no-isolation .
mv1005 commented 2 months ago

Thanks for testing with respect to my setup.

I fixed my test environment and now have:

$ pip freeze | grep hatch
-e git+https://github.com/jameshilliard/hatch.git@10d23087d03c7ba51a5564402701d01de0f6f6f6#egg=hatch
-e git+https://github.com/jameshilliard/hatch.git@10d23087d03c7ba51a5564402701d01de0f6f6f6#egg=hatchling&subdirectory=backend

I installed in editable mode from a local working copy in order to be able to quickly switch branches.

Hereby I can confirm:

I verified that none of the both calls above works when running from master, so your fix is effective.

I investigated a bit more what code actually runs when using hatch build.

In my estimation, using hatch build is a very common use case, so we should make it work somehow too. I probably could dig deeper into the code next week to find out the differences between the actually executed code paths of the two build command flavors.

mv1005 commented 1 month ago

Ok, it took me a while, but here are my findings.

The reason why hatch build does not pick up local changes you eventually applied to the hatchling backend simply is because it installs an isolated fresh version of hatchling from upstream when setting up it's internal build environment.

I did not dig deeper into the code of hatch or the build config options if there is any way to override this behavior.

As a test, I forcibly installed your patched version of hatchling into the isolated build environment that hatch has created for my project. After doing so, hatch build generates a wheel containing all files as expected :tada:.

As a conclusion:

Update:

Just after writing this, I found the following way to make hatch automatically use your local version of the build backend.

Set the following in your pyproject.toml:

[build-system]                                                                                                                                                            
requires = ["hatchling @ file:///path/to/your/local/hatch/backend"]                                                                             
build-backend = "hatchling.build" 
jameshilliard commented 1 month ago

rebased

ofek commented 2 weeks ago

Do you have an example so that I could reproduce?

jameshilliard commented 2 weeks ago

Do you have an example so that I could reproduce?

Just build any package using hatchling from an extracted sdist inside a gitignored directory, for example:

If our top level working git directory is:

/home/buildroot/buildroot/.git

With * in this gitignore file:

/home/buildroot/buildroot/output/.gitignore

Then when building a sdist python package from the following build directory which is extracted from the package sdist:

/home/buildroot/buildroot/output/build/packagename

We will end up with an empty wheel file as all paths underneath this are ignored and incorrectly excluded by hatchling:

/home/buildroot/buildroot/output
jameshilliard commented 2 weeks ago

rebased again

jameshilliard commented 2 weeks ago

I think there is an edge case lurking here regarding matching. I feel like rather than checking if the absolute path matches the logic should check the relative path.

Hmm, so this may be a pre-existing bug, in buildroot we are in fact currently relying on hatchling not parsing a parent .gitignore with an absolute path correctly in order to avoid triggering the bug this PR was intended to fix. Weirdly maturin appears to have completely independently implemented this exact same bug.

For example hatchling doesn't pick up a /output exclude path in:

/home/buildroot/buildroot/.gitignore

But does pick up a * exclude path in:

/home/buildroot/buildroot/output/.gitignore

This is despite to my understanding these having identical meaning by git itself.

ofek commented 2 weeks ago

What do you think I should do?

jameshilliard commented 2 weeks ago

What do you think I should do?

Probably you would initially want to go through and root cause the reason why hatchling's gitignore logic is not following normal git rules(i.e. not picking up paths like /output in the above example).

Maybe also try and make sure gitignore path handling is consistent in regards to relative/absolute paths everywhere in the codebase. The relative/absolute path handling in hatchling is at least for me somewhat tricky to reason about due to the complexity of the config.py code. Possibly this complexity is largely unavoidable but I'd assume there would be at least some opportunities there for refactoring in order to simplify things somewhat.

I think there is an edge case lurking here regarding matching.

BTW I'm quite confident the general approach I'm using to fixing the bogus parent matching bug in this PR is overall fairly reliable(since it's worked reliably for a while in other projects like poetry), although I'm likewise not entirely sure about all the edge case in regards to absolute/relative paths are being handled correctly here.