materialsproject / maggma

MongoDB aggregation machine
https://materialsproject.github.io/maggma/
Other
38 stars 32 forks source link

[Bug]: numpy 2.0 breaks OpenData store (probable pandas bug) #995

Open rkingsbury opened 2 months ago

rkingsbury commented 2 months ago

Since updating dependencies to allow numpy 2.0 (#986 ), we have a test failure in the OpenData store that is being triggered by pandas. See, for example, this failed test run.

The exception raised is

   def resolve(self, key: str, is_local: bool):
        """
        Resolve a variable name in a possibly local context.

        Parameters
        ----------
        key : str
            A variable name
        is_local : bool
            Flag indicating whether the variable is local or not (prefixed with
            the '@' symbol)

        Returns
        -------
        value : object
            The value of a particular variable
        """
        try:
            # only look for locals in outer scope
            if is_local:
                return self.scope[key]

            # not a local variable so check in resolvers if we have them
            if self.has_resolvers:
                return self.resolvers[key]

            # if we're here that means that we have no locals and we also have
            # no resolvers
            assert not is_local and not self.has_resolvers
            return self.scope[key]
        except KeyError:
            try:
                # last ditch effort we look in temporaries
                # these are created when parsing indexing expressions
                # e.g., df[df > 0]
                return self.temps[key]
            except KeyError as err:
>               raise UndefinedVariableError(key, is_local) from err
E               pandas.errors.UndefinedVariableError: name 'np' is not defined

/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/pandas/core/computation/scope.py:244: UndefinedVariableError

The test triggering the failure is test_update:

def test_update(s3store):
    assert len(s3store.index_data) == 2
    s3store.update(
        pd.DataFrame(
            [
                {
                    "task_id": "mp-199999",
                    "data": "asd",
                    "group": {"level_two": 4},
                    s3store.last_updated_field: datetime.utcnow(),
                }
            ]
        )
    )

I did some debugging on the resolve function in pandas (see source code here) and determined that the number in {"level_two": 4}, is getting turned into a np.int64 and that the key and is_local args to the resolve function are np and False, respectively.

Somewhere, pandas is getting confused and using np as a variable name. I'm not sure how or why this is happening but I have a feeling it is a pandas bug. The following may be relevant:

https://github.com/pandas-dev/pandas/issues/54252

https://numpy.org/devdocs/numpy_2_0_migration_guide.html#windows-default-integer

Version

latest

Which OS?

rkingsbury commented 2 months ago

@kbuma would you (or others very familiar with this Store) be able to investigate this and, if appropriate, open a bug report in pandas?

This appears to only affect the OpenDataStore, so I'm interested in opinions from @kbuma , @jmmshn , @munrojm , and @yang-ruoxi as to whether it would be less disruptive to 1) continue to allow maggma to install with numpy 2.0 (with a broken opendatastore) or 2) revert #986 and wait on numpy 2.0 support until this is fixed.

rkingsbury commented 2 months ago

Tagging #848 and #991 as related

kbuma commented 2 months ago

i'll look into it

rkingsbury commented 2 months ago

Thank you @kbuma !

kbuma commented 2 months ago

@rkingsbury per comments in https://github.com/pandas-dev/pandas/issues/59668 "pandas does not have a released version that officially supports numpy 2.1." as of yet.

One possible path forward is to remove the open data stores (and thus the Pandas dependency) from the maggma project - we ended up using a different approach for the new builders so are not actively using these stores in MP infra. Let me know if that is an option.

rkingsbury commented 2 months ago

Thanks @kbuma ! FWIW, I believe I saw the bug with numpy>2 < 2.1 as well. Regardless, it's helpful to know that no one (at least no one in MP) is using OpenDataStore.

I like the idea of having some way of interfacing maggma with pandas eventually, so I'm hesitant to remove the class wholesale. What I suggest instead is

  1. disable the failing test (mark xfail)
  2. Issue a warning on OpenDataStore.__init__ to use with caution, because the store is in a a deprecated / unfinished / untested state, and in particular may be incompatible with numpy 2.0+.

That way we can have tests passing but allow future users to build on the nice work that went into OpenDataStore. What do you think? If you agree, would you be able to open a PR with those changes?

kbuma commented 2 months ago

Thanks @rkingsbury ! I agree with your suggestion and will look to find some time this week to put together a PR.