qiskit-community / qiskit-nature

Qiskit Nature is an open-source quantum computing framework for solving quantum mechanical natural science problems.
https://qiskit-community.github.io/qiskit-nature/
Apache License 2.0

PySCF QCSchema is not JSON serializable #1363

Open S-Erik opened 4 months ago

S-Erik commented 4 months ago

Environment

What is happening?

The QCSchema object produced by a PySCFDriver is not JSON serializable: calling to_json() on it raises TypeError: Object of type int64 is not JSON serializable.

How can we reproduce the issue?

The following code results in the error TypeError: Object of type int64 is not JSON serializable:

from qiskit_nature.units import DistanceUnit
from qiskit_nature.second_q.drivers import PySCFDriver

driver = PySCFDriver(
    atom="H 0 0 0; H 0 0 0.735",
    basis="sto3g",
    charge=0,
    spin=0,
    unit=DistanceUnit.ANGSTROM,
)

problem = driver.run()

schema = driver.to_qcschema()

# Trying to convert QCSchema to JSON
schema.to_json()

Output:

Traceback (most recent call last):
  File "pyscf_json.py", line 17, in <module>
    schema.to_json()
  File "../qiskit-nature/qiskit_nature/second_q/formats/qcschema/qc_base.py", line 67, in to_json
    return json.dumps(self.to_dict(), indent=2)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type int64 is not JSON serializable

What should happen?

The QCSchema object produced by the PySCFDriver should be JSON serializable, and the code above should run without errors.
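Until the driver itself is fixed, one possible workaround is to serialize the dictionary form of the schema with a `default` hook for `json.dumps`. The traceback shows that `to_json()` calls `json.dumps(self.to_dict(), indent=2)`, so the same call can be made by hand. This is only a sketch: `numpy_default` is a hypothetical helper name, and `schema_dict` is a hand-written stand-in for `schema.to_dict()` so that the snippet runs without PySCF.

```python
import json

import numpy as np


def numpy_default(obj):
    """Fallback for json.dumps: convert numpy values to builtin Python types."""
    if isinstance(obj, np.generic):
        # numpy scalar such as np.int64 -> builtin int/float/bool
        return obj.item()
    if isinstance(obj, np.ndarray):
        # numpy array -> (nested) list of builtin values
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Stand-in for schema.to_dict(); the keys and values are illustrative only.
schema_dict = {
    "properties": {"calcinfo_nbasis": np.int64(2), "calcinfo_nmo": np.int64(2)},
    "molecule": {"atomic_numbers": [np.int64(1), np.int64(1)]},
}

text = json.dumps(schema_dict, indent=2, default=numpy_default)
print(text)
```

With a real schema the call would presumably be `json.dumps(schema.to_dict(), indent=2, default=numpy_default)`. This only works around the symptom; the values stored in the schema remain numpy scalars.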

Any suggestions?

PySCF saves some properties as numpy scalars, e.g. the Mole.nao property, which is extracted here in pyscfdriver.py and is of type numpy.int64. These numpy scalars are not converted into native Python types (such as int or float) anywhere afterwards, e.g. in electronic_structure_driver.py or in pyscfdriver.py. As a result, some properties in the QCSchema object are numpy scalars and therefore not JSON serializable. In addition, the PySCF property atom_mass_list is converted to a Python list here in a way that yields a list of numpy scalars instead of a list of native Python types. Since atom_mass_list is a numpy ndarray, the tolist method should be used in my opinion.
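The behaviour described above can be reproduced without PySCF. The following minimal sketch uses a hand-made `numpy.int64` as a stand-in for a value such as `Mole.nao` (the variable names are illustrative, not taken from the driver) and contrasts `list(arr)` with `arr.tolist()` on an integer array.

```python
import json

import numpy as np

nao = np.int64(2)  # stand-in for a numpy scalar such as Mole.nao

try:
    json.dumps({"calcinfo_nbasis": nao})
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# .item() returns the equivalent builtin Python value, which json can encode.
print(json.dumps({"calcinfo_nbasis": nao.item()}))

arr = np.array([1, 2], dtype=np.int64)
as_list = list(arr)       # elements are still numpy scalars
as_tolist = arr.tolist()  # elements are builtin Python ints

print(type(as_list[0]).__name__, type(as_tolist[0]).__name__)
```

Iterating over an ndarray yields numpy scalars, which is why `list(...)` keeps the numpy types while `tolist()` converts every element.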

Therefore, I suggest the following changes to the files electronic_structure_driver.py and pyscfdriver.py.

pyscfdriver.py line 558:

- data.masses = list(self._mol.atom_mass_list())
+ data.masses = self._mol.atom_mass_list().tolist()

electronic_structure_driver.py line 227 onward:

-        properties = QCProperties()
-        properties.calcinfo_natom = len(data.symbols) if data.symbols is not None else None
-        properties.calcinfo_nbasis = data.nbasis
-        properties.calcinfo_nmo = data.nmo
-        properties.calcinfo_nalpha = data.nalpha
-        properties.calcinfo_nbeta = data.nbeta
-        properties.return_energy = data.e_ref
-        properties.nuclear_repulsion_energy = data.e_nuc
-        properties.nuclear_dipole_moment = data.dip_nuc
-        properties.scf_dipole_moment = data.dip_ref

-        def format_np_array(arr):
-            if isinstance(arr, Tensor):
-                # NOTE: this also deals with symmetry-reduced integral classes and ensures that
-                # they are not automatically unfolded to 1-fold symmetry
-                arr = arr.array
-            return arr.ravel().tolist()

+        def format_np_generic(value):
+            # Convert numpy generic types, like numpy.int64, to their Python equivalents
+            if isinstance(value, np.generic):
+                value = value.item()
+            return value

+        def format_np_array(arr):
+            if isinstance(arr, Tensor):
+                # NOTE: this also deals with symmetry-reduced integral classes and ensures that
+                # they are not automatically unfolded to 1-fold symmetry
+                arr = arr.array
+            return arr.ravel().tolist()

+        properties = QCProperties()
+        properties.calcinfo_natom = len(data.symbols) if data.symbols is not None else None
+        properties.calcinfo_nbasis = format_np_generic(data.nbasis)
+        properties.calcinfo_nmo = format_np_generic(data.nmo)
+        properties.calcinfo_nalpha = format_np_generic(data.nalpha)
+        properties.calcinfo_nbeta = format_np_generic(data.nbeta)
+        properties.return_energy = format_np_generic(data.e_ref)
+        properties.nuclear_repulsion_energy = format_np_generic(data.e_nuc)
+        properties.nuclear_dipole_moment = format_np_array(data.dip_nuc)
+        properties.scf_dipole_moment = format_np_array(data.dip_ref)

electronic_structure_driver.py line 335:

- return_result=data.e_ref,
+ return_result=format_np_generic(data.e_ref),
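To illustrate how the proposed helper behaves on different inputs, here is a standalone sketch. `format_np_generic` is copied from the diff above. `format_np_array_or_none` is a hypothetical variant that is not part of the proposal: it leaves out the Tensor handling and adds a `None` guard, on the assumption that some array-valued fields may be unset for some drivers, in which case `arr.ravel()` would raise an AttributeError.

```python
import numpy as np


def format_np_generic(value):
    # Convert numpy generic types, like numpy.int64, to their Python equivalents.
    # Builtin values and None are returned unchanged.
    if isinstance(value, np.generic):
        value = value.item()
    return value


def format_np_array_or_none(arr):
    # Hypothetical variant (assumption, not in the proposed diff):
    # pass None through instead of calling .ravel() on it.
    if arr is None:
        return None
    return np.asarray(arr).ravel().tolist()


samples = [np.int64(2), np.float64(-1.5), 3, 0.25, None]
converted = [format_np_generic(v) for v in samples]
print([type(v).__name__ for v in converted])

dipole = format_np_array_or_none(np.array([[0.0, 0.0, 1.5]]))
print(dipole)
```

Because the helper only acts on `np.generic` instances, wrapping fields that already hold builtin values or `None` does not change them.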

Further, it seems reasonable to me to add unit tests that exercise the to_json and to_hdf5 methods on the QCSchema produced by the PySCFDriver. I am thinking about something like this in test_driver_pyscf.py:

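    # Needs these module-level imports in the test file, if not already present:
    #   from pathlib import Path
    #   from tempfile import TemporaryDirectory
    #   import h5py

    def test_to_json(self):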
        """Check JSON-serializability of the driver"""
        driver = PySCFDriver(
            atom="H .0 .0 .0; H .0 .0 0.735",
            unit=DistanceUnit.ANGSTROM,
            charge=0,
            spin=0,
            basis="sto3g",
        )
        _driver_result = driver.run()
        schema = driver.to_qcschema()
        schema.to_json()

    def test_to_hdf5(self):
        """Check HDF5-serializability of the driver"""
        driver = PySCFDriver(
            atom="H .0 .0 .0; H .0 .0 0.735",
            unit=DistanceUnit.ANGSTROM,
            charge=0,
            spin=0,
            basis="sto3g",
        )
        _driver_result = driver.run()
        schema = driver.to_qcschema()

        with TemporaryDirectory() as tmp_dir:
            file_path = Path(tmp_dir) / "tmp.hdf5"
            with h5py.File(file_path, "w") as file:
                schema.to_hdf5(file)

Please let me know what you think of the suggested changes. I can prepare a pull request to resolve this issue if you wish.