graphistry / pygraphistry

PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer.
BSD 3-Clause "New" or "Revised" License

[BUG] ValueError: Expected Pandas/Arrow/cuDF/Spark dataframe(s) or igraph/NetworkX graph when calling spark.sql() #556

Open DataBoyTX opened 5 months ago

DataBoyTX commented 5 months ago

Describe the bug

The following code used to work but now throws an error, presumably because the type of the resulting df changed from SparkDataFrame to pyspark.sql.connect.dataframe.DataFrame:

df = spark.sql("SELECT * FROM honeypot")

g2 = graphistry.edges(df, 'attackerIP', 'victimIP')

g2.plot()

Simply adding .toPandas() to the df passed to edges() works around the problem, but we should handle this in the client.
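Until the client handles this, the .toPandas() workaround can be wrapped in a small guard. This is only a minimal sketch: `is_spark_dataframe` and `coerce_edges_input` are hypothetical helper names, not part of PyGraphistry, and the check assumes that both the classic and the Spark Connect DataFrame classes are named `DataFrame`, live under a `pyspark.sql` module, and expose `toPandas()`.

```python
def is_spark_dataframe(obj):
    """Heuristic check for classic and Spark Connect DataFrames.

    Duck-typed on the class name, module path, and toPandas() so that
    pyspark does not need to be imported (assumption: both DataFrame
    flavors live under a module starting with 'pyspark.sql').
    """
    cls = type(obj)
    module = getattr(cls, "__module__", "") or ""
    return (
        cls.__name__ == "DataFrame"
        and module.startswith("pyspark.sql")
        and callable(getattr(obj, "toPandas", None))
    )


def coerce_edges_input(df):
    """Hypothetical helper: convert Spark DataFrames to pandas.

    Anything that does not look like a Spark DataFrame is returned
    unchanged, so pandas/Arrow/cuDF inputs pass straight through.
    """
    if is_spark_dataframe(df):
        # toPandas() collects the full result onto the driver.
        return df.toPandas()
    return df
```

With these helpers, the failing call would become graphistry.edges(coerce_edges_input(df), 'attackerIP', 'victimIP'). Because toPandas() pulls the whole result onto the driver, this only suits tables that fit in driver memory.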

error:


ValueError: Expected Pandas/Arrow/cuDF/Spark dataframe(s) or igraph/NetworkX graph.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command-2934552628071172>, line 1
----> 1 g.plot()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphistry/PlotterBase.py:1404, in PlotterBase.plot(self, graph, nodes, name, description, render, skip_upload, as_files, memoize, extra_html, override_html_style)
   1401 PyGraphistry.refresh()
   1402 logger.debug("4. @PloatterBase plot: PyGraphistry.org_name(): {}".format(PyGraphistry.org_name()))
-> 1404 dataset = self._plot_dispatch(g, n, name, description, 'arrow', self._style, memoize)
   1405 if skip_upload:
   1406     return dataset

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphistry/PlotterBase.py:1701, in PlotterBase._plot_dispatch(self, graph, nodes, name, description, mode, metadata, memoize)
   1698 except ImportError:
   1699     pass
-> 1701 error('Expected Pandas/Arrow/cuDF/Spark dataframe(s) or igraph/NetworkX graph.')

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphistry/util.py:280, in error(msg)
    279 def error(msg):
--> 280     raise ValueError(msg)

ValueError: Expected Pandas/Arrow/cuDF/Spark dataframe(s) or igraph/NetworkX graph.

To Reproduce

Lab 2 - Data Preparation and Styling-ExpectedPandasArrowSparkDataframe.zip

lmeyerov commented 5 months ago

We should support multiple Spark versions. It sounds like this potentially impacts these: