krishnanlab / PecanPy

A fast, parallelized, memory efficient, and cache-optimized Python implementation of node2vec
BSD 3-Clause "New" or "Revised" License

csr_matrices #122

Open kwchurch opened 2 years ago

kwchurch commented 2 years ago

I have a large csr_matrix in npz format. I'd like to use it as input as is, but it doesn't have an IDs field.

I added this to graph.py (but it doesn't work):

if 'IDs' in raw:
    self.set_node_ids(raw["IDs"].tolist())
else:
    # added by kwc                                                                                                                                                                                                                          
    self.set_node_ids(np.arange(raw["shape"][0]).tolist())

Created edg2npz.py with this:

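import sys

import numpy as np
import scipy.sparse

# Usage: python edg2npz.py OUTPUT.npz {bool|int} < INPUT.edg
dtype = bool
if sys.argv[2] == "int":
    dtype = int

X = []
Y = []

# Read one edge per line; the first two fields are integer node indices.
for line in sys.stdin:
    fields = line.rstrip().split()
    if len(fields) >= 2:
        x, y = fields[0:2]
        X.append(int(x))
        Y.append(int(y))

X = np.array(X, dtype=np.int32)
Y = np.array(Y, dtype=np.int32)
# The matrix must be large enough to index the largest node seen.
N = 1 + max(np.max(X), np.max(Y))
V = np.ones(len(X), dtype=bool)

# Build the N x N adjacency matrix and save it in scipy's npz format.
M = scipy.sparse.csr_matrix((V, (X, Y)), dtype=dtype, shape=(N, N))

scipy.sparse.save_npz(sys.argv[1], M)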

called it with

python edg2npz.py demo/karate.bool.npz bool < demo/karate.edg 

Unfortunately, I can't use this kind of csr_matrix...

I can write out my matrix to text and then run pecanpy on that, but my matrix is very large and it will take a long time to write it out and read it back. My matrix has N = 300M nodes and E=2B nonzero edges.

 pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF
init pecanpy: p = 1, q = 1, workers = 1, verbose = False, extend = False, gamma = 0, random_state = None
WARNING: when p = 1 and q = 1 with unweighted graph, highly recommend using the FirstOrderUnweighted over SparseOTF. The runtime could be improved greatly with improved  memory usage.
Took 00:00:00.02 to load Graph
Took 00:00:00.00 to pre-compute transition probabilities
Traceback (most recent call last):
  File "/home/k.church/venv/gft/bin/pecanpy", line 8, in <module>
    sys.exit(main())
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 333, in main
    walks = simulate_walks(args, g)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/wrappers.py", line 18, in wrapper
    result = func(*args, **kwargs)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 320, in simulate_walks
    return g.simulate_walks(args.num_walks, args.walk_length)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/pecanpy.py", line 153, in simulate_walks
    walk_idx_mat = self._random_walks(
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 468, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function itruediv>) found for signature:

 >>> itruediv(array(bool, 1d, C), Literal[int](1))

There are 6 candidate implementations:

RemyLau commented 2 years ago

Hi @kwchurch, thank you for the detailed dev log! I slightly edited the format to improve readability. At first glance, it looks to me like an issue of incompatible dtypes. More specifically, the csr used by PecanPy uses uint32 for both the indices and indptr fields, rather than int32 as used by scipy.sparse.csr. Similarly, PecanPy uses float32 instead of float64 for the data field in the csr object.
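To illustrate why a boolean data array is a problem for numerical code, here is a minimal sketch in plain NumPy. It does not use PecanPy or Numba, so the exception type differs from the TypingError in the log above; the array `weights` is a made-up stand-in for the data field of a boolean csr matrix.

```python
import numpy as np

# Made-up stand-in for the data field of a boolean csr matrix.
weights = np.ones(4, dtype=bool)

try:
    # In-place true division produces floats, which cannot be stored
    # back into a bool array under NumPy's default casting rule.
    weights /= 1
    inplace_ok = True
except TypeError:
    inplace_ok = False

# After an explicit cast, the same in-place operation is well defined.
weights32 = weights.astype(np.float32)
weights32 /= 2

print("in-place division on bool succeeded:", inplace_ok)
print("dtype after cast:", weights32.dtype)
```

The point of the sketch is only that the in-place operation needs a floating-point array to write into, which is consistent with casting the data field before the random walk code sees it.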

I think to resolve the type issue, the most straightforward solution is to enforce the desired types (i.e., float32 for data; uint32 for indices and indptr) at loading time: https://github.com/krishnanlab/PecanPy/blob/49d60630b4589eeab992eef2da9c2eaf6b19fab8/src/pecanpy/graph.py#L432-L438
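As a rough sketch of what enforcing those types could look like (this is not the code at the link above; `cast_csr_arrays` is a hypothetical helper written for illustration), the three csr arrays can be cast with `astype` after checking that the index values fit into uint32:

```python
import numpy as np

# Largest value a uint32 can hold (4294967295).
UINT32_MAX = int(np.iinfo(np.uint32).max)


def cast_csr_arrays(data, indices, indptr):
    """Cast raw csr arrays to float32 data and uint32 indices/indptr.

    Hypothetical helper for illustration; not part of PecanPy.
    """
    data = np.asarray(data)
    indices = np.asarray(indices)
    indptr = np.asarray(indptr)

    # Refuse values that uint32 cannot represent instead of letting
    # astype wrap them around silently.
    for name, arr in (("indices", indices), ("indptr", indptr)):
        if arr.size and (int(arr.min()) < 0 or int(arr.max()) > UINT32_MAX):
            raise ValueError(f"{name} does not fit into uint32")

    return (
        data.astype(np.float32),
        indices.astype(np.uint32),
        indptr.astype(np.uint32),
    )


# A 3-node toy graph with bool data and int32 indices/indptr,
# matching the dtypes produced by the edg2npz.py script above.
data, indices, indptr = cast_csr_arrays(
    np.array([True, True, True, True]),
    np.array([1, 0, 2, 1], dtype=np.int32),
    np.array([0, 1, 3, 4], dtype=np.int32),
)
print(data.dtype, indices.dtype, indptr.dtype)
```

For a graph of the size mentioned in this issue (300M nodes, 2B nonzero entries), both the largest node index and the largest indptr value would stay below the uint32 limit of 4,294,967,295, so the range check in this sketch should pass.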

I will first try to reproduce the error here using the example script you provided, and then see if my proposed solution actually fixes the issue.

As we also discussed, I will add the option of implicitly assigning node IDs if they are not found in the .csr.npz file. I will make it require a "soft confirmation" from the user that the implicit assignment is desired: a warning message about the implicit assignment is printed unless a specific flag (e.g., --implicit_node_ids) is set.
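The "soft confirmation" idea can be sketched as follows. This is an illustration only, not the implementation that ended up in PecanPy: `resolve_node_ids` and its `implicit_ids` parameter are hypothetical names, and the fallback IDs are plain integers 0..N-1 as in the snippet from the original post.

```python
import warnings


def resolve_node_ids(raw, implicit_ids=False):
    """Return the stored node IDs, or fall back to 0..N-1.

    Hypothetical helper; `raw` is any mapping with a "shape" entry,
    for example the object returned by numpy.load on an .npz file.
    """
    if "IDs" in raw:
        return list(raw["IDs"])

    num_nodes = int(raw["shape"][0])
    if not implicit_ids:
        # Soft confirmation: proceed, but tell the user what happened.
        warnings.warn(
            "No 'IDs' field found; assigning node IDs 0..N-1 implicitly.",
            stacklevel=2,
        )
    return list(range(num_nodes))


# A plain dict stands in for a loaded .npz file without an IDs field.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ids_warned = resolve_node_ids({"shape": (3, 3)})
    n_after_default = len(caught)
    ids_silent = resolve_node_ids({"shape": (3, 3)}, implicit_ids=True)
    n_after_flag = len(caught)

print(ids_warned, n_after_default, n_after_flag)
```

Issuing a warning rather than raising an error keeps existing scipy-style files usable while still making the implicit assignment visible.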

RemyLau commented 2 years ago

Hi @kwchurch, I've created a new branch (see #124) implementing my suggestions above (explicit dtype setting and implicit node IDs setting). The scipy csr karate test case works fine on my end.

In the meantime, if you would like to give the new changes a try and let me know if this resolves your issue, that would be great. You can run it as before using

pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF

which will warn you about the implicit node IDs setting. To suppress that, you can set the --implicit_ids flag:

pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF --implicit_ids

kwchurch commented 2 years ago

ok

do you think it could check the datatypes and make the necessary conversions automatically?

RemyLau commented 2 years ago

@kwchurch yes it is doing that now https://github.com/krishnanlab/PecanPy/blob/a12f27c608bb5b72651481b80380bffdf42053ab/src/pecanpy/graph.py#L443-L445

kwchurch commented 2 years ago

great

kwchurch commented 2 years ago

let me know when you have something ready to try out

RemyLau commented 2 years ago

@kwchurch it is ready to be tried out, but it is not on the main branch. You'll need to check out the scipy-csr branch, where you will find the new changes.

RemyLau commented 2 years ago

Hi @kwchurch, I have completed some more testing and merged the new feature (implicit IDs) back to the main branch (see 2d58132807089e8f5fbd5095be342149a039bf18). Let me know if you get a chance to test and see if this works in your case.

kwchurch commented 1 year ago

I have some graphs with nodes that have no edges

Is that a problem?

init pecanpy: p = 1, q = 1, workers = 16, verbose = True, extend = True, gamma = 0, random_state = None
/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/rw/sparse_rw.py:30: RuntimeWarning: Mean of empty slice.
  data[indptr[i] : indptr[i + 1]].mean()
/home/k.church/venv/gft/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/k.church/venv/gft/lib/python3.8/site-packages/numpy/core/_methods.py:262: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/home/k.church/venv/gft/lib/python3.8/site-packages/numpy/core/_methods.py:222: RuntimeWarning: invalid value encountered in true_divide
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
/home/k.church/venv/gft/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/var/spool/slurm/d/job27656002/slurm_script", line 8, in
    sys.exit(main())
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 333, in main
    walks = simulate_walks(args, g)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/wrappers.py", line 18, in wrapper
    result = func(*args, **kwargs)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 320, in simulate_walks
    return g.simulate_walks(args.num_walks, args.walk_length)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/pecanpy.py", line 153, in simulate_walks
    walk_idx_mat = self._random_walks(
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 468, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function() found for signature:

 imul(array(bool, 1d, C), array(float64, 1d, C))

There are 8 candidate implementations:
 - Of which 4 did not match due to:
 Overload of function 'imul': File: : Line N/A.
   With argument(s): '(array(bool, 1d, C), array(float64, 1d, C))':
  No match.
 - Of which 2 did not match due to:
 Overload in function 'NumpyRulesInplaceArrayOperator.generic': File: numba/core/typing/npydecl.py: Line 244.
   With argument(s): '(array(bool, 1d, C), array(float64, 1d, C))':
  Rejected as the implementation raised a specific error:
    AttributeError: 'NoneType' object has no attribute 'args'
  raised from /home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/typing/npydecl.py:255
 - Of which 2 did not match due to:
 Operator Overload in function 'imul': File: unknown: Line unknown.
   With argument(s): '(array(bool, 1d, C), array(float64, 1d, C))':
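
Two separate things seem to show up in this log. The "Mean of empty slice" warnings are what NumPy emits when `.mean()` is called on an empty array, which is what the slice `data[indptr[i] : indptr[i + 1]]` yields for a node with no edges. The final TypingError, in contrast, mentions an in-place multiply on a bool array, which looks like the dtype problem discussed earlier rather than the isolated nodes. The sketch below only addresses the first point: it computes per-node mean edge weights from csr arrays while guarding against empty rows. It is not PecanPy's code; `row_means` and its `fill` parameter are hypothetical.

```python
import numpy as np


def row_means(data, indptr, fill=0.0):
    """Mean edge weight per row of a csr matrix; empty rows get `fill`.

    Hypothetical helper for illustration; not part of PecanPy.
    """
    data = np.asarray(data, dtype=np.float64)
    indptr = np.asarray(indptr, dtype=np.int64)

    # Number of stored entries in each row.
    counts = np.diff(indptr)

    # Prefix sums turn each row sum into a difference of two entries,
    # so no empty slice is ever reduced.
    csum = np.concatenate(([0.0], np.cumsum(data)))
    sums = csum[indptr[1:]] - csum[indptr[:-1]]

    means = np.full(counts.shape, fill, dtype=np.float64)
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return means


# Three nodes; the middle one has no edges (indptr repeats the value 2).
means = row_means([1.0, 3.0, 5.0], [0, 2, 2, 3])
print(means)
```

Whether an isolated node should get a mean of 0, NaN, or be skipped entirely depends on how the walk code uses the value, so `fill` is left as a parameter here.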
