nlpub / chinese-whispers

An implementation of Chinese Whispers in Python.
https://chinese-whispers.readthedocs.io
MIT License
58 stars 13 forks source link

Cython implementation #2

Closed flppgg closed 5 years ago

flppgg commented 5 years ago

This is a Cython implementation of the original module. Still a beta version, it only supports the "top" algorithm. I left the original files unchanged, a part from adding a new section to the README.

alexanderpanchenko commented 5 years ago

Hello, this is really great idea which we discussed but never had time to implement. However please make sure all checks pass and make it like an extra class not a complete replacement of the current network x algorithm.

dustalov commented 5 years ago

I like the idea! I believe this might be an attractive option for making CW quick. I recommend paying attention to the following:

  1. Please make sure that your code is compatible with the last version in master because the API has changed since 0.2 (and do not commit anything related to the versioning).
  2. Please also make sure that your code does not break the current implementation by using it in-place. How do you represent a graph in memory? I think the best option here should be to build an adjacency matrix by copying the nx.Graph structure.

Just out of curiosity: CW is a fast algorithm and the current implementation should work fine in most cases. How large are your graphs? What is the difference in running time?

Sorry, I will be on vacation next week far away from the keyboard.

flppgg commented 5 years ago

Ok thank you for your comments! _I will make it an extra class no problem, I just have to figure out the best way of merging the setup file. _The graph is represented in memory as c arrays. First I use the networkx to_scipy_sparse_matrix method, using the 'lil' format. Then I build 5 arrays that hold the labels, edge and weight data in a convenient format. I will add some comments to the code to make it more clear. _I had graphs with a few million nodes and about 10mln edges, and I didn't want to use your java implementation. The python version was really just too slow. _there are still a few bits I want to improve, especially with the weighted graph function. Do you have suggestions on how to store the weight data for each edge? I am now using numpy.float32, but it takes a heavy toll on memory.

flppgg commented 5 years ago

Finally managed to fix all the issues in the setup file. Basically it tries to build the Cython module (chinese_whispers.cyt) from the pyx file if Cython is installed, or from the precompiled C file otherwise. If numpy is not installed, or if the C file is not available, it installs your version of the module. As you can see in the README, you can run it with chinese_whispers.cyt.chinese_whispers(G). I hope this is ok?

dustalov commented 5 years ago

Just arrived. I will look at the code tomorrow.

dustalov commented 5 years ago

I have checked the code. I am looking positively at accepting this pull request. Before this I need to understand the following things:

  1. We use Travis CI for releasing our library to PyPI. Since the code to be pulled is experimental, I would like to keep releasing the pure-Python version. Is it possible to use our original setup.py while those who want blazingly fast computation can manually build and install your code via setup-cython.py?
  2. I do not understand why an automatically generated file cyt.c is included in the repository. I had no prior experience with Cython, but can it be generated automatically while building? If so, it can be in .gitignore.
  3. Cython code uses print() to standard output for logging. Could you please use sys.stderr instead?
flppgg commented 5 years ago

(EDIT Sorry I replied directly to the email and the formatting messed up)

Thank you Dmitry,

The reason why the c file is included is that this is the suggested way to distribute Cython software as you can see in the docs: https://cython.readthedocs.io/en/latest/src/userguide/source_files_and_compilation.html#distributing-cython-modules

Basically it allows users who don't have Cython installed to use the pre-compiled c file.

Perhaps a good solution to your first point is to add a YES / NO input clause during installation, where the user is asked if they want to install the cyt module or just the Python implementation. What do you think? If you'd still rather do it the way you proposed it's ok for me.

I will fix the logging to sys.sterr this evening, no problem.

dustalov commented 5 years ago

Thank you for pointing this out! We rely on Travis CI for uploading packages to PyPI, so as soon as we can produce the .c file during the build, it is fine to avoid committing this file.

My primary concern is the further maintenance of this code. I need to be generally confident in understanding the pipeline. I hope to finish polishing this PR till the end of this week.

flppgg commented 5 years ago

Ok it's no problem with me, let me know if you have other questions / suggestions and we can get it done in a few days.

dustalov commented 5 years ago

After several hours of research, I still have no idea how to deploy this from Travis CI to PyPI properly. Please submit only the cyt.pyx file. I will bring the rest of the wrapping code and configuration.

flppgg commented 5 years ago

Ok I am sorry to hear that. I have very little experience with Travis CI but I can have a look at the docs and see if I find a useful example. In theory, large python projects (maintained on Github and deployed to PyPI) often have optional Cython components to speed up execution, so I think we should be able to find material to copy from.

dustalov commented 5 years ago

Let's see what happens after merging your code :)

flppgg commented 5 years ago

Ok so should I remove the .c file? and add a setup_cython file like it was at the very beginning?

dustalov commented 5 years ago

Yes, please keep only cyt.pyx. I will arrange everything needed to make it work.

dustalov commented 5 years ago

Closing this because #3 was merged. Thank you!