iuni-cadre / Fellow2-citation-cascades

CADRE fellow project with Yi Bu, Chao Min & Ying Ding

Python script that makes better use of more CPUs #2

Open XiaoranYan opened 5 years ago

XiaoranYan commented 5 years ago

Hi Filipi,

Thanks very much for these details! Can we schedule some time to meet about this? How about this Wednesday, Thursday, or Friday during the day, except Thursday morning? I'm in CST, so we may have a one-hour time difference.

By the way, I've tried some packages in R, but I'm not very familiar with it, so it might be hard for me to use R in my research. I personally prefer Python in practice.

Best, Yi


From: Nascimento Silva, Filipi
Sent: Tuesday, July 16, 2019 23:19
To: Bu, Yi
Cc: Hutchinson, Matthew Alexander; Pentchev, Valentin; Min, Chao; Yan, Xiaoran
Subject: Re: Python script that makes better use of more CPUs

Hi Yi,

As an alternative to pyspark, you can parallelize your code to run across the cores of a single machine (or node). This can be accomplished by using the joblib package, or you can do it manually with the multiprocessing or subprocess packages.

Here you can find some examples/documentation: https://joblib.readthedocs.io/en/latest/parallel.html

It is very easy to use and may not require many changes to your existing code, at least for independent loops or those with minimal writes to shared memory.
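For instance, a minimal joblib sketch; `process_paper` is a hypothetical stand-in for your loop body:

```python
from joblib import Parallel, delayed

def process_paper(paper_id):
    # hypothetical per-paper work; replace with your own loop body
    return paper_id ** 2

# run the loop body across all available cores (n_jobs=-1)
results = Parallel(n_jobs=-1)(
    delayed(process_paper)(pid) for pid in range(1000)
)
```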

regards, Filipi

On Jul 16, 2019, at 1:30 PM, Yan, Xiaoran yan30@iu.edu wrote:

Hi Yi,

That's a great question, and we have been thinking about how best to help you while also preparing for our Rome events. We are considering using your use case as a demo while we work together to find a solution to your scalability problem.

There are several ways you can parallelize your code. You can use a standard parallel-processing package; see https://www.machinelearningplus.com/python/parallel-processing-python/ for an overview. Or, if your problem can be cast into a map-reduce shape, PySpark is a much easier tool, and you have already learned some of its basics. If it is a machine learning problem, there are standard packages like scikit-learn and TensorFlow, some of which can even use GPUs to further speed things up.
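As an illustration of the map-reduce shape, here is a minimal PySpark sketch; the input file `citations.csv` and its `citing`/`cited` columns are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("citation-counts").getOrCreate()

# map: load the citation edge list; reduce: aggregate citations per cited paper
edges = spark.read.csv("citations.csv", header=True)
counts = edges.groupBy("cited").count()
counts.write.csv("citation_counts", header=True)
```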

Mat, Filipi, and I all have some experience in different settings, and the right choice really depends on what you want to do with your code. We need details to proceed at this point. If you think this is appropriate, we can start planning our next meeting soon. This would also mean you would be presenting at the tutorial, and Chao would probably give the oral presentation at the workshop?

Let us know what you think.

Xiaoran

XiaoranYan commented 5 years ago

On 7/16/19 10:00 AM, Hutchinson, Matthew Alexander wrote:

Hi Yi,

I’m afraid I don’t have any experience with parallel programming in Python, but here’s a brief intro to parallel programming in R: https://www.r-bloggers.com/how-to-go-parallel-in-r-basics-tips/ It may have some useful concepts, and if you really want, you can call R libraries directly from Python: https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python
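One common package for this (not named in the link above, so treat this as an assumed example) is rpy2:

```python
# a minimal sketch of calling R from Python, assuming rpy2 is installed
import rpy2.robjects as robjects

# evaluate an R expression and pull the result back into Python
result = robjects.r("mean(c(1, 2, 3, 4))")
print(result[0])  # 2.5
```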

A quick Google of ‘in-memory parallelism python’ turns up some articles that may be useful: https://www.toptal.com/python/beginners-guide-to-concurrency-and-parallelism-in-python The easiest type of parallel programming is when you have a large amount of data and need to perform the same operation on each datum. You split the data into equal pieces, with each worker performing the operation on its own chunk of the data.
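A minimal sketch of that pattern with Python's standard-library multiprocessing module; `transform` is a hypothetical per-datum operation:

```python
from multiprocessing import Pool

def transform(x):
    # hypothetical per-datum operation; replace with your own
    return x * 2

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Pool splits the data into chunks across one worker process per CPU
    with Pool() as pool:
        results = pool.map(transform, data, chunksize=10_000)
```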

Xiaoran probably has a more sophisticated understanding than I do but if it’s helpful, I can show you some of my parallel R code.

Matthew Hutchinson | INDIANA UNIVERSITY
Data Manager, IU Network Science Institute (IUNI)
1001 E SR 45/46 Bypass | Bloomington, IN 47408-1415
Email: maahutch@iu.edu | Phone: (812) 855-1404 | Fax: (812) 856-1192

From: Bu, Yi buyi@iu.edu
Sent: Monday, July 15, 2019 4:39 PM
To: Hutchinson, Matthew Alexander maahutch@iu.edu; Yan, Xiaoran yan30@iu.edu
Subject: Python script that makes better use of more CPUs

Hi Matthew and Xiaoran,

Great to meet you guys today. You mentioned that I could improve my Python script to make better use of more CPUs for my project. Are there any resources (e.g., Python packages, example scripts, Linux command-line tools) that I could refer to?

Best, Yi

XiaoranYan commented 5 years ago

Hi Yi,

Sure. We plan to use your use case to build a demo for the ISSI tutorial, but let us first try to solve your problem on the new enclave. Please share your Python script with us and we can start from there.

I can meet on Wednesday, Thursday or Friday afternoon. Filipi, when would be a good time for you? Mat, do you want to join the meeting?

Xiaoran

XiaoranYan commented 5 years ago

Hi Yi,

I have solved your problem of finding all indirect citations for WoS papers using Spark. You can find the resulting CSV (110 GB) at /raw/cascades/indirectCitations.csv in the IUNI1 enclave, with the following columns:

focal (ego paper), connector, LE (late endorser).

In total there are 1.83 billion such triples. Please let me know if you find any problems with the data.

Since the data involves WoS and this repo is public, please refer to the MAG demo for code examples: https://github.com/iuni-cadre/Fellow2-citation-cascades/issues/3
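For context, the triples above correspond to two-hop citation paths, which can be computed as a self-join over the citation edge list. A minimal PySpark sketch of the idea, assuming a hypothetical input `citations.csv` with columns `citing` and `cited` (see the MAG demo above for the actual code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("indirect-citations").getOrCreate()
edges = spark.read.csv("citations.csv", header=True)

# first hop: connector cites focal; second hop: late endorser cites connector
hop1 = edges.select(edges.cited.alias("focal"), edges.citing.alias("connector"))
hop2 = edges.select(edges.cited.alias("connector"), edges.citing.alias("LE"))

# self-join on the shared connector column to form (focal, connector, LE) triples
triples = hop1.join(hop2, on="connector").select("focal", "connector", "LE")
triples.write.csv("indirectCitations", header=True)
```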