usgs / groundmotion-processing

Parsing and processing ground motion data
Other
54 stars 42 forks source link

Can we use numpy structured arrays to eliminate the dependency on pandas? #42

Closed baagaard-usgs closed 3 years ago

baagaard-usgs commented 5 years ago

pandas is a pretty big module with several dependencies, including intel-openmp (at least when installed via miniconda). It looks like it is only used in stream. Could we use numpy structured arrays instead and eliminate this dependency?

The dependency of pandas on intel-openmp via miniconda generates some parallel processing related warnings on my machine when I use the python multiprocessing module in some other code.

emthompson-usgs commented 5 years ago

I would also like for us to get rid of pandas as a dependency (largely because I think it is a poorly designed package) but pandas is pretty heavily used and so a migration to any other solution will require significant work at this point. I also think that the use of virtual environments should solve any conflicts you are having with other code.

Did you install this code with the install.sh script that puts it in a "gmprocess" virtual environment? If so, does it still cause conflicts with other code?

baagaard-usgs commented 5 years ago

It looked to me like pandas use was isolated, but it may be that the dataframes ended up getting passed around to a lot of other routines.

Parallel processing in ground-motion processing

I would like groundmotion-workworkspace and groundmotion-processing to be designed so that we can leverage multiple cores to speed up the processing. For example, we could process each station in a different Python process using the multiprocessing module. If pandas conflicts with this, then that would be another reason to eliminate it as a dependency.

Installation info

I manually installed the dependencies via conda install using the list of packages in the install.sh file. I created a separate workspace, so there should not be any conflicts. My comment about conflicts with multiprocessing in other code was for another project in which I used miniconda.

emthompson-usgs commented 5 years ago

Good points and we'll keep this in mind. On other projects we have done some parallelization in environments that also had pandas so I don't think it is a deal breaker. But we'll have to think carefully about how the parallelization is done and consider pandas alternatives.

cbworden commented 5 years ago

Parallelization in Python can be tricky. Many packages appear to be written with non-reentrant functions, or with casual use of globals that may be changed by the importing process. I think this is largely due to the perception that the GIL will prevent any problems from what is, honestly, lazy, incompetent programming. Matplotlib is chief among the culprits. I don't know about Pandas, but it wouldn't surprise me. In any event, we will have to approach parallelization on a case by case basis. Cython can give us some options that pure Python cannot, by more or less sidestepping the GIL so we can do multithreading in C. We do this in ShakeMap and it has been very effective (but not without complications in terms of added dependencies).

Also, I would be careful about pointing fingers. It may be Pandas that throws the error, but the problem may exist elsewhere. For instance, I was recently getting errors that looked like they were coming from the concurrent futures module, but were really the result of bad Matplotlib implementation. Good OOP rarely needs globals, and trying to emulate Matlab's plotting was, IMHO, a gigantic mistake. And now we are all stuck with a package that should have had a very narrow audience, but has become the only viable way to plot anything in Python.

Using Cython largely avoids these issues by keeping the threading within the compute-heavy functions. Numpy also does a good job of optimizing and threading some operations via BLAS and the like (e.g., matrix multiply), so it's almost impossible to improve their performance for some operations. But where Numpy is not optimized, Cython is again a good option. Numexpr has also proven to be good at threading the things that Numpy won't, like element-by-element operations on matrices. But in my experience, Cython is usually somewhat faster, and I've converted all of ShakeMap's Numexpr operations to Cython with good result. Ultimately, profiling and experimentation will reveal the best path to follow.

emthompson-usgs commented 5 years ago

To give some specifics about optimization what we've done so far that @baagaard-usgs may not know about:

emthompson-usgs commented 3 years ago

Cleaning out stale issues... I think we're not going to be able to remove pandas at this point but we will be working on improving parallelization in the near future.