rjurney / Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
http://bit.ly/agile_data_science
MIT License

chapter 4 save from pyspark to MongoDB exhausts Vagrant VM memory #96

Closed pjhinton closed 5 years ago

pjhinton commented 5 years ago

In Chapter 4, where flight data is published to MongoDB from pyspark:

https://learning.oreilly.com/a/agile-data-science/21213057/

if I use a Vagrant image based on box ubuntu/bionic64 built off of this source tree:

pjhinton/Agile_Data_Code_2@9fe4c71d21294078f304a1590d8d339257709b55

MongoDB runs out of memory:

Dec 21 17:09:02 ubuntu-bionic mongod[29354]: src/central_freelist.cc:333] tcmalloc: allocation failed 32768
Dec 21 17:09:03 ubuntu-bionic mongod[29354]: message repeated 2 times: [ src/central_freelist.cc:333] tcmalloc: allocation failed 32768]
Dec 21 17:09:06 ubuntu-bionic systemd[1]: mongodb.service: Main process exited, code=exited, status=14/n/a
Dec 21 17:09:06 ubuntu-bionic systemd[1]: mongodb.service: Failed with result 'exit-code'.

The version of MongoDB in use is the one supplied by Ubuntu:

# dpkg -l mongodb
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                          Version                     Architecture                Description
+++-=============================================-===========================-===========================-================================================================================================
ii  mongodb

Versions of pymongo and pymongo-spark in use are:

$ pip show pymongo
Name: pymongo
Version: 3.7.2
Summary: Python driver for MongoDB <http://www.mongodb.org>
Home-page: http://github.com/mongodb/mongo-python-driver
Author: Bernie Hackett
Author-email: bernie@mongodb.com
License: Apache License, Version 2.0
Location: /home/vagrant/anaconda/lib/python3.5/site-packages
Requires: 
Required-by: pymongo-spark
$ pip show pymongo-spark
Name: pymongo-spark
Version: 0.1.dev0
Summary: Utilities for using Spark with PyMongo
Home-page: https://github.com/mongodb/mongo-hadoop
Author: MongoDB, Inc.
Author-email: mongodb-user@googlegroups.com
License: http://www.apache.org/licenses/LICENSE-2.0.html
Location: /home/vagrant/anaconda/lib/python3.5/site-packages/pymongo_spark-0.1.dev0-py3.5.egg
Requires: pymongo
Required-by: 

The version of Spark is the one installed by bootstrap.sh: 2.2.1.

rjurney commented 5 years ago

Yes. This is why I use the EC2 script. It is hard to work with even moderately sized data in a Vagrant VM. If you have the RAM to spare, you can alter the Vagrant settings to give the VM more memory. That is my suggestion; otherwise, use EC2.
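
For anyone taking the more-RAM route, here is a minimal sketch of the Vagrantfile change, assuming the VirtualBox provider. The 8192 MB figure is an illustrative assumption, not a value from this thread or from the repository's Vagrantfile; size it to what the host can spare.

```ruby
# Vagrantfile (fragment): raise the memory available to the VM.
# Assumes the VirtualBox provider; 8192 is an illustrative value.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/bionic64"  # the box named in the report above

  config.vm.provider "virtualbox" do |vb|
    vb.memory = 8192  # RAM for the guest, in MB
  end
end
```

After editing the file, `vagrant reload` restarts the VM so the new memory setting takes effect.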
