oudalab / fajita

Event Data Tagging Tool
MIT License

Explore big dataset disk_stories_full in mongo db #197

Open YanLiang1102 opened 7 years ago

YanLiang1102 commented 7 years ago

Planning to use C++ directly to gain efficiency. Here is one record from the collection, which shows the schema: { "_id" : ObjectId("572c0953172ab83173aaf011"), "news_source" : "Associated Press International", "position_section" : "SPORTS NEWS", "word_count" : "577", "states" : [ "OHIO, USA" ], "id_type" : "DOC-ID", "date_added" : ISODate("2016-05-06T03:02:43.316Z"), "cities" : [ "OAKLAND, CA, USA" ], "article_title" : "Raiders go QB route for Fresno State's Carr", "article_body" : "NEW YORK (AP) - Derek Carr, the brother of former No.1 draft pick David Carr, was selected by the Oakland Raiders with selection 36 of the 2014 edition on Friday. A day after the first-round selections were completed, teams went about filling their rosters with further selections on Friday. Derek Carr is a quarterback like his brother, went to Fresno State, like his brother, and enters the league with a wife and child, like his brother, but is hoping the similarities end there, given David Carr's disappointing career after being drafted by Houston with much fanfare back in 2002. \"I learned everything that he did right and everything that he did wrong,\" Derek Carr said. \"He told me that if he could do anything, he hopes he made the path smoother for me as I transition into the NFL.\" Derek Carr rewrote the record book in his time in college, throwing for more than 10,000 yards and 100 touchdown passes, leading Fresno State to consecutive Mountain West Conference titles. Oakland has veteran Matt Schaub earmarked to be its starting quarterback, but he will get a serious push from Carr. In other picks Friday: - Houston used the first pick of the second round on UCLA guard Xavier Su'a-Filo, who joins the first overall pick, Jadeveon Clowney, in a upgraded defensive line. The 6-foot-4, 307-pound Su'a-Filo, who went on a Mormon mission while in college, also has played tackle. 
- The Cowboys took Boise State defensive end Demarcus Lawrence, who they hope will emulate their departed sacks leader with the same first name, DeMarcus Ware, now with Denver. \"I'm my own Demarcus,\" Lawrence said. \"I don't like to try to be nobody else. I'm going to be me, and I'm going to do it well.\" - Cleveland added a protector for new quarterback Johnny Manziel by grabbing guard Joel Bitonio of Nevada, who also can play tackle or center. The Browns caused the biggest stir on opening night when they traded up to No. 22 to get 'Johnny Football'. \"He's a heck of a quarterback,\" Bitonio said. \"Hopefully, he comes in and he's ready to compete and just ready to work and do well for the Cleveland Browns.\" Cleveland did not choose any receivers even though Josh Gordon is reportedly facing suspension by the NFL for violating the league's drug policy again. Gordon was suspended for the first two games of 2013, but still led the league with 1,646 yards receiving. - Eastern Illinois quarterback Jimmy Garoppolo went to New England near the end of the second round, and will be a backup to his favorite player Tom Brady. \"Whether I was coming in as the starter or as the backup, I'm going to go in and approach it the same way,\" Garoppolo said. \"I'm going to go out there and try to get better each and every day. That's what good football players do.\" - Washington may have got a bargain by selecting Virginia tackle Morgan Moses at pick 66. Moses had been earmarked as a potential first-round pick by many pundits but never received a call. \"I thought my phone was broken,\" Moses quipped. It took 54 selections, a draft record, for a running back to go. Bishop Sankey of Washington was chosen by Tennessee, who cut Chris Johnson this spring. Two more went in the next three selections: Jeremy Hill of Louisiana State to Cincinnati, and Carlos Hyde of Ohio State to San Francisco. AP College Football Writer Ralph D. 
Russo and Sports Writers Simmi Buttar, Schuyler Dixon and Josh Dubow contributed to this story. AP NFL website: www.pro32.ap.org and www.twitter.com/AP_NFL", "language" : "english", "stanford" : 0, "countries" : [ "UNITED STATES" ], "publication_date_raw" : "May 10, 2014 Saturday", "doc_id" : "TOPNEWSa6f3d910954c4d228015df686134c54d", "parsed" : 1, "queue_added" : 0 }
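
Several fields in the record above are stored as strings ("word_count" : "577", "publication_date_raw" : "May 10, 2014 Saturday"), so an exploration script has to convert them before counting or sorting. A minimal sketch, assuming the other documents follow the same layout as this one sample (that has not been checked across the collection); `normalize_story` is a hypothetical helper name:

```python
from datetime import datetime


def normalize_story(doc):
    """Return a small dict of typed fields from one story document.

    Assumes the field names and formats of the sample record; missing or
    malformed values become None instead of raising.
    """
    try:
        word_count = int(doc.get("word_count", ""))
    except (TypeError, ValueError):
        word_count = None

    try:
        # Format of the sample value: "May 10, 2014 Saturday".
        published = datetime.strptime(
            doc.get("publication_date_raw", ""), "%B %d, %Y %A"
        ).date()
    except (TypeError, ValueError):
        published = None

    return {
        "doc_id": doc.get("doc_id"),
        "news_source": doc.get("news_source"),
        "word_count": word_count,
        "published": published,
        "countries": list(doc.get("countries", [])),
    }
```

Returning None for bad values keeps a long scan from aborting on a single odd record; the None values can be counted afterwards to see how clean the fields are.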

YanLiang1102 commented 7 years ago

77300000 finished processing

Traceback (most recent call last):
  File "distinct.py", line 18, in <module>
    for i in largestory.find():
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/cursor.py", line 1132, in next
    if len(self.__data) or self._refresh():
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/cursor.py", line 1075, in _refresh
    self.__max_await_time_ms))
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/cursor.py", line 892, in __send_message
    **kwargs)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/mongo_client.py", line 950, in _send_message_with_response
    exhaust)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/mongo_client.py", line 961, in _reset_on_error
    return func(*args, **kwargs)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/server.py", line 136, in send_message_with_response
    response_data = sock_info.receive_message(1, request_id)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/pool.py", line 510, in receive_message
    self._raise_connection_failure(error)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/pool.py", line 610, in _raise_connection_failure
    raise error
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/pool.py", line 508, in receive_message
    self.sock, operation, request_id, self.max_message_size)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/network.py", line 137, in receive_message
    header = _receive_data_on_socket(sock, 16)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/network.py", line 170, in _receive_data_on_socket
    raise AutoReconnect("connection closed")
pymongo.errors.AutoReconnect: connection closed

MongoDB just died in the middle of the run, and I am not sure why:

2017-10-16T09:37:45.768-0500 W NETWORK [thread1] Failed to connect to 127.0.0.1:23755, in(checking socket for error after poll), reason: Connection refused
2017-10-16T09:37:45.776-0500 E QUERY [thread1] Error: couldn't connect to server localhost:23755, connection attempt failed :
connect@src/mongo/shell/mongo.js:237:13
@(connect):1:6
exception: connect failed

@cegme

cegme commented 7 years ago

Try and use the with statement to open the mongo document. Can you also add a timer? The connection may be timing out. Also, use JSON, stop using pickle.
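
The timer and the JSON suggestion can be combined: record the elapsed time next to the running counts and write both out as JSON instead of pickle. A minimal sketch using only the standard library; `save_progress`, `load_progress`, and the shape of the `counts` dict are assumptions, not taken from distinct.py:

```python
import json
import os
import time


def save_progress(path, processed, counts, started_at):
    """Write a JSON checkpoint atomically and return the elapsed seconds."""
    elapsed = time.monotonic() - started_at
    payload = {
        "processed": processed,
        "elapsed_seconds": round(elapsed, 3),
        "counts": counts,
    }
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)
    # os.replace is atomic on the same filesystem, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)
    return elapsed


def load_progress(path):
    """Return the last checkpoint, or None if there is none yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return None
```

Saving a checkpoint every few million records would show whether the failure always comes after roughly the same elapsed time, which is what a timeout would look like.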

YanLiang1102 commented 7 years ago

By mongo file, do you mean accessing that table? It is directly accessing the file on disk; it gets the data using pymongo through MongoDB.

YanLiang1102 commented 7 years ago

*It is not directly accessing the file on disk.

cegme commented 7 years ago

@YanLiang1102 by mongo document, I mean the MongoClient. In any case, I think it is our bad network complaining. Just catch this error, add a sleep, and try again.

import logging
import time

import pymongo

attempt = 1
done = False
while not done:
    try:
        # your code HERE
        done = True
    except pymongo.errors.AutoReconnect:
        # Back off exponentially: 2, 4, 8, ... seconds.
        delay = 2 ** attempt
        logging.info("Error connecting; sleeping for {} seconds".format(delay))
        time.sleep(delay)
        attempt += 1
        logging.info("retrying...")

YanLiang1102 commented 7 years ago

So you suspect our network times out when a connection is kept alive for too long, like 3 or 4 hours? I think it is more likely a mongo client issue: it might only keep the DB connection open for a certain amount of time. @cegme, since I run that code locally on portland, would the network still affect it?

cegme commented 7 years ago

The network can go out and interrupt a TCP connection at any time. A local run should not have a network problem; still, adding a sleep can be a remedy. Add socketKeepAlive=True to the mongo client connection.

YanLiang1102 commented 7 years ago

@cegme I don't think the code will work, Dr. Grant. The way you wrote it, the cursor is recreated on each retry, so it will loop from the beginning. I also noticed that each time this program dies, MongoDB is down, and sometimes I cannot even restart it and need to reboot to make it restart. It must be some bad query that brings the server down. I can take a look tonight and see if there is any solution for that.
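
One way around restarting from the beginning is to remember the last `_id` that was processed and re-issue the query from there after a failure. A minimal sketch: `resumable_scan`, `fetch_after`, `handle`, and `retry_on` are hypothetical names. With pymongo, `fetch_after(last_id)` would wrap something like `largestory.find({"_id": {"$gt": last_id}}).sort("_id", 1)` (and a plain sorted `find()` when `last_id` is None), with `retry_on=pymongo.errors.AutoReconnect`. It does not address the server itself going down.

```python
import logging
import time


def resumable_scan(fetch_after, handle, retry_on, max_retries=8, base_delay=1.0):
    """Process every document once, resuming after the last seen _id on error.

    fetch_after(last_id) must return documents with _id > last_id in
    ascending _id order (all documents when last_id is None).
    Returns the number of documents handled.
    """
    last_id = None
    failures = 0
    processed = 0
    while True:
        try:
            for doc in fetch_after(last_id):
                handle(doc)
                last_id = doc["_id"]
                processed += 1
                failures = 0  # progress was made, so reset the backoff
            return processed
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise
            delay = base_delay * 2 ** (failures - 1)
            logging.info("connection lost after _id=%r; retrying in %.1fs",
                         last_id, delay)
            time.sleep(delay)
```

Sorting on `_id` matters here: without a stable order, "everything after the last `_id`" is not well defined and documents could be skipped or repeated.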

YanLiang1102 commented 7 years ago

  File "1021.py", line 20, in <module>
    for i in largestory.find():
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/cursor.py", line 1132, in next
    if len(self.__data) or self._refresh():
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/cursor.py", line 1075, in _refresh
    self.__max_await_time_ms))
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/cursor.py", line 892, in __send_message
    **kwargs)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/mongo_client.py", line 950, in _send_message_with_response
    exhaust)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/mongo_client.py", line 961, in _reset_on_error
    return func(*args, **kwargs)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/server.py", line 136, in send_message_with_response
    response_data = sock_info.receive_message(1, request_id)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/pool.py", line 510, in receive_message
    self._raise_connection_failure(error)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/pool.py", line 610, in _raise_connection_failure
    raise error
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/pool.py", line 508, in receive_message
    self.sock, operation, request_id, self.max_message_size)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/network.py", line 137, in receive_message
    header = _receive_data_on_socket(sock, 16)
  File "/home/yan/.local/lib/python3.5/site-packages/pymongo/network.py", line 170, in _receive_data_on_socket
    raise AutoReconnect("connection closed")
pymongo.errors.AutoReconnect: connection closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "1021.py", line 35, in <module>
    except pymongo.errors.AutoReconnect:
NameError: name 'pymongo' is not defined

YanLiang1102 commented 7 years ago

Putting a limit on poolSize does help, but the mongo daemon still goes down. A nohup-style script can restart it automatically, just like what we do for the website:

import os
import time

while True:
    time.sleep(30)
    try:
        # Placeholder: fill in the MongoDB port.
        returncode = os.system("nc -zvv localhost <mongodb port>")
        if returncode != 0:
            # Placeholder: the restart command that we are using, with its --port.
            os.system("sudo <mongodb restart command that we are using> --port <port>")
    except Exception:
        os.system("forever start mongodb!!")
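
The `nc` check can also be done with a plain socket connect, which avoids shelling out just to test the port. A minimal sketch; `port_is_open` and `watchdog` are hypothetical names, and the restart command is passed in by the caller because the thread does not say what it is:

```python
import socket
import subprocess
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watchdog(host, port, restart_cmd, interval=30, max_checks=None):
    """Poll the port and run restart_cmd (an argv list) when it is closed.

    max_checks=None polls forever; a number makes the loop finite for testing.
    Returns how many times the restart command was run.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        time.sleep(interval)
        checks += 1
        if not port_is_open(host, port):
            subprocess.call(restart_cmd)
            restarts += 1
    return restarts
```

This only detects a closed port. A server that still accepts connections but no longer answers queries would need a real query as the health check.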