openeventdata / phoenix_pipeline

Turning news into events since 2014.
MIT License

FTP Server Parameters for Uploader #86

Closed dbl001 closed 8 years ago

dbl001 commented 9 years ago

Hi,

What should the FTP server parameters be for the uploader? I tried 'localhost' and the Elastic IP address of my Ubuntu EC2 server, but I'm getting errors connecting:

Login to unsuccessful.
Error on the uploader. This step isn't absolutely necessary. Valid events should still be generated.
PHOX.pipeline end: 2015-08-21 21:42:06.258537

I don't see any 'stories' in MongoDB:

show dbs
local 0.078GB
show collections
show users
show logs
global
startupWarnings

dbl001 commented 9 years ago

Here's the PHOX_pipeline.log output:

INFO 2015-08-23 09:28:19,009: Formatting events for output.
INFO 2015-08-23 09:28:41,228: Writing event output.
INFO 2015-08-23 09:29:58,354: Running phox_uploader.py
INFO 2015-08-23 09:30:41,319: Logged into: 52.8.16.250//
WARNING 2015-08-23 09:32:22,564: Store of Phoenix.events.20150822.txt.zip unsuccessful
ERROR 2015-08-23 09:32:22,569: Store of Phoenix.events.20150822.txt .zip unsuccessful

WARNING 2015-08-23 09:33:45,925: Store of Phoenix.events.20150822.txt.zip unsuccessful
ERROR 2015-08-23 09:33:48,327: Transfer of Phoenix.events.20150822.txt unsuccessful

Here are the output files:

-rw-r--r-- 1 davidlaxer staff 268697 Aug 23 09:23 PETRARCH.log
-rw-r--r-- 1 davidlaxer staff 3 Aug 23 09:28 counter.txt
-rw-r--r-- 1 davidlaxer staff 43473 Aug 23 09:29 events.full.20150822.txt
-rw-r--r-- 1 davidlaxer staff 22 Aug 23 09:32 Phoenix.events.20150822.txt.zip
-rw-r--r-- 1 davidlaxer staff 88613 Aug 23 09:33 PHOX_pipeline.log

Any idea why the Phoenix.events.20150822.txt.zip is 22 bytes, but the events.full.20150822.txt is 43473 bytes containing 194 events?

johnb30 commented 9 years ago

You would need to add in the credentials for your own FTP server for it to work. Adding in an Elasticsearch IP wouldn't work since the program isn't set up to work with ES. You could write your own logic in the uploader.py file, though.

On the stories/file size issue, you need to make sure that you kick off a scraper to put stories into the database.

dbl001 commented 9 years ago

The stories are in MongoDB and PETRARCH has parsed them:

{
"_id" : ObjectId("55d99d06f5bce77a44a95b9b"),
"content" : "Ways to check drug abuse at education institutions discussed: Anti-Narcotics Force (ANF) Commander Brigadier Muhammad Abuzar held a meeting with the heads of educational institutions of Karachi on Saturday and discussed with them drug abuse and ways to prevent it. The spokesman for the ANF Sindh said they organised the meeting of heads of educational institutions, including universities and colleges, at the regional directorate for the prevention of rising trends of drug abuse.",
"source" : "int_the_news_karachi",
"date" : "Sun, 23 Aug 2015 00:00:00 +0500",
"language" : "english",
"title" : "Ways to check drug abuse at education institutions discussed",
"url" : "http://feedproxy.google.com/~r/TheNewsInternational-Karachi/~3/-HV_1OLdQC8/Todays-News-4-335727-Ways-to-check-drug-abuse-at-education-institutions-discussed",
"date_added" : ISODate("2015-08-23T10:14:30.115Z"),
"stanford" : 1,
"parsed_sents" : [
"(ROOT (S (S (NP (NNP Ways)) (VP (TO to) (VP (VB check) (NP (NN drug) (NN abuse)) (PP (IN at) (NP (NP (NN education) (NNS institutions)) (VP (VBN discussed)))))) (: :) (S (NP (NNP Anti-Narcotics) (NNP Force) (PRN (-LRB- -LRB-) (NP (NNP ANF)) (-RRB- -RRB-)) (NNP Commander) (NNP Brigadier) (NNP Muhammad) (NNP Abuzar)) (VP (VP (VBD held) (NP (NP (DT a) (NN meeting)) (PP (IN with) (NP (NP (DT the) (NNS heads)) (PP (IN of) (NP (NP (JJ educational) (NNS institutions)) (PP (IN of) (NP (NNP Karachi))))) (PP (IN on) (NP (NNP Saturday))))))) (CC and) (VP (VBN discussed) (PP (IN with) (NP (NP (PRP them)) (NP (NP (NN drug) (NN abuse)) (CC and) (NP (NP (NNS ways)) (SBAR (S (VP (TO to) (VP (VB prevent) (NP (PRP it)))))))))))))) (. .)))",
"(ROOT (S (NP (NP (DT The) (NN spokesman)) (PP (IN for) (NP (DT the) (NNP ANF) (NNP Sindh)))) (VP (VBD said) (SBAR (S (NP (PRP they)) (VP (VBD organised) (NP (NP (DT the) (NN meeting)) (PP (IN of) (NP (NP (NNS heads)) (PP (IN of) (NP (NP (JJ educational) (NNS institutions)) (, ,) (PP (VBG including) (NP (NNS universities) (CC and) (NNS colleges) (, ,) (PP (IN at) (NP (DT the) (JJ regional) (NN directorate))) (PP (IN for) (NP (NP (DT the) (NN prevention)) (PP (IN of) (NP (NP (VBG rising) (NNS trends)) (PP (IN of) (NP (NN drug) (NN abuse)))))))))))))))))) (. .)))"
]
}
Type "it" for more

show collections
stories
system.indexes
db.stories.count()
8637

FTP to my EC2 server @52.8.16.250 is working:

David-Laxers-MacBook-Pro:phoenix_pipeline davidlaxer$ ftp 52.8.16.250
Connected to 52.8.16.250.
220 (vsFTPd 3.0.2)
Name (52.8.16.250:davidlaxer): david
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> bin
200 Switching to Binary mode.
ftp> put events.full.20150822.txt
local: events.full.20150822.txt remote: events.full.20150822.txt
229 Entering Extended Passive Mode (|||40046|).
ftp: Can't connect to `52.8.16.250': Operation timed out
200 EPRT command successful. Consider using EPSV.
150 Ok to send data.
100% |***| 43643 9.57 MiB/s 00:00 ETA
226 Transfer complete.
43643 bytes sent in 00:00 (278.04 KiB/s)
ftp> ls -l events.full.20150822.txt
output to local-file: events.full.20150822.txt [anpqy?]? q
output to local-file: aborted.
ftp> ls events.full.20150822.txt
200 EPRT command successful. Consider using EPSV.
150 Here comes the directory listing.
-rw-r--r-- 1 1003 1005 43643 Aug 24 14:38 events.full.20150822.txt
226 Directory send OK.
ftp>
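
One detail in the session above may matter for a scripted upload: the extended passive mode data connection times out, and the transfer only succeeds after the client falls back to active mode (EPRT). Python's ftplib uses passive mode by default, so a script could hit the same timeout without the fallback. The following is a minimal sketch of an upload helper that retries in active mode. It assumes the upload is done with ftplib, which this thread does not confirm; the function names and the host and credentials in the usage note are hypothetical.

```python
import os
from ftplib import FTP, all_errors


def upload_file(ftp, local_path, passive=True):
    """Send local_path in binary mode; return True on success, False on an FTP/socket error."""
    ftp.set_pasv(passive)  # True = passive (ftplib default), False = active
    remote_name = os.path.basename(local_path)
    with open(local_path, "rb") as handle:
        try:
            ftp.storbinary("STOR " + remote_name, handle)
        except all_errors:
            return False
    return True


def upload_with_fallback(ftp, local_path):
    """Try passive mode first, then retry once in active mode (hypothetical policy)."""
    return upload_file(ftp, local_path, passive=True) or upload_file(
        ftp, local_path, passive=False
    )
```

Usage would look roughly like `ftp = FTP("ftp.example.org", timeout=30)`, then `ftp.login("user", "password")`, then `upload_with_fallback(ftp, "events.full.20150822.txt")`. Opening the server's passive port range in the EC2 security group is the other way to address the timeout.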

It's not an Elasticsearch IP; it's the EC2 Elastic IP address that my instance uses.

dbl001 commented 9 years ago

The phoenix_pipeline uses these file stems (from PHOX_config.ini):

[Pipeline]
scraper_stem = scraperresults
recordfile_stem = eventrecords.
fullfile_stem = events.full.
eventfile_stem = Phoenix.events.
dupfile_stem = Phoenix.dupindex.
outputfile_stem = Phoenix.events.20
newsourcestem = new sources.

PETRARCH created the file for the fullfile stem correctly:

-rw-r--r-- 1 davidlaxer staff 43643 Aug 23 15:50 events.full.20150822.txt

However, the uploader tries to zip the file named with the eventfile stem (Phoenix.events.20150822.txt), but that file doesn't exist.

-rw-r--r-- 1 davidlaxer staff 22 Aug 23 12:08 events.20150822.txt.zip
-rw-r--r-- 1 davidlaxer staff 530 Aug 23 12:19 PHOX_config.ini
-rw-r--r-- 1 davidlaxer staff 43643 Aug 23 15:50 events.full.20150822.txt
-rw-r--r-- 1 davidlaxer staff 4 Aug 23 15:50 counter.txt
-rw-r--r-- 1 davidlaxer staff 22 Aug 23 15:50 Phoenix.events.20150822.txt.zip
-rw-r--r-- 1 davidlaxer staff 88791 Aug 23 15:50 PHOX_pipeline.log
-rw-r--r-- 1 davidlaxer staff 268697 Aug 23 15:50 PETRARCH.log
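
The 22-byte size of both .zip files in this listing is consistent with an empty archive: a zip file with no entries consists only of the 22-byte end-of-central-directory record. The sketch below reproduces that size with Python's zipfile module and shows one way to refuse to create an archive when the source file is missing. Whether the uploader builds its archive this way is an assumption; the function names are hypothetical.

```python
import os
import tempfile
import zipfile


def zip_event_file(source_path, archive_path):
    """Zip source_path into archive_path, refusing to create an empty archive."""
    if not os.path.isfile(source_path):
        # Fail before ZipFile creates the archive, so no 22-byte stub is left behind.
        raise FileNotFoundError(source_path)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(source_path, arcname=os.path.basename(source_path))
    return os.path.getsize(archive_path)


def empty_archive_size():
    """Size in bytes of a zip archive opened for writing that received no files."""
    with tempfile.TemporaryDirectory() as workdir:
        archive_path = os.path.join(workdir, "empty.zip")
        with zipfile.ZipFile(archive_path, "w"):
            pass  # nothing is added, as when the source file does not exist
        return os.path.getsize(archive_path)


print(empty_archive_size())  # 22
```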

johnb30 commented 9 years ago

Sorry, I misunderstood the question. I would suggest changing the eventfile_stem to 'events.full.' (including the trailing dot). I'm a little hazy on how the zip/FTP upload works, since our production version is a little bit different from the current open source version.
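
Before changing a stem, it can help to check which configured stems actually correspond to a file on disk for a given date. This is a rough sketch using Python's configparser. It assumes output files are named stem + date + ".txt", which matches events.full.20150822.txt in this thread but is not confirmed for every stem; the function name and the sample text are illustrative only.

```python
import configparser
import os

# Two stems copied from the [Pipeline] section quoted earlier in this thread.
SAMPLE_CONFIG = """
[Pipeline]
fullfile_stem = events.full.
eventfile_stem = Phoenix.events.
"""


def find_existing_outputs(config_text, date_string, directory="."):
    """Map each *_stem key in [Pipeline] to whether stem + date + '.txt' exists."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    found = {}
    for key, stem in parser["Pipeline"].items():
        if not key.endswith("_stem"):
            continue
        candidate = os.path.join(directory, stem + date_string + ".txt")
        found[key] = os.path.isfile(candidate)
    return found
```

With the files listed in this thread, `find_existing_outputs(SAMPLE_CONFIG, "20150822")` would report fullfile_stem as present and eventfile_stem as missing, which is the mismatch the suggested config change works around.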

dbl001 commented 9 years ago

Ok. It seemed like a configuration issue.
