Open thequbit opened 10 years ago
This has been implemented as a very light wrapper in db_api.py.
Example of actual document broadcast:
{
    u'source_id': u'b9f5e0a8-2390-43ae-8ef1-77651c6b3d7c',
    u'message': {
        u'doc_url': u'http://timduffy.me/Resume-TimDuffy-20130813.pdf',
        u'link_text': u'Resume',
        u'url_data': {
            u'status': u'running',
            u'doc_type': u'application/pdf',
            u'start_datetime': u'2014-07-18 15:34:25',
            u'target_url': u'http://timduffy.me/',
            u'max_link_level': 3,
            u'description': u"Tim Duffy's Personal Website",
            u'title': u'TimDuffy.Me',
            u'runs': [],
            u'scraper_id': u'b9f5e0a8-2390-43ae-8ef1-77651c6b3d7c',
            u'frequency': 2,
            u'finish_datetime': u'',
            u'creation_datetime': u'2014-07-18 15:34:25',
            u'allowed_domains': []
        },
        u'scrape_datetime': u'2014-07-18 15:34:25'
    },
    u'command': u'found_doc',
    u'destination_id': u'broadcast'
}
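A consumer of a broadcast like this would probably check `command` and `destination_id` before using the payload. Here is a minimal sketch of that check, assuming the message has already been decoded into a Python dict. The helper name `extract_found_doc` is hypothetical and not part of the project.

```python
def extract_found_doc(envelope):
    """Return the 'message' payload of a broadcast 'found_doc' envelope.

    Hypothetical helper: returns None for any other command or destination.
    """
    if envelope.get("command") != "found_doc":
        return None
    if envelope.get("destination_id") != "broadcast":
        return None
    return envelope.get("message")


# Trimmed-down version of the broadcast shown above.
envelope = {
    "source_id": "b9f5e0a8-2390-43ae-8ef1-77651c6b3d7c",
    "command": "found_doc",
    "destination_id": "broadcast",
    "message": {
        "doc_url": "http://timduffy.me/Resume-TimDuffy-20130813.pdf",
        "link_text": "Resume",
    },
}

payload = extract_found_doc(envelope)
print(payload["doc_url"])
```

Returning `None` for anything that is not a broadcast `found_doc` keeps the caller's handling of unrelated commands to a single check.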
Response from db_api.get_one_not_uploaded_document() :
{
    u'uploaded': False,
    u'doc_url': u'http://timduffy.me/Resume-TimDuffy-20130813.pdf',
    u'url_data': {
        u'status': u'running',
        u'doc_type': u'application/pdf',
        u'start_datetime': u'2014-07-18 15:32:34',
        u'target_url': u'http://timduffy.me/',
        u'max_link_level': 3,
        u'description': u"Tim Duffy's Personal Website",
        u'title': u'TimDuffy.Me',
        u'runs': [],
        u'scraper_id': u'692127e0-9d3a-4f99-ae39-f206e2a32f75',
        u'frequency': 2,
        u'finish_datetime': u'',
        u'creation_datetime': u'2014-07-18 15:32:34',
        u'allowed_domains': []
    },
    u'link_text': u'Resume',
    u'scrape_datetime': u'2014-07-18 15:32:35',
    u'insert_datetime': u'2014-07-18 15:32:35.075349',
    u'source_id': u'692127e0-9d3a-4f99-ae39-f206e2a32f75',
    u'_id': ObjectId('53c97653a70f9e356ba0df44')
}
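Comparing the two dumps, the stored record looks like the broadcast's `message` lifted to the top level, with `source_id`, `uploaded`, and `insert_datetime` added (MongoDB assigns `_id` on insert). The sketch below shows that mapping under those assumptions. The names `to_record` and `first_not_uploaded` are hypothetical, a plain list stands in for the MongoDB collection, and the real db_api.py may well differ.

```python
import copy
from datetime import datetime


def to_record(envelope, now=None):
    """Flatten a broadcast envelope into the shape of the stored record.

    Hypothetical helper; `now` can be passed in to make the result repeatable.
    """
    record = copy.deepcopy(envelope["message"])
    record["source_id"] = envelope["source_id"]
    record["uploaded"] = False
    # str(datetime) yields e.g. '2014-07-18 15:32:35.075349'
    record["insert_datetime"] = str(now or datetime.now())
    return record


def first_not_uploaded(records):
    """Return the first record not yet marked as uploaded, or None."""
    for record in records:
        if not record.get("uploaded", False):
            return record
    return None


envelope = {
    "source_id": "692127e0-9d3a-4f99-ae39-f206e2a32f75",
    "command": "found_doc",
    "destination_id": "broadcast",
    "message": {
        "doc_url": "http://timduffy.me/Resume-TimDuffy-20130813.pdf",
        "link_text": "Resume",
        "scrape_datetime": "2014-07-18 15:32:35",
    },
}

store = [to_record(envelope)]  # in-memory stand-in for the collection
print(first_not_uploaded(store)["doc_url"])
```

Using `record.get("uploaded", False)` treats records that lack the field as not uploaded, which suits data with no fixed schema.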
example payload:
MongoDB is a great fit since the data is 'schemaless'.