gt-big-data / QDoc

Quick & Dirty Operating Crawler

Don't make DB query when parsing an article. #27

Open supersam654 opened 8 years ago

supersam654 commented 8 years ago

We're making a DB query while parsing each article to figure out which tags to remove for that article's source. This can cause deadlocks on some systems and warnings on others (I believe warnings on Debian and possible deadlocks on OSX). It would be better to fetch the source-specific cleaning rules from the DB up front and pass that data along with the article HTML to the parser, so the parsing step never touches the DB.

A sample error is:

/home/bdc/anaconda2/lib/python2.7/site-packages/pymongo/topology.py:74: UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#using-pymongo-with-multiprocessing>
  "MongoClient opened before fork. Create MongoClient "
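
One way the proposed fix could look, as a rough sketch rather than QDoc's actual code: load the per-source cleaning rules once in the parent process, then hand each worker the article HTML together with its tag list so the parser never opens a DB connection. The names `load_cleaning_rules`, `parse_article`, `parse_all`, and the document fields `name`, `tags_to_remove`, `source`, and `html` are all hypothetical, and the sketch assumes Python 3 (the traceback above is from Python 2.7).

```python
from html.parser import HTMLParser
from multiprocessing import Pool


class _TagStripper(HTMLParser):
    """Collects text, skipping everything inside the given tags.

    Assumes every tag in tags_to_remove has a closing tag; void elements
    such as <img> are not handled in this sketch.
    """

    def __init__(self, tags_to_remove):
        super().__init__()
        self.tags_to_remove = set(tags_to_remove)
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_remove:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags_to_remove and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def load_cleaning_rules(sources):
    """Run once in the parent process, before any workers are started.

    `sources` is assumed to be a pymongo Collection (anything with find()
    works); the field names are hypothetical.
    """
    return {doc["name"]: doc.get("tags_to_remove", []) for doc in sources.find()}


def parse_article(job):
    """No DB access here, so it is safe to run in a forked worker."""
    html, tags_to_remove = job
    stripper = _TagStripper(tags_to_remove)
    stripper.feed(html)
    stripper.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(stripper.chunks).split())


def parse_all(articles, rules, processes=2):
    """Pair each article's HTML with its source's rules, then fan out."""
    jobs = [(a["html"], rules.get(a["source"], [])) for a in articles]
    with Pool(processes) as pool:
        return pool.map(parse_article, jobs)
```

With this shape, the only code that talks to MongoDB is `load_cleaning_rules`, which runs before the pool is created. The workers receive plain strings and lists, so they should never touch a `MongoClient` inherited across a fork.
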

supersam654 commented 8 years ago

This is actually a breaking issue on Windows and is pretty severe. Hopefully someone remedies it soon.