This PR provides a simpler way to re-parse HTML code of existing documents if the parser was broken and has been fixed in the meantime.
To test locally, first artificially remove the text of a document:
```
from inca import Inca
myinca = Inca()
doc = myinca.database.doctype_last('nu')[0]
doc['_source']['text'] = ''
myinca.database.update_document(doc, force=True)
# check whether text really has been removed:
g = myinca.database.document_generator('_id:"{}"'.format(doc['_id']))
print(next(g))
```
Then, reparse all documents. By default, only documents where the text is missing are re-parsed:
```
from inca import Inca
from inca.rssscrapers import news_scraper
myinca = Inca()
f = news_scraper.nu.parsehtml
g = myinca.database.doctype_generator('nu')
myinca.database.reparse(g, f, force=False)
```
Check that the text has been restored and nothing else is messed up.
Finally, when run with `force=True` instead, it should reparse everything (which may be a VERY bad idea if the parser doesn't work, because it overwrites the existing content).
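To make the difference between the two modes concrete, here is a minimal standalone sketch of the intended semantics. It is not the code in this PR: `reparse_sketch`, `toy_parser` and the `htmlsource` field name are assumptions made up for illustration, and it works on plain dicts instead of the database.

```python
def reparse_sketch(documents, parse, force=False):
    """Hypothetical stand-in for the reparse behaviour described above.

    Returns the ids of the documents that were re-parsed.
    """
    updated = []
    for doc in documents:
        source = doc["_source"]
        # With force=False, documents that already have text are left alone.
        if source.get("text") and not force:
            continue
        # Assumption: the parser takes raw HTML and returns a dict of fields.
        source.update(parse(source.get("htmlsource", "")))
        updated.append(doc["_id"])
    return updated


def toy_parser(html):
    # Toy parser for illustration only; not the real nu parser.
    return {"text": html.replace("<p>", "").replace("</p>", "")}


docs = [
    {"_id": "a", "_source": {"htmlsource": "<p>new</p>", "text": ""}},
    {"_id": "b", "_source": {"htmlsource": "<p>fresh</p>", "text": "old"}},
]
print(reparse_sketch(docs, toy_parser))              # only "a" is re-parsed
print(reparse_sketch(docs, toy_parser, force=True))  # both are overwritten
```

With `force=False` only document `a` (empty text) is touched; with `force=True` the existing text of `b` is overwritten as well, which is exactly the risk if the parser is still broken.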