uvacw / inca

Reparse #488

Closed damian0604 closed 5 years ago

damian0604 commented 5 years ago

This PR provides a simpler way to re-parse the HTML of existing documents when the parser was broken and has been fixed in the meantime.

To test locally, first artificially remove the text of a document:


```
from inca import Inca

myinca = Inca()

# take the most recent 'nu' document and blank out its text
doc = myinca.database.doctype_last('nu')[0]
doc['_source']['text'] = ''
myinca.database.update_document(doc, force=True)

# check whether text really has been removed:
g = myinca.database.document_generator('_id:"{}"'.format(doc['_id']))
print(next(g))
```

Then, reparse all documents. By default, only documents where the text is missing are re-parsed:
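```
from inca import Inca
from inca.rssscrapers import news_scraper

myinca = Inca()

f = news_scraper.nu.parsehtml                  # parser to apply
g = myinca.database.doctype_generator('nu')    # documents to re-parse

# force=False: only documents with missing text are re-parsed
myinca.database.reparse(g, f, force=False)
```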

Check that nothing has been messed up.
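
One way to make this check concrete is to snapshot the documents before reparsing and compare them afterwards. The sketch below is not part of this PR: `diff_sources` is a hypothetical helper, and it only assumes that each document is a dict with `_id` and `_source` keys, as in the snippets above. The toy data stands in for documents fetched from the database.

```python
import copy


def diff_sources(before, after):
    """Return {_id: {field: (old, new)}} for every field that changed."""
    old_by_id = {d['_id']: d['_source'] for d in before}
    changes = {}
    for doc in after:
        old = old_by_id.get(doc['_id'], {})
        new = doc['_source']
        changed = {k: (old.get(k), new.get(k))
                   for k in set(old) | set(new)
                   if old.get(k) != new.get(k)}
        if changed:
            changes[doc['_id']] = changed
    return changes


# Toy data standing in for documents fetched before and after reparsing
before = [{'_id': 'a', '_source': {'title': 'T', 'text': ''}},
          {'_id': 'b', '_source': {'title': 'U', 'text': 'kept'}}]
after = copy.deepcopy(before)
after[0]['_source']['text'] = 'restored text'

print(diff_sources(before, after))
# {'a': {'text': ('', 'restored text')}}
```

With `force=False`, one would expect only documents whose text was empty beforehand to show up in the result.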

Finally, when run with `force=True` instead, it should reparse everything (which may be a VERY bad idea if the parser doesn't work, because it overwrites existing content).

mariekevh commented 5 years ago

As for the TODO: I think title can be replaced with the same logic: replace when empty string?
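
A minimal sketch of that idea, assuming the parser returns a dict of fields: each listed field is overwritten only if it is missing or an empty string, unless `force=True`. The names `merge_parsed` and `fields` are hypothetical and not taken from the PR's code.

```python
def merge_parsed(source, parsed, fields=('text', 'title'), force=False):
    """Copy parsed fields into a copy of source; without force, fill only empty ones."""
    updated = dict(source)
    for field in fields:
        if field not in parsed:
            continue  # parser produced nothing for this field
        if force or not updated.get(field):
            updated[field] = parsed[field]
    return updated


source = {'title': '', 'text': 'existing text'}
parsed = {'title': 'New title', 'text': 'reparsed text'}

print(merge_parsed(source, parsed))
# {'title': 'New title', 'text': 'existing text'}
print(merge_parsed(source, parsed, force=True))
# {'title': 'New title', 'text': 'reparsed text'}
```

Treating all fields through one loop keeps the "replace when empty" rule in a single place, so adding further fields would not need extra branches.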