robinarthur / 5pk


Text preparation #1

Closed robinarthur closed 6 years ago

robinarthur commented 6 years ago

https://blog.exploratory.io/demystifying-text-analytics-part-2-quantifying-documents-by-calculating-tf-idf-in-r-756955faa1ea

robinarthur commented 6 years ago

import glob
from chardet.universaldetector import UniversalDetector

# Detect the encoding of every XML file in the current directory (Python 3).
detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print(filename.ljust(60), end='')
    detector.reset()
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            # Stop early once the detector is confident.
            if detector.done:
                break
    detector.close()
    print(detector.result)

robinarthur commented 6 years ago

Reading in from outside -> decode. Writing out to outside -> encode. In Python (with UTF-encoded source code) -> u'München'
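A minimal sketch of that rule in Python 3, assuming the incoming bytes are UTF-8. The sample bytes and the variable names are made up for illustration:

```python
# Bytes as they might arrive from a file or a socket (UTF-8 is assumed here).
raw = b"M\xc3\xbcnchen"

# Reading in from outside: decode bytes into text.
text = raw.decode("utf-8")

# Writing out to outside: encode text into bytes in the target encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(text)
print(utf8_bytes, latin1_bytes)
```

Inside the program everything stays text; bytes only appear at the boundaries.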

robinarthur commented 6 years ago

https://stackoverflow.com/a/12886818/7477664 Unzip the EPUB first, then continue with BeautifulSoup.
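A rough sketch of the unzip step, using only the standard-library `zipfile` module (this is not necessarily what the linked answer does). The helper name `read_epub_documents`, the file-extension filter, and the UTF-8 decoding are assumptions, and the tiny in-memory archive only stands in for a real EPUB:

```python
import io
import zipfile


def read_epub_documents(epub_file):
    """Return {member name: text} for the (X)HTML members of an EPUB.

    Hypothetical helper: an EPUB is a ZIP archive, so zipfile can open it.
    Content is assumed to be UTF-8 encoded.
    """
    documents = {}
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                documents[name] = archive.read(name).decode("utf-8")
    return documents


# Build a tiny stand-in archive in memory so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hallo</p></body></html>")
buffer.seek(0)

docs = read_epub_documents(buffer)
print(sorted(docs))
```

Each returned string can then be passed to BeautifulSoup, for example `BeautifulSoup(text, "html.parser")`, to strip the markup.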

robinarthur commented 6 years ago

Everything is fine in the current commit; this issue can be closed.