sguo666 / ExtractionfromPDF


How to extract info from PDF #2

Open sguo666 opened 6 years ago

sguo666 commented 6 years ago

References for scanned PDFs:

  1. http://xiaofeima1990.github.io/2016/12/19/extract-text-from-sanned-pdf/
  2. https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
  3. https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/
  4. https://datascience.blog.wzb.eu/category/pdfs/
  5. https://stackoverflow.com/questions/6026287/batch-ocr-program-for-pdfs/6553015#6553015
sguo666 commented 6 years ago
  1. https://medium.com/@winston.smith.spb/python-ocr-for-pdf-or-compare-textract-pytesseract-and-pyocr-acb19122f38c
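
A rough sketch of the OCR workflow these references describe: render each PDF page to an image, then run OCR on each image. It assumes the pdftoppm (Poppler) and tesseract command-line tools are installed and on PATH; the function names, the 300 dpi default, and the "page" file prefix are illustrative choices, not taken from any of the linked posts.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the command that renders every PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, lang="eng"):
    """Build the command that OCRs one image and prints the text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, work_dir, dpi=300, lang="eng"):
    """Return a list with the OCR text of each page (requires both tools)."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"  # hypothetical prefix for the rendered images
    subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)

    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work_dir.glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True,
            capture_output=True,
            text=True,
        )
        texts.append(result.stdout)
    return texts
```

Calling ocr_pdf("scan.pdf", "ocr_tmp") would return one string per page. A higher dpi usually helps recognition but makes rendering slower.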

Issues:

  1. "ImportError: MagickWand shared library not found. You probably had not installed ImageMagick library. Try to install: brew install freetype imagemagick" Solution: https://stackoverflow.com/questions/37011291/python-wand-image-is-not-recognized/41772062#41772062

  2. Running "pip install textract" fails with "Failed building wheel for pocketsphinx" and "unable to execute 'swig': No such file or directory". Solution: install swig, see http://macappstore.org/swig/

  3. Frequently asked pypdf questions on Stack Overflow: https://stackoverflow.com/questions/tagged/pypdf?sort=frequent&pagesize=50#_=_
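
A small preflight check for the two missing dependencies behind issues 1 and 2, using only the standard library. This is a sketch: the function name is made up, and the library names tried for MagickWand are guesses at common builds. find_library may also miss a library that Wand itself can still locate (for example through the MAGICK_HOME environment variable), so treat a False result as a hint, not proof.

```python
import shutil
from ctypes.util import find_library

# Assumed library names; the exact name depends on the ImageMagick build.
MAGICKWAND_NAMES = ("MagickWand", "MagickWand-6.Q16", "MagickWand-7.Q16HDRI")


def check_dependencies():
    """Report whether swig and the MagickWand shared library look installed."""
    return {
        # textract's pocketsphinx dependency needs the swig executable to build.
        "swig": shutil.which("swig") is not None,
        # Wand needs the MagickWand shared library that ships with ImageMagick.
        "MagickWand": any(find_library(name) for name in MAGICKWAND_NAMES),
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```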

sguo666 commented 6 years ago

Spelling Check and Correction:

  1. http://norvig.com/spell-correct.html
  2. https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python
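
A cut-down sketch in the spirit of the Norvig corrector linked above, for cleaning up OCR output. It only considers candidates one edit away (the original also tries two edits), and the tiny SAMPLE_TEXT corpus is a stand-in: a real run would count words from a large text file or word list.

```python
import re
from collections import Counter

# Stand-in corpus (an assumption for this sketch); use a large text in practice.
SAMPLE_TEXT = (
    "extract text from scanned pdf documents with ocr and then check "
    "the spelling of each extracted word the text may contain tables"
)


def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


WORD_COUNTS = Counter(words(SAMPLE_TEXT))


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(candidates):
    """Keep only the candidates that appear in the corpus."""
    return {w for w in candidates if w in WORD_COUNTS}


def correction(word):
    """Return the most frequent known word within one edit, else the word itself."""
    word = word.lower()
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

The known() helper doubles as a simple "is this a word" check against whatever vocabulary WORD_COUNTS was built from.
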
sguo666 commented 6 years ago

General References:

  1. http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html
sguo666 commented 6 years ago

Extracting text while excluding tables: https://stackoverflow.com/questions/1848464/advanced-pdf-parsing-using-python-extracting-text-without-tables-etc-whats