jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

edge detection algorithm #38

Closed wanghaisheng closed 6 years ago

wanghaisheng commented 7 years ago

A reference note from Anssi Nurminen's master's thesis: "An edge in an image is defined as an above-threshold change in intensity value of neighboring pixels. Choosing a threshold value too high, some of the more subtle visual aids on a page will not be detected, while a threshold value too low can result in a lot of erroneously interpreted edges." "The edge detection process is divided into four distinct steps that are described in more detail in the following chapters:

  1. Finding horizontal edges.
  2. Finding vertical edges.
  3. Finding crossing edges and creating “snapping points”.
  4. Finding cells (closed rectangular areas)."

Could you point me to where this wonderful tool implements this edge detection algorithm? @jsvine
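
For illustration only (not code from the thesis or from pdfplumber): a minimal NumPy sketch of the quoted definition, where an edge is an above-threshold change in intensity between neighboring pixels and the threshold trades missed edges against spurious ones.

import numpy as np

def edge_mask(gray, threshold=40, axis=1):
    """Mark pixels whose intensity differs from the next pixel along `axis`
    by more than `threshold`. `gray` is a 2-D grayscale array."""
    gray = gray.astype(np.int16)              # avoid uint8 wrap-around when differencing
    diff = np.abs(np.diff(gray, axis=axis))   # intensity change between neighboring pixels
    return diff > threshold                   # too high: subtle lines missed; too low: noise becomes "edges"
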
jsfenfen commented 7 years ago

Hey @wanghaisheng, dunno what @jsvine has planned, but at the moment this project doesn't do machine-vision-type algorithms at all--instead it sits on top of pdfminer, a library that reads the PDF file internals and returns all the "things" in the PDF file. PDF is a display-oriented language, so most of what's in the files are instructions, like "show an 'e' in Arial-10 font at this x-y position". pdfminer returns those details--and it's important to note that they aren't present in image-based PDFs that haven't been OCR'ed, etc. So the inputs are characters with bounding boxes. Finding the horizontal edges of a row of text is an operation performed on the bounding boxes. Not quite sure if that's what you're asking?
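
To make that concrete, here's roughly what those per-character "things" look like through pdfplumber (the file name is just a placeholder, and the exact keys shown are from a recent release):

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars[:5]:
        # Each char is a dict of PDF-internal details: the glyph plus its bounding box and font.
        print(char["text"], char["x0"], char["top"], char["x1"], char["bottom"], char["fontname"])
    # Lines and rectangles drawn in the PDF are exposed the same way:
    print(len(page.lines), len(page.rects))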

wanghaisheng commented 7 years ago

@jsfenfen thanks very much. Now I can say you implement some kind of edge detection and table detection on the native PDF instead of treating the PDF as an image. Do you have an estimate of how long it would take to implement the vision-based approach with something like OpenCV?

jsfenfen commented 7 years ago

Do you have an estimate of how long it would take to implement the vision-based approach with something like OpenCV?

Hey @wanghaisheng, that's something I definitely don't know! I think the idea of using programs like pdfminer is that the extracted data alone is pretty powerful--or at least helpful in capturing really repetitive, machine-generated structures. But doing fancier things with OpenCV is solving a different problem (probably?). A lot of work on images and text is deep-learning-based OCR, though arguably what's needed more than that is deep-learning-based layout detection. But you might look into the work of David Doermann.

wanghaisheng commented 7 years ago

@jsfenfen no, that is not what I am looking for. To my knowledge we can use tools like pdfminer to get x/y positions, fonts, and distances as one-dimensional features to determine the layout of our domain-specific data, but with the supplement of two-dimensional features from an image rendered from the PDF, we can get more precise table detection and extraction later on.

jsvine commented 7 years ago

Hi @wanghaisheng, and thanks @jsfenfen. pdfplumber implements a version of Nurminen's approach in table.py. One major difference, as @jsfenfen notes, is that Nurminen's original approach uses computer vision to detect edges, whereas pdfplumber uses the lines and character bounding boxes explicitly provided by fully digital PDFs.

I'm working on a major update to pdfplumber that will change some aspects of how the user specifies which types of edges to use when finding tables. But the general approach — and, especially, the conversion of these edges to tables — will remain the same.
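
For example, in recent pdfplumber releases the edge sources are chosen through table settings roughly like this (setting names may differ from the version discussed above, and the file name is a placeholder):

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table({
        "vertical_strategy": "lines",    # use vertical edges drawn in the PDF
        "horizontal_strategy": "lines",  # use horizontal edges drawn in the PDF
    })
    print(table)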

wanghaisheng commented 6 years ago

@jsvine @jsfenfen a little code for you guys:

import sys

import cv2


class detectTable(object):
    def __init__(self, src_img):
        self.src_img = src_img

    def run(self):
        # Convert to grayscale if the input has color channels.
        if len(self.src_img.shape) == 2:  # already grayscale
            gray_img = self.src_img
        elif len(self.src_img.shape) == 3:
            gray_img = cv2.cvtColor(self.src_img, cv2.COLOR_BGR2GRAY)
        else:
            raise ValueError("unexpected image shape: %s" % (self.src_img.shape,))

        # Invert and binarize so that dark ruling lines become white foreground pixels.
        thresh_img = cv2.adaptiveThreshold(
            ~gray_img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, -2)
        h_img = thresh_img.copy()
        v_img = thresh_img.copy()
        scale = 15

        # Erode then dilate with a wide, 1-pixel-tall kernel to keep only horizontal lines.
        h_size = int(h_img.shape[1] / scale)
        h_structure = cv2.getStructuringElement(cv2.MORPH_RECT, (h_size, 1))  # morphological kernel
        h_erode_img = cv2.erode(h_img, h_structure, iterations=1)
        h_dilate_img = cv2.dilate(h_erode_img, h_structure, iterations=1)

        # Same idea with a tall, 1-pixel-wide kernel to keep only vertical lines.
        v_size = int(v_img.shape[0] / scale)
        v_structure = cv2.getStructuringElement(cv2.MORPH_RECT, (1, v_size))  # morphological kernel
        v_erode_img = cv2.erode(v_img, v_structure, iterations=1)
        v_dilate_img = cv2.dilate(v_erode_img, v_structure, iterations=1)

        # Combine the two line masks, and keep their intersections as grid "joints".
        mask_img = cv2.bitwise_or(h_dilate_img, v_dilate_img)
        joints_img = cv2.bitwise_and(h_dilate_img, v_dilate_img)
        joints_img = cv2.dilate(joints_img, None, iterations=3)
        cv2.imwrite("joints.png", ~joints_img)
        cv2.imwrite("mask.png", ~mask_img)


if __name__ == '__main__':
    img = cv2.imread(sys.argv[1])
    detectTable(img).run()