Closed wanghaisheng closed 6 years ago
Hey @wanghaisheng , dunno what @jsvine has planned, but at the moment this project doesn't do machine-vision-type algorithms at all--instead it sits on top of pdfminer, a library that reads the pdf file internals and returns all the "things" in the pdf file. PDF is a display-oriented language, so most of what's in the files are instructions, like 'show an 'e' in Arial-10 font at this x-y position'. pdfminer returns those details--and it's important to note that they aren't present in 'image'-based pdfs that haven't been OCR'ed etc. So the 'input' is characters with bounding boxes. Finding the horizontal edges of a row of text is an operation performed on the bounding boxes. Not quite sure if that's what you're asking?
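The bounding-box idea above can be sketched like this (my own toy example, not pdfplumber internals; the char dicts just mimic the `x0`/`x1`/`top` keys that pdfminer reports for each character):

```python
# Sketch: group characters into rows by vertical position, then take
# each row's horizontal edges from the character bounding boxes.

def rows_from_chars(chars, tolerance=3):
    """Cluster chars whose 'top' values are within `tolerance`,
    then return each row's (left_edge, right_edge)."""
    rows = []
    for ch in sorted(chars, key=lambda c: c["top"]):
        if rows and abs(ch["top"] - rows[-1][-1]["top"]) <= tolerance:
            rows[-1].append(ch)  # same text line
        else:
            rows.append([ch])    # start a new line
    return [
        (min(c["x0"] for c in row), max(c["x1"] for c in row))
        for row in rows
    ]

# Two lines of text, described only by bounding boxes:
chars = [
    {"text": "H", "x0": 10, "x1": 18, "top": 100},
    {"text": "i", "x0": 19, "x1": 24, "top": 101},
    {"text": "o", "x0": 10, "x1": 17, "top": 120},
    {"text": "k", "x0": 18, "x1": 26, "top": 120},
]
print(rows_from_chars(chars))  # → [(10, 24), (10, 26)]
```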
@jsfenfen thx very much. Now I can say you guys implemented some kind of edge detection and table detection on the native PDF instead of treating the PDF like an image. Do you have an estimate of how long it would take to implement the vision-based approach using something like opencv?
Hey @wanghaisheng, that's something I definitely don't know! I think the idea of using programs like pdfminer is that the extracted data alone is pretty powerful--or at least helpful--in capturing really repetitive, machine-generated structures. But doing fancier things with opencv is solving a different problem (probably?). A lot of work on images and text is deep-learning-based OCR, though arguably what's needed more than that is deep-learning-based layout detection. But you might look into the work of David Doermann.
@jsfenfen no, that is not what I am looking for. To my knowledge, we can use tools like pdfminer to get x/y coordinates, using fonts, distance, and position as one-dimensional features to determine the layout for our domain-specific data. But with the supplement of two-dimensional features from an image converted from the pdf, we can get more precise table detection and extract the data later on.
Hi @wanghaisheng, and thanks @jsfenfen. pdfplumber implements a version of Nurminen's approach in table.py. One major difference, as @jsfenfen notes, is that Nurminen's original approach uses computer vision to detect edges, whereas pdfplumber uses the lines and character bounding boxes explicitly provided by fully-digital PDFs.

I'm working on a major update to pdfplumber that will change some aspects of how the user specifies which types of edges to use when finding tables. But the general approach — and, especially, the conversion of these edges to tables — will remain the same.
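To make the "conversion of these edges to tables" step concrete, here is a rough sketch (my own toy code, not pdfplumber's actual implementation): once you know the x-positions of the vertical rules and the y-positions of the horizontal rules, each adjacent pair of positions bounds a cell.

```python
# Sketch: turn explicit rule positions into cell bounding boxes.

def cells_from_rules(v_xs, h_ys):
    """Return cell boxes (x0, top, x1, bottom) formed by sorted
    vertical rule x-positions and horizontal rule y-positions."""
    v = sorted(set(v_xs))
    h = sorted(set(h_ys))
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(h, h[1:])  # adjacent horizontal rules
        for x0, x1 in zip(v, v[1:])  # adjacent vertical rules
    ]

# A 2x2 grid drawn with three vertical and three horizontal rules:
print(cells_from_rules([0, 50, 100], [0, 20, 40]))
# → [(0, 0, 50, 20), (50, 0, 100, 20), (0, 20, 50, 40), (50, 20, 100, 40)]
```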
@jsvine @jsfenfen here's a little bit of code for you guys:
import sys

import cv2
import numpy as np


class detectTable(object):
    def __init__(self, src_img):
        self.src_img = src_img

    def run(self):
        if len(self.src_img.shape) == 2:  # already grayscale
            gray_img = self.src_img
        elif len(self.src_img.shape) == 3:
            gray_img = cv2.cvtColor(self.src_img, cv2.COLOR_BGR2GRAY)
        else:
            raise ValueError("unsupported image shape")

        # Invert and binarize so lines and text become white on black
        thresh_img = cv2.adaptiveThreshold(~gray_img, 255,
                                           cv2.ADAPTIVE_THRESH_MEAN_C,
                                           cv2.THRESH_BINARY, 15, -2)
        h_img = thresh_img.copy()
        v_img = thresh_img.copy()
        scale = 15

        # Erode then dilate with a wide kernel: only long horizontal runs survive
        h_size = int(h_img.shape[1] / scale)
        h_structure = cv2.getStructuringElement(cv2.MORPH_RECT, (h_size, 1))  # morphological kernel
        h_erode_img = cv2.erode(h_img, h_structure, iterations=1)
        h_dilate_img = cv2.dilate(h_erode_img, h_structure, iterations=1)
        # cv2.imshow("h_erode", h_dilate_img)

        # Same with a tall kernel: only long vertical runs survive
        v_size = int(v_img.shape[0] / scale)
        v_structure = cv2.getStructuringElement(cv2.MORPH_RECT, (1, v_size))  # morphological kernel
        v_erode_img = cv2.erode(v_img, v_structure, iterations=1)
        v_dilate_img = cv2.dilate(v_erode_img, v_structure, iterations=1)

        # Union of the two masks is the table grid; intersection is the joints
        # (bitwise_or instead of + avoids uint8 overflow)
        mask_img = cv2.bitwise_or(h_dilate_img, v_dilate_img)
        joints_img = cv2.bitwise_and(h_dilate_img, v_dilate_img)
        joints_img = cv2.dilate(joints_img, None, iterations=3)
        cv2.imwrite("joints.png", ~joints_img)
        cv2.imwrite("mask.png", ~mask_img)


if __name__ == '__main__':
    img = cv2.imread(sys.argv[1])
    detectTable(img).run()
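A possible follow-on step (my own sketch, not part of the snippet above): once the joints image marks the line intersections, the table grid can be recovered by clustering the joint coordinates into rows and columns. Here is a NumPy-only stand-in using a tiny synthetic joints mask:

```python
import numpy as np

def grid_from_joints(joints, tolerance=2):
    """Cluster the row/column coordinates of nonzero joint pixels,
    merging coordinates that are within `tolerance` of each other."""
    ys, xs = np.nonzero(joints)

    def cluster(vals):
        vals = np.unique(vals)
        groups, cur = [], [vals[0]]
        for v in vals[1:]:
            if v - cur[-1] <= tolerance:
                cur.append(v)          # same rule, thick or noisy
            else:
                groups.append(int(np.mean(cur)))
                cur = [v]
        groups.append(int(np.mean(cur)))
        return groups

    return cluster(ys), cluster(xs)

# Synthetic joints mask: a 3x3 grid of intersections
joints = np.zeros((30, 30), dtype=np.uint8)
for y in (5, 15, 25):
    for x in (4, 14, 24):
        joints[y, x] = 255

print(grid_from_joints(joints))  # → ([5, 15, 25], [4, 14, 24])
```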
For reference, a note in Anssi Nurminen's master's thesis: "An edge in an image is defined as an above-threshold change in intensity value of neighboring pixels. Choosing a threshold value too high, some of the more subtle visual aids on a page will not be detected, while a threshold value too low can result in a lot of erroneously interpreted edges." And: "The edge detection process is divided into four distinct steps that are described in more detail in the following chapters:"
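The quoted definition can be illustrated in a few lines (my sketch, not Nurminen's code): an edge is an above-threshold intensity change between neighboring pixels, so the choice of threshold trades off detecting subtle rules against picking up noise.

```python
import numpy as np

def horizontal_edges(gray, threshold):
    """Mark pixels whose intensity differs from the right-hand
    neighbor by more than `threshold`."""
    diff = np.abs(np.diff(gray.astype(np.int16), axis=1))
    return diff > threshold

# One image row: white, a dark rule, then white again
row = np.array([[200, 200, 40, 40, 190, 185]], dtype=np.uint8)
print(horizontal_edges(row, 100))  # detects the 200→40 and 40→190 jumps
print(horizontal_edges(row, 200))  # threshold too high: no edges at all
```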