jcushman / pdfquery

A fast and friendly PDF scraping library.
MIT License

use two or more consecutive 'in_bbox' #43

Closed charlessachet closed 8 years ago

charlessachet commented 8 years ago

Hi guys! o/ I'd like to know if there is a way to run one 'in_bbox' query and then run another 'in_bbox' query on its result. For example:

first_bbox = pdf_query.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))
second_bbox = first_bbox.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))

charlessachet commented 8 years ago

I studied the code more closely and got the expected result with this:

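import pdfquery
from pdfquery.pdftranslator import PDFQueryTranslator
from pyquery import PyQuery

# pdf_query is assumed to be an already-loaded pdfquery.PDFQuery instance.

def get_pyquery(tree):
    # Wrap the matched elements in a PyQuery that uses pdfquery's translator,
    # so custom selectors such as :in_bbox keep working on the result.
    return PyQuery(tree, css_translator=PDFQueryTranslator())

# First query: text lines inside the first box.
first_bbox = pdf_query.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))
# Re-wrap the result so a second :in_bbox can be applied to it.
info_frame = get_pyquery(first_bbox)
second_bbox = info_frame('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1))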

Simple, I know... haha. Anyway, thanks for sharing the project :))
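For anyone who only needs the same element type filtered by both boxes, a possible alternative is to intersect the two boxes first and run a single 'in_bbox' query. This is a rough sketch, not part of pdfquery: `intersect_bbox` and `in_bbox_selector` are hypothetical helpers, the coordinates are made up, and it assumes 'in_bbox' matches only elements that fit entirely inside the given box (in which case fitting inside both boxes is the same as fitting inside their overlap). It does not cover the case where the second query is meant to reach children of the first result.

```python
def intersect_bbox(a, b):
    """Return the overlap of two (x0, y0, x1, y1) boxes, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x0 > x1 or y0 > y1:
        return None
    return (x0, y0, x1, y1)


def in_bbox_selector(tag, bbox):
    """Build a selector string in the same form used earlier in this thread."""
    return '%s:in_bbox("%s, %s, %s, %s")' % ((tag,) + tuple(bbox))


# Made-up coordinates, purely for illustration.
outer = (0, 0, 300, 400)
inner = (100, 200, 500, 600)

combined = intersect_bbox(outer, inner)
selector = in_bbox_selector("LTTextLineHorizontal", combined)
```

The resulting string could then be passed to `pdf_query.pq(selector)` in one call. If the boxes do not overlap, `intersect_bbox` returns None and no query is needed.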