kanzure / pdfparanoia

pdf watermark removal library for academic papers
https://pypi.python.org/pypi/pdfparanoia

[Q] When you can't match text content..? #39

Closed: fmap closed this issue 10 years ago

fmap commented 10 years ago

What's the procedure for identifying symbol-referenced (?) text within documents? Matching the content text isn't working in this case:

>>> doc = pdfparanoia.parser.parse_content(open("tests/samples/jstor/231a515256115368c142f528cee7f727.pdf","rb").read())
>>> for id in doc.xrefs[0].get_objids():
...   try:
...     if ("Accessed" in (doc.getobj(id).get_data())): print id
...   except: continue
...
>>>

...but the delimiters seem to have been identified before, for the same sample:

# "Accessed on dd/mm/yyyy hh:mm"
#
# the "Accessed" line is only on the first page
#
# it's based on /F2
#
# This would be better if it could be decoded to
# actually search for the "Accessed" text.
if page_id == 0 and "/F2 11 Tf\n" in better_content:
    startpos = better_content.rfind("/F2 11 Tf\n")
    endpos = better_content.find("Tf\n", startpos+5)

    if verbose >= 2 and replacements:
        sys.stderr.write("%s: Found object %s with %r: %r; omitting..." % (cls.__name__, objid, cls.requirements, better_content[startpos:endpos]))

    better_content = better_content[0:startpos] + better_content[endpos:]

replacements.append([objid, better_content])

page_id += 1

fmap commented 10 years ago

I think I see. Comparing with the structure of the page, the span from the last "/F2 11 Tf\n" to the next "Tf\n" delimits the region between the URL line and the T&C message:

>>> doc.getobj(19).get_data()
'q\n\nq\nBT\n36 806 Td\nET\nQ\nq\n0 0 0 RG\n/P <</MCID 0>> BDC\nq\n0 0 0 RG\n/Figure <</MCID 0>> BDC\nq 220 0 0 91 50 671 cm /img0 Do Q\nQ\nEMC\nBT\n1 0 0 1 50 643 Tm\n/F1 12 Tf\n()Tj\nET\n0.5 w\n50 633 m\n562 633 l\nS\nBT\n1 0 0 1 50 610 Tm\n/F2 11 Tf\n(\x000\x00H\x00P\x00R\x00L\x00U\x00\x0f\x00\x03\x006\x00R\x00F\x00L\x00D\x00O\x00\x03\x00+\x00L\x00V\x00W\x00R\x00U\x00\\\\\x00\x03\x00D\x00Q\x00G\x00\x03\x00&\x00R\x00P\x00P\x00L\x00W\x00P\x00H\x00Q\x00W\x00\x1d\x00\x03\x00\\(\x00U\x00L\x00F\x00\x03\x00+\x00R\x00E\x00V\x00E\x00D\x00Z\x00P\x00\\n\x00V\x00\x03\x00\x05\x00,\x00Q\x00W\x00H\x00U\x00H\x00V\x00W\x00L\x00Q\x00J\x00\x03\x007\x00L\x00P\x00H\x00V\x00\x05)Tj\nET\nBT\n1 0 0 1 50 597 Tm\n/F2 11 Tf\n(\x00$\x00X\x00W\x00K\x00R\x00U\x00\x0b\x00V\x00\\f\x00\x1d\x00\x03)Tj\n(\x00-\x00D\x00P\x00H\x00V\x00\x03\x00\\(\x00\x11\x00\x03\x00&\x00U\x00R\x00Q\x00L\x00Q)Tj\nET\nBT\n1 0 0 1 49 584 Tm\n/F2 11 Tf\n(\x005\x00H\x00Y\x00L\x00H\x00Z\x00H\x00G\x00\x03\x00Z\x00R\x00U\x00N\x00\x0b\x00V\x00\\f\x00\x1d)Tj\nET\nBT\n1 0 0 1 50 571 Tm\n/F2 11 Tf\n(\x006\x00R\x00X\x00U\x00F\x00H\x00\x1d\x00\x03)Tj\n1 0 0.21256 1 91.76 571 Tm\n(\x00-\x00R\x00X\x00U\x00Q\x00D\x00O\x00\x03\x00R\x00I\x00\x03\x006\x00R\x00F\x00L\x00D\x00O\x00\x03\x00+\x00L\x00V\x00W\x00R\x00U\x00\\\\\x00\x0f\x00\x03)Tj\n1 0 0 1 236.67 571 Tm\n(\x009\x00R\x00O\x00\x11\x00\x03\x00\x16\x00\x1a\x00\x0f\x00\x03\x001\x00R\x00\x11\x00\x03\x00\x14\x00\x0f\x00\x03\x006\x00S\x00H\x00F\x00L\x00D\x00O\x00\x03\x00,\x00V\x00V\x00X\x00H\x00\x03\x00\x0b\x00$\x00X\x00W\x00X\x00P\x00Q\x00\x0f\x00\x03\x00\x15\x00\x13\x00\x13\x00\x16\x00\\f\x00\x0f\x00\x03\x00S\x00S\x00\x11\x00\x03\x00\x15\x00\x14\x00\x1c\x00\x10\x00\x15\x00\x16\x00\x14)Tj\n-186.67 0 Td\nET\n0 0 1 RG\n0.73333 w\n126.14 554.33 m\n233.07 554.33 l\nS\n0 G\n1 w\nBT\n1 0 0 1 50 558 Tm\n/F2 11 Tf\n(\x003\x00X\x00E\x00O\x00L\x00V\x00K\x00H\x00G\x00\x03\x00E\x00\\\\\x00\x1d\x00\x03)Tj\n/F3 11 Tf\n0 0 1 rg\n(Oxford University Press)Tj\n0 g\nET\n0 0 1 RG\n0.73333 w\n115.27 
541.33 m\n275.39 541.33 l\nS\n0 G\n1 w\n1 1 1 rg\n275.39 542.61 5.5 9.9 re\nf\n0 g\nBT\n1 0 0 1 50 545 Tm\n/F2 11 Tf\n(\x006\x00W\x00D\x00E\x00O\x00H\x00\x03\x008\x005\x00/\x00\x1d\x00\x03)Tj\n/F3 11 Tf\n0 0 1 rg\n(http://www.jstor.org/stable/3790325)Tj\n0 g\n1 1 1 rg\n( .)Tj\n0 g\nET\nBT\n1 0 0 1 50 529 Tm\n/F2 11 Tf\n(\x00$\x00F\x00F\x00H\x00V\x00V\x00H\x00G\x00\x1d\x00\x03\x00\x13\x00\x14\x00\x12\x00\x13\x00\x15\x00\x12\x00\x15\x00\x13\x00\x14\x00\x16\x00\x03\x00\x14\x00\x1b\x00\x1d\x00\x18\x00\x15)Tj\nET\n0.5 w\n50 519 m\n562 519 l\nS\n1 1 1 rg\n471.03 494.61 5.5 9.9 re\nf\n0 g\n0 0 1 RG\n0.66667 w\n50 481.67 m\n270.28 481.67 l\nS\n0 G\n1 w\n1 1 1 rg\n50 458.61 5.5 9.9 re\nf\n0 g\nBT\n1 0 0 1 50 497 Tm\n/F3 10 Tf\n(Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at)Tj\n/F3 11 Tf\n1 1 1 rg\n( .)Tj\n0 g\n1 0 0 1 50 485 Tm\n/F3 10 Tf\n0 0 1 rg\n(http://www.jstor.org/page/info/about/policies/terms.jsp)Tj\n0 g\n1 0 0 1 50 473 Tm\n()Tj\n1 0 0 1 50 461 Tm\n/F3 11 Tf\n1 1 1 rg\n( .)Tj\n0 g\nET\n1 1 1 rg\n50 398.61 5.5 9.9 re\nf\n0 g\nBT\n1 0 0 1 50 449 Tm\n/F3 10 Tf\n(JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of)Tj\n1 0 0 1 50 437 Tm\n(content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms)Tj\n1 0 0 1 50 425 Tm\n(of scholarship. 
For more information about JSTOR, please contact support@jstor.org.)Tj\n1 0 0 1 50 413 Tm\n()Tj\n1 0 0 1 50 401 Tm\n/F3 11 Tf\n1 1 1 rg\n( .)Tj\n0 g\nET\nq\n0 0 0 RG\n/Figure <</MCID 0>> BDC\nq 60 0 0 65.59 50 50 cm /img1 Do Q\nQ\nEMC\nBT\n1 0 0 1 115 105 Tm\n/F4 10 Tf\n(Oxford University Press)Tj\n/F3 10 Tf\n( is collaborating with JSTOR to digitize, preserve and extend access to )Tj\n/F4 10 Tf\n(Journal of)Tj\n1 0 0 1 115 95 Tm\n(Social History.)Tj\nET\nBT\n0 Tr\n/F3 10 Tf\n1 0 0 1 50 40 Tm\n(http://www.jstor.org )Tj\nET\nQ\nEMC\n\n Q\nq\nq\n1 1 1 rg\n0 -36 595 36 re\nf\nQ\nq\n2 J\n0 G\nQ\n0 0 1 RG\n0.53333 w\n278.05 -26.67 m\n374.72 -26.67 l\nS\n0 G\n1 w\nBT\n1 0 0 1 0 -8 Tm\n/Xi0 8 Tf\n()Tj\n1 0 0 1 203.39 -16 Tm\n(This content downloaded  on Fri, 1 Feb 2013 18:52:40 PM)Tj\n1 0 0 1 220.28 -24 Tm\n(All use subject to )Tj\n0 0 1 rg\n(JSTOR Terms and Conditions)Tj\n0 g\nET\nQ\n'
>>>
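
For anyone landing here later: the /F2 strings in that dump look like 2-byte glyph codes rather than plain text, which would explain why searching for "Accessed" finds nothing. In this sample, each code appears to sit 29 below the ASCII code point (for example `\x00$` is 0x24, and 0x24 + 29 = 0x41, "A"). Below is a minimal sketch under that assumption. The offset is read off this one dump, and `decode_glyphs` and `encode_needle` are hypothetical helpers, not part of pdfparanoia.

```python
# Assumption taken from this sample only: each glyph is a 2-byte big-endian
# code, and code + 29 equals the ASCII code point. The PDF format does not
# guarantee this; other fonts or documents may use a different mapping.
GLYPH_OFFSET = 29


def decode_glyphs(raw: bytes, offset: int = GLYPH_OFFSET) -> str:
    """Decode 2-byte big-endian glyph codes to text by adding a fixed offset."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    codes = (int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2))
    return "".join(chr(code + offset) for code in codes)


def encode_needle(text: str, offset: int = GLYPH_OFFSET) -> bytes:
    """Encode plain text the same way, so it can be searched for in raw stream data."""
    return b"".join((ord(ch) - offset).to_bytes(2, "big") for ch in text)


# Operand of the Tj operator on the last /F2 line of object 19 above.
sample = (
    b"\x00$\x00F\x00F\x00H\x00V\x00V\x00H\x00G\x00\x1d\x00\x03"
    b"\x00\x13\x00\x14\x00\x12\x00\x13\x00\x15\x00\x12\x00\x15\x00\x13\x00\x14\x00\x16"
    b"\x00\x03\x00\x14\x00\x1b\x00\x1d\x00\x18\x00\x15"
)

print(decode_glyphs(sample))                # Accessed: 01/02/2013 18:52
print(encode_needle("Accessed") in sample)  # True
```

Two caveats. The content stream escapes backslashes and parentheses inside literal strings (see the `\\\\` and `\\(` runs in the dump), so those escapes would need to be undone before decoding. And a fixed offset is only a shortcut: the more general route is probably the font's /ToUnicode CMap, if the document includes one.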