jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

Not all authors are listed #181

Closed Ceasea closed 11 months ago

Ceasea commented 11 months ago

Hi, thanks for the excellent lib.

I am currently extracting metadata from a PDF. I can extract only one author, but the PDF file lists three.

I have also tried other libraries (no offense) and the extracted results are the same: only one author name.

output: filename and pdf.metadata['author']
In-situ synthesis and chemical bonding of the Al-doped β-SiC particles in Al-Si-C light alloys.pdf Xiaofan Du

I know this may be caused by the PDF file rather than the library. However, I've checked the PDF's properties, which do list the three authors.

This problem really confuses me. I uploaded the pdf file and I hope I can get some advice from you. Thank you very much.

1-s2.0-S2211379722007082-main.pdf

jorisschellekens commented 11 months ago

The PDF standard defines two ways to store metadata in a PDF document. Your PDF uses both of them, and they are not in sync.

This is the /Info dictionary of your PDF (I reformatted it a bit for clarity):

611 0 obj
<< /Creator (Elsevier) 
/CrossMarkDomains#5B1#5D (elsevier.com) 
/CrossmarkMajorVersionDate (2010-04-23) 
/CreationDate--Text (4th December 2022) 
/ElsevierWebPDFSpecifications (7.0) 
/robots (noindex) 
/ModDate (D:20221204102501Z) 
/Author (Xiaofan Du) 
/doi (10.1016/j.rinp.2022.106094) 
/Title (þÿ I n - s i t u   s y n t h e s i s   a n d   c h e m i c a l   b o n d i n g   o f   t h e   A l - d o p e d   ² - S i C   p a r t i c l e s   i n   A l - S i - C   l i g h t   a l l o y s) 
/Keywords (SiC crystal structure,Aluminum doped,Chemical bonding,First-principles calculations,Mechanical properties) 
/CreationDate (D:20221204102326Z) 
/Producer (Acrobat Distiller 8.1.0 \(Windows\)) 
/Subject (Results in Physics, 43 \(2022\) 106094. doi:10.1016/j.rinp.2022.106094) 
/CrossMarkDomains#5B2#5D (sciencedirect.com) 
/CrossmarkDomainExclusive (true)
>>
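
A side note on the /Title entry above: the leading þÿ and the spaced-out letters are what a UTF-16BE string looks like when its bytes are displayed one character at a time. PDF text strings that begin with the bytes FE FF are UTF-16BE encoded. Below is a minimal sketch of decoding such a string. The helper name `decode_pdf_text_string` and the sample bytes are assumptions made for illustration; they are not part of borb, and the fallback branch simplifies PDFDocEncoding to Latin-1.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string, honouring a leading UTF-16BE byte-order mark."""
    if raw.startswith(b"\xfe\xff"):
        # FE FF marks the rest of the string as UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Assumption: treat everything else as Latin-1 (a simplification).
    return raw.decode("latin-1")


# Hypothetical sample bytes built here for illustration, not read from the PDF.
sample = b"\xfe\xff" + "Al-doped β-SiC".encode("utf-16-be")

# How the raw bytes look when each byte is shown as one character.
print(repr(sample.decode("latin-1")))
# The text the bytes are meant to represent.
print(decode_pdf_text_string(sample))
```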

This /Info dictionary lists 1 author:

/Author (Xiaofan Du) 

The second metadata carrier is the so-called XMP (Extensible Metadata Platform) packet:

614 0 obj
<< /Length 5676 /Subtype /XML /Type /Metadata
>>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 5.1.2">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:ali="http://www.niso.org/schemas/ali/1.0/">
         <ali:license_ref>
            <rdf:Bag>
               <rdf:li rdf:parseType="Resource">
                  <ali:uri>http://creativecommons.org/licenses/by-nc-nd/4.0/</ali:uri>
               </rdf:li>
            </rdf:Bag>
         </ali:license_ref>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:crossmark="http://crossref.org/crossmark/1.0/">
         <crossmark:CrossMarkDomains>
            <rdf:Seq>
               <rdf:li>elsevier.com</rdf:li>
               <rdf:li>sciencedirect.com</rdf:li>
            </rdf:Seq>
         </crossmark:CrossMarkDomains>
         <crossmark:CrossmarkDomainExclusive>true</crossmark:CrossmarkDomainExclusive>
         <crossmark:DOI>10.1016/j.rinp.2022.106094</crossmark:DOI>
         <crossmark:MajorVersionDate>2010-04-23</crossmark:MajorVersionDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:identifier>10.1016/j.rinp.2022.106094</dc:identifier>
         <dc:publisher>
            <rdf:Bag>
               <rdf:li>Elsevier B.V.</rdf:li>
            </rdf:Bag>
         </dc:publisher>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Results in Physics, 43 (2022) 106094. doi:10.1016/j.rinp.2022.106094</rdf:li>
            </rdf:Alt>
         </dc:description>
         <dc:subject>
            <rdf:Bag>
               <rdf:li>SiC crystal structure</rdf:li>
               <rdf:li>Aluminum doped</rdf:li>
               <rdf:li>Chemical bonding</rdf:li>
               <rdf:li>First-principles calculations</rdf:li>
               <rdf:li>Mechanical properties</rdf:li>
            </rdf:Bag>
         </dc:subject>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">In-situ synthesis and chemical bonding of the Al-doped β-SiC particles in Al-Si-C light alloys</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>Xiaofan Du</rdf:li>
               <rdf:li>Zhao Qian</rdf:li>
               <rdf:li>Xiangfa Liu</rdf:li>
            </rdf:Seq>
         </dc:creator>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:jav="http://www.niso.org/schemas/jav/1.0/">
         <jav:journal_article_version>VoR</jav:journal_article_version>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:CreationDate--Text>4th December 2022</pdf:CreationDate--Text>
         <pdf:Producer>Acrobat Distiller 8.1.0 (Windows)</pdf:Producer>
         <pdf:Keywords>SiC crystal structure,Aluminum doped,Chemical bonding,First-principles calculations,Mechanical properties</pdf:Keywords>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
         <pdfx:CreationDate--Text>4th December 2022</pdfx:CreationDate--Text>
         <pdfx:CrossMarkDomains>
            <rdf:Seq>
               <rdf:li>sciencedirect.com</rdf:li>
               <rdf:li>elsevier.com</rdf:li>
            </rdf:Seq>
         </pdfx:CrossMarkDomains>
         <pdfx:CrossmarkDomainExclusive>true</pdfx:CrossmarkDomainExclusive>
         <pdfx:CrossmarkMajorVersionDate>2010-04-23</pdfx:CrossmarkMajorVersionDate>
         <ZlkjsyMiJnMmKoweGz9z8ysNNywmPlt6OowmGzdaLyPmKowz-ndn.o9ePot6Pnd6SmtuTma/>
         <pdfx:doi>10.1016/j.rinp.2022.106094</pdfx:doi>
         <pdfx:robots>noindex</pdfx:robots>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:prism="http://prismstandard.org/namespaces/basic/3.0/">
         <prism:aggregationType>journal</prism:aggregationType>
         <prism:copyright>© 2022 The Author(s). Published by Elsevier B.V.</prism:copyright>
         <prism:coverDate>2022-12-01</prism:coverDate>
         <prism:coverDisplayDate>1 December 2022</prism:coverDisplayDate>
         <prism:doi>10.1016/j.rinp.2022.106094</prism:doi>
         <prism:issn>2211-3797</prism:issn>
         <prism:pageRange>106094</prism:pageRange>
         <prism:publicationName>Results in Physics</prism:publicationName>
         <prism:startingPage>106094</prism:startingPage>
         <prism:url>https://doi.org/10.1016/j.rinp.2022.106094</prism:url>
         <prism:volume>43</prism:volume>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:CreateDate>2022-12-04T10:23:26</xmp:CreateDate>
         <xmp:CreatorTool>Elsevier</xmp:CreatorTool>
         <xmp:MetadataDate>2022-12-04T10:25:01</xmp:MetadataDate>
         <xmp:ModifyDate>2022-12-04T10:25:01</xmp:ModifyDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <xmpMM:DocumentID>uuid:1b499bed-4ce8-4c0c-b682-0d58baae1cbe</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:b6ed5266-03cb-4855-90c0-636283357f42</xmpMM:InstanceID>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
         <xmpRights:Marked>True</xmpRights:Marked>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

<?xpacket end="w"?>
endstream
endobj

Here we have 3 authors listed:

         <dc:creator>
            <rdf:Seq>
               <rdf:li>Xiaofan Du</rdf:li>
               <rdf:li>Zhao Qian</rdf:li>
               <rdf:li>Xiangfa Liu</rdf:li>
            </rdf:Seq>
         </dc:creator>
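
As a library-independent illustration, the `dc:creator` sequence can be read with Python's standard `xml.etree.ElementTree`. This is only a sketch: the XML string is a trimmed copy of the packet above, and the function name `read_xmp_creators` is a hypothetical helper, not a borb API.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace URIs as declared in the XMP packet.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Trimmed copy of the packet, keeping only the dc:creator part.
XMP_SNIPPET = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>
        <rdf:Seq>
          <rdf:li>Xiaofan Du</rdf:li>
          <rdf:li>Zhao Qian</rdf:li>
          <rdf:li>Xiangfa Liu</rdf:li>
        </rdf:Seq>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_xmp_creators(xmp_xml: str) -> List[str]:
    """Return every rdf:li entry found under dc:creator, in document order."""
    root = ET.fromstring(xmp_xml)
    items = root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)
    return [li.text.strip() for li in items if li.text]


print(read_xmp_creators(XMP_SNIPPET))
```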

By the way, you can extract XMP metadata using borb. Check the examples.

In short, your PDF is kind of "broken". It contains conflicting information about the author, which means a library will give you either one value or the other.
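
If you end up with both values in hand, you have to pick a policy yourself. Here is one possible sketch, assuming you prefer the XMP list when it is present and fall back to the /Info author otherwise. The function `resolve_authors` is hypothetical and the preference order is a design choice, not something the PDF standard or borb prescribes.

```python
from typing import List, Optional


def resolve_authors(info_author: Optional[str], xmp_creator_list: List[str]) -> List[str]:
    """Prefer the XMP dc:creator entries; fall back to the single /Info /Author value."""
    if xmp_creator_list:
        return list(xmp_creator_list)
    return [info_author] if info_author else []


# Values taken from the two metadata carriers shown above.
print(resolve_authors("Xiaofan Du", ["Xiaofan Du", "Zhao Qian", "Xiangfa Liu"]))
# With no XMP creators, the /Info value is used.
print(resolve_authors("Xiaofan Du", []))
```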

Ceasea commented 11 months ago

Hi, thanks for the explanation.

I have tried to extract XMP meta information using borb as you suggested.

The result remains the same. @jorisschellekens

def test():
    import typing
    from borb.pdf import Document
    from borb.pdf import PDF
    doc: typing.Optional[Document] = None
    filename = '1-s2.0-S2211379722007082-main.pdf'
    with open(filename, 'rb') as f:
        doc = PDF.loads(f)
    print(" id %s " % doc.get_xmp_document_info().get_document_id())
    print(" authors %s " % doc.get_xmp_document_info().get_author())
    print(" creator %s " % doc.get_xmp_document_info().get_creator())

test()

output:
 id uuid:1b499bed-4ce8-4c0c-b682-0d58baae1cbe 
 authors Xiaofan Du 
 creator None 

python version: 3.10
borb version: 2.1.18
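
For cross-checking what the file itself contains, independent of any PDF library, one rough approach is to search the raw bytes for the `<x:xmpmeta>` element. This sketch assumes the metadata stream is stored uncompressed (the dump of object 614 above shows no /Filter entry, but that is not guaranteed for other files). The name `find_xmp_packet` and the fake bytes are hypothetical, for illustration only.

```python
import re
from typing import Optional

# Non-greedy match from the opening x:xmpmeta tag to its closing tag.
XMP_PACKET = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def find_xmp_packet(pdf_bytes: bytes) -> Optional[bytes]:
    """Return the first XMP packet found in the raw bytes, or None if absent."""
    match = XMP_PACKET.search(pdf_bytes)
    return match.group(0) if match else None


# Fake, minimal stand-in for a PDF file's bytes (not a valid PDF).
fake_pdf = (
    b"%PDF-1.7\n614 0 obj\n<< /Subtype /XML /Type /Metadata >>\nstream\n"
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:li>Zhao Qian</rdf:li></x:xmpmeta>\n'
    b"endstream\nendobj\n"
)

packet = find_xmp_packet(fake_pdf)
print(packet.decode("utf-8") if packet else "no XMP packet found")
```

With a real file, the bytes would come from `open(filename, 'rb').read()`, and the returned packet can then be handed to any XML parser.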