metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

added sorting option by using list #59

Open Masterjx9 opened 1 year ago

Masterjx9 commented 1 year ago

Whenever I used references = pdf.get_references_as_dict(sort=True), it failed with:

  File "C:\Users\user\Scripts\PDFx\test.py", line 9, in <module>
    references = pdf.get_references_as_dict(sort=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\user\scripts\pdfx\pdfx\pdfx\__init__.py", line 168, in get_references_as_dict
    return self.reader.get_references_as_dict(reftype=reftype, sort=sort)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\user\scripts\pdfx\pdfx\pdfx\backends.py", line 177, in get_references_as_dict
    for r in sorted(refs) if sort else refs:

This makes sense because a set has no order of its own, and sorted() can only order its elements if they support comparison. However, you can ensure that the data stays ordered by using a list from the beginning. Thus I added a parameter to PDFx, called references_data_structure, that lets you choose between a set and a list. backends.py has been updated to handle both cases. Also, the sort param has been removed from the get_references_as_dict and get_references functions.

It can be used like this: pdf = pdfx.PDFx("test.pdf", references_data_structure="list")
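To illustrate the set-versus-list trade-off outside of pdfx, here is a minimal sketch. Ref and collect are hypothetical stand-ins, not pdfx code. The sketch assumes that reference objects are hashable but define no ordering, which is one plausible reason the sorted() call in the traceback fails. It also de-duplicates the list, which is my own choice and not necessarily what this PR does.

```python
class Ref:
    """Hypothetical stand-in for a reference object: hashable, but with no ordering."""

    def __init__(self, ref):
        self.ref = ref

    def __hash__(self):
        return hash(self.ref)

    def __eq__(self, other):
        return isinstance(other, Ref) and self.ref == other.ref

    def __repr__(self):
        return f"Ref({self.ref!r})"


def collect(urls, data_structure="set"):
    """Collect references into a set (unordered) or a list (order of first appearance)."""
    if data_structure == "set":
        return {Ref(u) for u in urls}
    if data_structure == "list":
        refs = []
        for u in urls:
            r = Ref(u)
            if r not in refs:  # assumption: keep only the first occurrence
                refs.append(r)
        return refs
    raise ValueError(f"unsupported data structure: {data_structure!r}")


urls = [
    "https://b.example/2.pdf",
    "https://a.example/1.pdf",
    "https://b.example/2.pdf",
]

as_list = collect(urls, "list")  # order of first appearance is preserved
as_set = collect(urls, "set")    # duplicates dropped, iteration order not guaranteed

try:
    sorted(as_set)               # Ref defines no __lt__, so comparison raises TypeError
    sort_failed = False
except TypeError:
    sort_failed = True

# An explicit key sorts either container without requiring Ref to be orderable.
by_key = sorted(as_set, key=lambda r: r.ref)
```

If alphabetical order is wanted rather than document order, passing a key function to sorted(), as in the last line, works for both containers.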