sciunto-org / python-bibtexparser

Bibtex parser for Python 3
https://bibtexparser.readthedocs.io
MIT License

pure text non latex results #352

Closed WolfgangFahl closed 1 year ago

WolfgangFahl commented 1 year ago

Currently I am doing:

doi=DOI(self.doi)
meta_bibtex=doi.fetchBibtexMeta()
bd=bibtexparser.loads(meta_bibtex)
btex=bd.entries[0]

using the DOI helper class below. I was hoping to simplify my life, since the Citeproc result looks quite complicated and I'd love to have some cleanup in e.g. authors and titles.

bibtexparser does a great job, but I don't want a LaTeX result, just plain text.

E.g. for 10.1145/800001.811672 I get

The structure of the {\\textquotedblleft}the{\\textquotedblright}-multiprogramming system

While the plain text

The structure of the "the"-multiprogramming system

would be better for my use case. Is this already possible with the current bibtexparser, or is it a feature request?

doi.py

'''
Created on 2023-02-12

@author: wf
'''
import urllib.request
import json
from dataclasses import dataclass

@dataclass
class DOI:
    """
    get DOI data
    """
    doi:str

    def fetchMeta(self,headers:dict)->str:
        """
        get the metadata for my doi

        Args:
            headers(dict): the headers to use

        Returns:
            str: the response body, in the format requested by the given headers
        """
        url=f"https://doi.org/{self.doi}"
        req=urllib.request.Request(url,headers=headers)
        # close the connection once the body has been read
        with urllib.request.urlopen(req) as response:
            encoding = response.headers.get_content_charset('utf-8')
            content = response.read()
        text = content.decode(encoding)
        return text

    def fetchBibtexMeta(self)->str:
        """
        get the meta data for my doi by getting the BibTeX
        result for the doi

        Returns:
            str: the BibTeX record as text
        """
        headers= {
            'Accept': 'application/x-bibtex; charset=utf-8'
        }
        text=self.fetchMeta(headers)
        return text

    def fetchCiteprocMeta(self)->dict:
        """
        get the meta data for my doi by getting the Citeproc JSON
        result for the doi

        see https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html

        Returns:
            dict: metadata
        """
        headers= {
            'Accept': 'application/vnd.citationstyles.csl+json; charset=utf-8'
        }
        text=self.fetchMeta(headers)
        json_data=json.loads(text)
        return json_data

MiWeiss commented 1 year ago

Hi @WolfgangFahl

You could use bibtexparser customizers to achieve much of what you're trying to do, but I guess you're much better off just using a LaTeX parser on the strings returned by bibtexparser (e.g. https://github.com/phfaist/pylatexenc/).

Bibtexparser v2 will actually leverage such a parser internally. However, there's still a long way to go before that's going to be released ;-)

I'm closing this issue as it does not require a code change, but feel free to add follow-up remarks...
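
For reference, the customizer route mentioned above could look roughly like the sketch below. It assumes bibtexparser v1, where `BibTexParser.customization` accepts a function that receives and returns each record dict, and that pylatexenc is installed. `make_plaintext_customizer` and the `TEXT_FIELDS` allow-list are hypothetical names for illustration, not part of either library.

```python
from typing import Callable, Dict, Iterable

# Assumed allow-list of fields that may carry LaTeX markup; adjust as needed.
TEXT_FIELDS = ("title", "booktitle", "author", "publisher")


def make_plaintext_customizer(
    convert: Callable[[str], str],
    fields: Iterable[str] = TEXT_FIELDS,
) -> Callable[[Dict[str, str]], Dict[str, str]]:
    """Build a record -> record function that applies `convert` to `fields`."""
    wanted = tuple(fields)

    def customize(record: Dict[str, str]) -> Dict[str, str]:
        # Only touch the listed fields; identifiers and URLs stay untouched.
        for name in wanted:
            if name in record:
                record[name] = convert(record[name])
        return record

    return customize


if __name__ == "__main__":
    try:
        import bibtexparser
        from bibtexparser.bparser import BibTexParser  # v1 API (assumption)
        from pylatexenc.latex2text import LatexNodes2Text
    except ImportError:
        print("this demo needs bibtexparser v1 and pylatexenc")
    else:
        parser = BibTexParser()
        parser.customization = make_plaintext_customizer(
            LatexNodes2Text().latex_to_text
        )
        sample = "@inproceedings{key, publisher = {{ACM} Press}}"
        entry = bibtexparser.loads(sample, parser=parser).entries[0]
        print(entry["publisher"])
```

Passing the converter in as a parameter keeps the hook independent of any particular LaTeX library, so a different converter can be swapped in later.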

WolfgangFahl commented 1 year ago

Indeed, your hint is correct. See also my question https://stackoverflow.com/questions/75426142/pure-text-non-latex-results-for-python-bibtex-parser

from pylatexenc.latex2text import LatexNodes2Text

ln2t=LatexNodes2Text()
for key in btex:
   latex=btex[key]
   no_latex=ln2t.latex_to_text(latex)
   btex[key]=no_latex

will convert the LaTeX dict entries back to text. If others need it, it might be added as a convenience function.

Example bibtex

@inproceedings{Dijkstra_1967,
    doi = {10.1145/800001.811672},
    url = {https://doi.org/10.1145%2F800001.811672},
    year = 1967,
    publisher = {{ACM} Press},
    author = {Edsger W. Dijkstra},
    title = {The structure of the {\textquotedblleft}the{\textquotedblright}-multiprogramming system},
    booktitle = {Proceedings of the {ACM} symposium on Operating System Principles  - {SOSP} {\textquotesingle}67}
}

dict with latex

{
  "booktitle": "Proceedings of the {ACM} symposium on Operating System Principles  - {SOSP} {\\textquotesingle}67",
  "title": "The structure of the {\\textquotedblleft}the{\\textquotedblright}-multiprogramming system",
  "author": "Edsger W. Dijkstra",
  "publisher": "{ACM} Press",
  "year": "1967",
  "url": "https://doi.org/10.1145%2F800001.811672",
  "doi": "10.1145/800001.811672",
  "ENTRYTYPE": "inproceedings",
  "ID": "Dijkstra_1967"
}

dict with plaintext (utf-8)

{
  "booktitle": "Proceedings of the ACM symposium on Operating System Principles  - SOSP '67",
  "title": "The structure of the \u201cthe\u201d-multiprogramming system",
  "author": "Edsger W. Dijkstra",
  "publisher": "ACM Press",
  "year": "1967",
  "url": "https://doi.org/10.1145",
  "doi": "10.1145/800001.811672",
  "ENTRYTYPE": "inproceedings",
  "ID": "Dijkstra_1967"
}
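
Note that the `url` value in the plain-text dict above is cut off at the `%`. The most likely cause is that LaTeX treats `%` as the start of a comment, so running the converter over every field also mangles values that were never LaTeX. A minimal sketch of one way around this is to skip such fields. The names `NON_LATEX_FIELDS` and `entry_to_plaintext` are hypothetical, the skip-list is an assumption, and the converter is passed in as a parameter:

```python
from typing import Callable, Dict, FrozenSet

# Assumed set of fields that hold identifiers or URLs rather than LaTeX markup.
NON_LATEX_FIELDS: FrozenSet[str] = frozenset({"url", "doi", "ID", "ENTRYTYPE"})


def entry_to_plaintext(
    entry: Dict[str, str],
    convert: Callable[[str], str],
    skip: FrozenSet[str] = NON_LATEX_FIELDS,
) -> Dict[str, str]:
    """Return a copy of `entry` with `convert` applied to every field not in `skip`."""
    return {
        key: value if key in skip else convert(value)
        for key, value in entry.items()
    }


# Intended use with pylatexenc (assumes it is installed):
#   from pylatexenc.latex2text import LatexNodes2Text
#   plain = entry_to_plaintext(btex, LatexNodes2Text().latex_to_text)
```

Unlike the in-place loop, this returns a new dict, so the original LaTeX entry remains available for writing back to a .bib file.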