mjpost / bibsearch

Download, manage, and search a BibTeX database.

incorrect reformatting of URLs #18

Closed mjpost closed 6 years ago

mjpost commented 6 years ago

For example, try importing this entry:

@inproceedings{knowles2016neural,
  title={Neural interactive translation prediction},
  author={Knowles, Rebecca and Koehn, Philipp},
  booktitle={Proceedings of the Association for Machine Translation in the Americas},
  pages={107--120},
  year={2016},
  url="http://www.cs.jhu.edu/~phi/publications/neural-interactive-translation.pdf",
}

After import, the ~ in the URL is replaced by \textasciitilde, so the URL becomes http://www.cs.jhu.edu/ \textasciitildephi/publications/ neural-interactive-translation.pdf.

davvil commented 6 years ago

It seems to originate from a strange behaviour in pybtex. Given this entry:

@InProceedings{blablublab,
  author    = {R. Costa-juss\`{a}, Marta  and  Crego, Josep M.  and  Vilar, David  and  R. Fonollosa, Jos\'{e} A.  and  Mari\~{n}o, Jos\'{e} B.  and  Ney, Hermann},
  title     = {Analysis and System Combination of Phrase- and {N}-Gram-Based Statistical Machine Translation Systems},
  booktitle = {Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers},
  month     = {April},
  year      = {2007},
  address   = {Rochester, New York},
  publisher = {Association for Computational Linguistics},
  pages     = {137--140},
  url       = {http://www.aclweb.org/anthology/N/N07/N07-2035}
}

Calling print(entry) right after pybtex has parsed it produces this:

Entry('inproceedings', fields=[('title', 'Analysis and System Combination of Phrase- and {N}-Gram-Based Statistical Machine Translation Systems'), ('booktitle', 'Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers'), ('month', 'April'), ('year', '2007'), ('address', 'Rochester, New York'), ('publisher', 'Association for Computational Linguistics'), ('pages', '137--140'), ('url', 'http://www.aclweb.org/anthology/N/N07/N07-2035')], persons=OrderedCaseInsensitiveDict([('author', [Person('R. Costa-juss\\`{a}, Marta'), Person('Crego, Josep M.'), Person('Vilar, David'), Person("R. Fonollosa, Jos\\'{e} A."), Person("Mari\\ {n}o, Jos\\'{e} B."), Person('Ney, Hermann')])]))

Note the \\ {n} in "Mariño": the ~ of \~{n} has been turned into a space. I am not sure where the \textasciitilde itself is generated.

davvil commented 6 years ago

Related? https://bitbucket.org/pybtex-devs/pybtex/issues/64/bug-tilde-in-is-parsed-as-space

davvil commented 6 years ago

Maybe we can use latexcodec directly? http://latexcodec.readthedocs.io/en/latest/

If I understand it correctly, this would supersede the conversion functionality we have in bibutils.

davvil commented 6 years ago

I tinkered with this a bit and was able to solve the Mariño issue by passing the string through latexcodec first. A PR will follow after some more testing.

The URL issue remained, however. After thinking about it for a while, though, I realized that it is actually doing (nearly) the right thing! BibTeX entries are supposed to be consumed by LaTeX, and if you want a ~ in a LaTeX document you have to escape it (e.g. \textasciitilde). What is actually missing is a {} after the command (without it, LaTeX reads \textasciitildephi as a single unknown command), which is unfortunate, but we can probably do some ad-hoc correction. And of course we should take this into account for the open command.
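To make the ad-hoc correction concrete, here is a minimal sketch of what it could look like. The helper name terminate_tilde_command is hypothetical and is not part of bibsearch, pybtex, or latexcodec. It only assumes that the escaped field contains \textasciitilde immediately followed by a letter, as in the URL above.

```python
import re

# Hypothetical helper for illustration; not bibsearch's actual code.
# Matches \textasciitilde only when a letter follows directly, i.e. when
# TeX would read the command and the following letters as one control word.
_UNTERMINATED_TILDE = re.compile(r"\\textasciitilde(?=[A-Za-z])")


def terminate_tilde_command(value: str) -> str:
    """Append {} to every \\textasciitilde that runs into a letter."""
    return _UNTERMINATED_TILDE.sub(lambda m: m.group(0) + "{}", value)


escaped = r"http://www.cs.jhu.edu/\textasciitildephi/publications/"
print(terminate_tilde_command(escaped))
# prints http://www.cs.jhu.edu/\textasciitilde{}phi/publications/
```

Because the lookahead requires a letter, an already terminated \textasciitilde{} is left alone, so running the correction twice should be harmless.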

mjpost commented 6 years ago

We could also just do this as post-processing, e.g., run the LaTeX decoder before we output strings in the result formatting. I have a fix for this and will push it once I have internet access from my computer (it’s semi-urgent because it escapes _ in URLs, which breaks “open”).
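As a rough illustration of the post-processing idea (not the actual fix mentioned above), the sketch below undoes a few LaTeX escapes in a URL field before it is handed to something like “open”. The function name plain_url, the small set of escapes it handles, and the example URL with an underscore are assumptions made for this example; a real implementation would more likely rely on a full LaTeX decoder such as latexcodec.

```python
import re


def plain_url(field: str) -> str:
    """Undo a few common LaTeX escapes so the URL can be opened as-is.

    Hypothetical helper: it covers only \\textasciitilde (with or without
    a trailing {}), \\~{} and backslash-escaped _ % # & characters.
    """
    url = re.sub(r"\\textasciitilde(?:\{\})?", "~", field)
    url = re.sub(r"\\~\{\}", "~", url)
    url = re.sub(r"\\([_%#&])", r"\1", url)
    return url


# Made-up URL combining both problems from this thread: ~ and _.
escaped = r"http://www.cs.jhu.edu/\textasciitildephi/neural\_mt.pdf"
print(plain_url(escaped))
# prints http://www.cs.jhu.edu/~phi/neural_mt.pdf
```

Keeping the stored entry escaped and decoding only at output time means the database stays valid for LaTeX while commands that need the raw URL get an unescaped copy.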