mjpost closed this issue 6 years ago
It seems to originate from a strange behaviour in pybtex. Given this entry:
@InProceedings{blablublab,
author = {R. Costa-juss\`{a}, Marta and Crego, Josep M. and Vilar, David and R. Fonollosa, Jos\'{e} A. and Mari\~{n}o, Jos\'{e} B. and Ney, Hermann},
title = {Analysis and System Combination of Phrase- and {N}-Gram-Based Statistical Machine Translation Systems},
booktitle = {Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers},
month = {April},
year = {2007},
address = {Rochester, New York},
publisher = {Association for Computational Linguistics},
pages = {137--140},
url = {http://www.aclweb.org/anthology/N/N07/N07-2035}
}
A print(entry) just after parsing by pybtex produces this:
Entry('inproceedings', fields=[('title', 'Analysis and System Combination of Phrase- and {N}-Gram-Based Statistical Machine Translation Systems'), ('booktitle', 'Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers'), ('month', 'April'), ('year', '2007'), ('address', 'Rochester, New York'), ('publisher', 'Association for Computational Linguistics'), ('pages', '137--140'), ('url', 'http://www.aclweb.org/anthology/N/N07/N07-2035')], persons=OrderedCaseInsensitiveDict([('author', [Person('R. Costa-juss\\`{a}, Marta'), Person('Crego, Josep M.'), Person('Vilar, David'), Person("R. Fonollosa, Jos\\'{e} A."), Person("Mari\\ {n}o, Jos\\'{e} B."), Person('Ney, Hermann')])]))
Note the \\ {n} for "Mariño". I am not sure where the asciitilde itself is generated.
Maybe we can use latexcodec directly? http://latexcodec.readthedocs.io/en/latest/
If I understand it correctly, this would supersede the conversion functionality we have in bibutils.
I was tinkering a little bit with this and I was able to solve the issue with Mariño by passing the string first through latexcodec. A PR will follow after some more testing.
The URL issue remained, however. But after thinking about it for a little while I realized that it is actually doing (nearly) the right thing! BibTeX entries are supposed to be consumed by LaTeX, and if you want a ~ in a LaTeX document you have to escape it (e.g. \textasciitilde). What is actually missing is a {} after the command, which is unfortunate, but we can probably do some ad-hoc correction. And of course we should take this into account for the open command.
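To make the ad-hoc correction concrete, here is a minimal sketch, assuming all we need is to append the missing {} wherever \textasciitilde is not already followed by one. The name terminate_textasciitilde is hypothetical and is not part of pybtex or of our code.

```python
import re

# Hypothetical helper for illustration; not part of pybtex or this project.
# Matches \textasciitilde only when it is NOT already followed by "{}".
_BARE_TILDE = re.compile(r"\\textasciitilde(?!\s*\{\})")


def terminate_textasciitilde(text: str) -> str:
    """Append "{}" to every bare \\textasciitilde so LaTeX sees where the command ends."""
    # A function is used as the replacement so that the backslash in the
    # replacement text is not interpreted as a regex escape.
    return _BARE_TILDE.sub(lambda match: r"\textasciitilde{}", text)


if __name__ == "__main__":
    print(terminate_textasciitilde(r"http://www.cs.jhu.edu/\textasciitildephi/"))
```

Occurrences that already carry the {} are left alone, so running the function twice gives the same result as running it once.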
We could also just do this as post-processing, e.g., use the latex decoder before we output strings in the result formatting. I have a fix for this and will push when I have internet from my computer (it’s semi-urgent because it escapes _ in URLs, which breaks “open”).
e.g., try to import this:
The ~ in the URL gets transformed to http://www.cs.jhu.edu/ \textasciitildephi/publications/ neural-interactive-translation.pdf.
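For the “open” case, one possible post-processing step is to undo the LaTeX escaping on the URL just before it is handed to the browser. This is only a sketch under assumptions: unescape_url is a hypothetical name, and the set of escaped characters it handles (~ via \textasciitilde, plus \_, \%, \# and \&) is a guess at what shows up in URLs, not a list taken from pybtex.

```python
import re


def unescape_url(url: str) -> str:
    """Undo common LaTeX escapes in a URL (hypothetical helper, not the actual fix)."""
    # \textasciitilde, with or without a trailing {}, becomes a plain tilde.
    url = re.sub(r"\\textasciitilde(?:\{\})?", "~", url)
    # Backslash-escaped special characters lose their backslash.
    url = re.sub(r"\\([_%#&])", r"\1", url)
    return url


if __name__ == "__main__":
    escaped = r"http://www.cs.jhu.edu/\textasciitildephi/publications/neural-interactive-translation.pdf"
    print(unescape_url(escaped))
```

URLs that contain no escapes pass through unchanged, so it should be safe to apply this to every URL before opening it.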