inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

issue with the "find" command #152

Open sylvainloiseau opened 3 years ago

sylvainloiseau commented 3 years ago

Some issue with find. Maybe I'm missing something obvious.

$ anystyle find booklet.txt
Error processing `BrochureBCSDL.tex'
undefined method `split' for nil:NilClass
/Library/Ruby/Gems/2.6.0/gems/anystyle-1.3.12/lib/anystyle/document.rb:11:in `parse'
/Library/Ruby/Gems/2.6.0/gems/anystyle-1.3.12/lib/anystyle/document.rb:40:in `open'
...

inukshuk commented 3 years ago

Thanks for reporting! Could you share the booklet.txt file with me to let me reproduce the issue?

sylvainloiseau commented 3 years ago

Sure, here it is. Thanks for looking into this. If I rename the file with a .txt extension, it seems to work (but it only matches references that do not start with TeX markup such as '{'; maybe that is expected).

$ anystyle --version
anystyle version 1.3.12 (cli 1.3.1, data 1.2.0)

Sylvain


Sylvain Loiseau sylvain.loiseau@univ-paris13.fr

Université Paris 13 Sorbonne-Paris-Cité 99 avenue Jean-Baptiste Clément F-93430 Villetaneuse

Laboratoire « Langues et civilisations à tradition orale » (UMR 7107 CNRS) Campus CNRS 7, rue Guy Môquet (bât. D) F-94801 Villejuif Cedex http://lacito.vjf.cnrs.fr

inukshuk commented 3 years ago

E-mail attachments don't get through, you need to attach them here on GitHub.

Anyway, the file extension is important, because the Finder module handles input files differently based on it. The txt input is intended for plain text documents, ideally without additional markup; normally this is the result of a conversion from PDF to text. If you're working with a TeX file as input, it's very likely that either the finder or the parser module will get confused by the additional markup. That is something that could be trained into the model, but what I would suggest in this case is to generate plain text from the TeX and parse that instead of the source, as you might get better results this way.
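To illustrate the suggestion of generating plain text from the TeX first, here is a minimal sketch in Ruby. The `strip_tex` helper is hypothetical and not part of AnyStyle; it assumes the input only uses simple constructs such as `\emph{...}` and `%` comments. For real documents a dedicated converter such as detex or pandoc is likely to be more robust.

```ruby
# Hypothetical helper, NOT part of AnyStyle: removes a few common TeX
# constructs so the remaining text can be saved with a .txt extension.
def strip_tex(source)
  text = source.dup
  # Drop comments: an unescaped % up to the end of the line.
  text.gsub!(/(?<!\\)%.*$/, '')
  # Unwrap commands with one braced argument, e.g. \emph{word} -> word.
  # gsub! returns nil once nothing matches, so nested commands are
  # unwrapped from the inside out over successive passes.
  nil while text.gsub!(/\\[a-zA-Z]+\*?\{([^{}]*)\}/, '\1')
  # Remove remaining bare commands such as \item or \noindent.
  text.gsub!(/\\[a-zA-Z]+\*?/, '')
  # Remove leftover grouping braces.
  text.gsub!(/[{}]/, '')
  # Collapse runs of spaces and tabs, then trim both ends.
  text.gsub(/[ \t]+/, ' ').strip
end

# Single-quoted heredoc: backslashes are kept literally.
sample = <<~'TEX'
  Nicholas Evans (2012) \emph{Ces mots qui meurent}, La Découverte, Paris.
TEX

puts strip_tex(sample)
```

The cleaned text could then be written to a file such as booklet.txt and passed to `anystyle find`. Accent macros, math, and environments are not handled by this sketch.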

sylvainloiseau commented 3 years ago

OK, sorry for the false alarm. It is in fact due to the .tex extension: with a .tex extension, it fails even without TeX markup. The following content, in a file named "booklet.tex", produces the same error:

Thanks for this very useful and nicely designed tool, Sylvain

====
Nicholas Evans (2012) \emph{Ces mots qui meurent. Les langues menacées et ce qu'elles ont à nous dire}, La Découverte, Paris.
====

$ anystyle find booklet.tex
Error processing `booklet.tex'
undefined method `split' for nil:NilClass
/Library/Ruby/Gems/2.6.0/gems/anystyle-1.3.12/lib/anystyle/document.rb:11:in `parse'
/Library/Ruby/Gems/2.6.0/gems/anystyle-1.3.12/lib/anystyle/document.rb:40:in `open'
...


