Open csebban opened 3 years ago
Thanks for the report. I have run into this problem before with this surname-in-capitals citation style (it comes up in some German bibliographies too).
After the bits of a citation are tagged as particular fields, e.g. "author", AnyStyle does further processing ("normalizing"), e.g. to pick out first names/initials and family names. It assumes that the data being processed are in canonical form, so the capitals are confusing (it is a formatting feature of this citation style, not how the names really are).
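A first step would be to switch this normalizer off before you parse the strings:
require 'anystyle'
parser = AnyStyle::Parser.new
# Drop the names normalizer so tagged author strings are left untouched
parser.normalizers.delete_if { |norm| norm.kind_of?(AnyStyle::Normalizer::Names) }
parser.parse(cite_strings) # cite_strings holds your citation strings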
See how your results look after this. You might need to teach the parser further examples of this style so the correct parts of each citation are labelled as author. The default training set probably assigns a low probability to all-caps strings being an author name.
If from there you actually want to transform the names into a canonical form, you will need to either 1) pre-process your input or 2) look at writing your own name normalizer, so that ACHIN, Catherine → Achin, Catherine. See: https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/normalizer/names.rb
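For option 1, a minimal pre-processing sketch might look like the following. It is plain Ruby and not part of AnyStyle; the names CAPS_SURNAME and titlecase_surnames are made up for illustration. The heuristic assumes a surname is an all-caps word (optionally hyphenated) directly followed by a capitalised given name, so it would also rewrite an acronym that happens to precede a capitalised word.

```ruby
# Hypothetical pre-processing helper; not part of AnyStyle.
# Matches an all-caps word (optionally hyphenated) that is directly followed
# by a capitalised given name, e.g. "ACHIN Catherine".
CAPS_SURNAME = /(?<![\p{L}\-])\p{Lu}{2,}(?:-\p{Lu}{2,})*(?=\s+\p{Lu}\p{Ll})/

# Returns a copy of the citation with matching surnames in title case.
def titlecase_surnames(citation)
  citation.gsub(CAPS_SURNAME) do |surname|
    surname.split('-').map(&:capitalize).join('-')
  end
end

cites = [
  'ACHIN Catherine et alii, Sexes, genre et politique, Paris, Economica, 2007.',
  'BAJOS Nathalie et BOZON Michel (dir.), Enquête sur la sexualité en France, Paris, La Découverte, 2008.'
]
cites.each { |cite| puts titlecase_surnames(cite) }
```

An all-caps publisher such as PUR is left alone here because it is followed by a comma rather than a given name. The cleaned strings can then be passed to the parser as before.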
Incidentally, I have never seen "et alii" written out in full in a citation, is this common in your set?
Also, is "dir." a common abbreviation in French bibliographies? (I'm assuming it's something like dirigent·e·s - at the moment AnyStyle only knows editeurs.)
The conversion of the upper-case names to initials is probably caused by this normalization detail, which was added at some point to support Vancouver-style names. I think we should add an option to the normalizer to turn this off.
All-upper-case surnames are still not ideal for the name parser, but in the worst case the names should at least come out as literals and not be wrongly interpreted as initials.
Thanks for the pointer @inukshuk. I agree we probably want options to switch this normalizer on and off or adjust how it works.
I wonder if we want to look at how the normalizers are set up so it's a more general solution, because my suggested workaround (parser.normalizers.delete_if { |norm| norm.kind_of? AnyStyle::Normalizer::Names }) is pretty obscure, but it's the kind of thing I've ended up doing.
The constructor for AnyStyle::Parser could use a little love anyway; it has this https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/parser.rb#L21 which breaks if you try to inherit from it.
Thank you both very much for this solution. I will try to run the program again with these corrections and let you know.
Best regards
Cecile Sebban
@a-fent how does inheritance break this line? Both Parser and Finder inherit from ParserCore just fine. But I fully agree with the larger point: the idea of the normalizers should be that you can easily customize, re-order, disable or enable them to fit your needs, and that's definitely not been achieved yet.
Instead of deleting the name normalizer you could also skip it like this:
parser.normalizers.detect { |n| n.kind_of? AnyStyle::Normalizer::Names }.skip = true
We could add dedicated methods to make this a little less verbose, e.g.:
parser.disable_normalizer AnyStyle::Normalizer::Names
Or even:
parser.disable_normalizer 'names'
However, at the time I did not want to rule out that you could add a normalizer more than once to the list; for example, I could imagine adding a normalizer twice with a different set of options or in different places if the order of execution is important.
> @a-fent how does inheritance break this line? Both Parser and Finder inherit from ParserCore just fine.
If you try to inherit from Parser in your own code (e.g. to tweak the normalizers), it'll throw an obscure error in merge in initialize, because the instance variable @defaults isn't defined in your own class. I usually prefer constants over class instance variables for this. Is that OK with you here?
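To make the failure mode concrete, here is a minimal stand-alone reproduction. BaseParser and CustomParser are stand-in classes written for this illustration, not AnyStyle's actual code; the sketch only assumes that the constructor merges user options into a class instance variable, as described above.

```ruby
# Stand-in for a parser that keeps its defaults in a class instance variable.
class BaseParser
  @defaults = { model: 'default.mod' }

  class << self
    # The reader is inherited by subclasses, but the value of @defaults is not.
    attr_reader :defaults
  end

  attr_reader :options

  def initialize(options = {})
    @options = self.class.defaults.merge(options)
  end
end

# A subclass that does not define its own @defaults.
class CustomParser < BaseParser
end

p BaseParser.new(format: 'bibtex').options

begin
  CustomParser.new
rescue NoMethodError => e
  # CustomParser.defaults is nil, so the call to merge fails.
  puts "#{e.class}: #{e.message}"
end
```

The class-level reader is inherited, but each class has its own @defaults, so the subclass sees nil unless it assigns one itself.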
> We could add dedicated methods to make this a little less verbose, e.g.: ... However, at the time I did not want to rule out that you could add a normalizer more than once to the list; for example, I could imagine adding a normalizer twice with a different set of options or in different places if the order of execution is important.
I really like the way normalizers are designed; it's just the interface that's lacking. I'd suggest the methods prepend_normalizer, append_normalizer and remove_normalizer, which could each take a class, the string name of a class, or an instance. I can implement this if you're in favour.
Oh, I see what you mean now. @defaults is a class instance variable and I actually prefer using it over constants, because you can easily use custom defaults in the sub-class this way. We could change the code to use an empty hash as a fallback so you don't have to define the instance variable yourself, but I figured that in most cases you would want to define custom defaults in a sub-class anyway.
So this is a minimal example that should work:
class MyParser < AnyStyle::Parser
@defaults = {}
end
For the normalizers the intention was that you don't actually need to use your own sub-class, but just adjust the normalizers in your parser instance, like you already do. I agree that the interface is still lacking but at the same time I fail to see the benefit of adding append, remove and similar options instead of just using the array:
parser.append_normalizer Normalizer::Names.new
Is not an improvement over:
parser.normalizers << Normalizer::Names.new
I think this was actually my initial 'problem': I didn't really know how you'd want to adjust the normalizers. I would assume that in most cases you'll want to skip existing ones, change their options, or add custom ones at the end or at a specific position. I couldn't come up with APIs for this that were considerably easier to use than the Enumerable API we already have.
> Oh, I see what you mean now. @defaults is a class instance variable and I actually prefer using it over constants, because you can easily use custom defaults in the sub-class this way.
Using self.class.const_get :FOO in a parent class achieves the same thing transparently without placing any expectations on any user subclass. But mostly I just prefer constants for readability and not needing to define the i.v. readers.
Anyway I wouldn't want to change the code style just for the sake of it, but I do think subclassing Parser is reasonable (e.g. I have quite specialised Parsers with custom features and normalizers embedded in larger apps). So maybe just use Parser#defaults if MyParser#defaults is nil?
> For the normalizers the intention was that you don't actually need to use your own sub-class, but just adjust the normalizers in your parser instance, like you already do. I agree that the interface is still lacking but at the same time I fail to see the benefit of adding append, remove and similar options instead of just using the array:
I'm mostly with you on this and I don't love append_normalizer other than that 1) it signals how the implementation works and 2) as a counterpart to any remove_normalizer method (for which the direct manipulation with delete_if is a bit ugly). But probably better just to document this.
Yes, let's add a fallback in order not to place any extra expectation on subclasses. Not having done a lot with Ruby in recent years, I'm not confident that my erstwhile infatuation with class instance variables stands the test of time. I distinctly remember that one of the reasons I liked the pattern was that it made it possible to override the reader methods in particular. In any case, the pattern is used a lot so I'll stick with it for now.
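One possible shape for such a fallback, sketched with stand-in classes rather than AnyStyle's own code: the class-level reader returns the class's own @defaults when set and otherwise asks the superclass. The names ParserWithFallback, PlainSubclass and TunedSubclass, and the option values, are invented for this example.

```ruby
# Stand-in parser whose defaults reader falls back to the superclass.
class ParserWithFallback
  @defaults = { model: 'default.mod' }

  class << self
    # Use this class's own @defaults if set; otherwise ask the superclass,
    # ending with an empty hash once no ancestor responds to defaults.
    def defaults
      @defaults || (superclass.respond_to?(:defaults) ? superclass.defaults : {})
    end
  end

  attr_reader :options

  def initialize(options = {})
    @options = self.class.defaults.merge(options)
  end
end

# No @defaults of its own: inherits the parent's defaults via the fallback.
class PlainSubclass < ParserWithFallback
end

# Custom defaults still work exactly as with the current pattern.
class TunedSubclass < ParserWithFallback
  @defaults = { model: 'french.mod' }
end

p PlainSubclass.new.options
p TunedSubclass.new(format: 'bibtex').options
```

This keeps class instance variables, so a subclass can still override its defaults, while a subclass that defines nothing no longer fails in the constructor.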
How about we add a shorthand method to select specific normalizers easily? That would be useful for adjusting options and also to skip or re-enable a normalizer. Something like #normalizer() or #get_normalizer() accepting either a class or class name as a string or symbol. I know I've selected normalizers via their class in the past and you also seem to have adopted the same approach so I think a shorthand for this could be helpful. Similarly, we could add a delete_normalizer if you prefer this over just flipping the skip flag.
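As a rough sketch of how such a lookup could behave, the following uses stand-in normalizer classes and a free-standing select_normalizers function; none of these names exist in AnyStyle, and a real version would presumably be a method on the parser. It returns every match rather than the first one, on the assumption that the same normalizer may appear in the list more than once.

```ruby
# Stand-in normalizer classes, only for this illustration.
module Normalizer
  class Base
    attr_writer :skip

    def skip?
      @skip ? true : false
    end
  end

  class Names < Base; end
  class Locale < Base; end
end

# Hypothetical lookup: key may be a class, or a class name given as a
# string or symbol (compared case-insensitively, without the namespace).
def select_normalizers(normalizers, key)
  normalizers.select do |normalizer|
    if key.is_a?(Class)
      normalizer.is_a?(key)
    else
      normalizer.class.name.split('::').last.casecmp?(key.to_s)
    end
  end
end

list = [Normalizer::Names.new, Normalizer::Locale.new, Normalizer::Names.new]

# Skip every names normalizer, however many there are.
select_normalizers(list, :names).each { |normalizer| normalizer.skip = true }
p list.map(&:skip?)
```

Returning an array keeps the door open for adding a normalizer twice with different options; a single-result variant could simply call first on the result.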
Thanks, I like that idea of easier access to a normalizer. The only wrinkle with skip() is that it changes that value for an instance of Normalizer that is actually shared among all instances of Parser. Probably not a problem in most cases, but it could be a trap for apps that run several differently configured Parsers in parallel. Anyway, I'll put an hour aside and develop something in a branch that you can look at.
The normalizer instances should be bound to a specific parser instance, not shared across all of them. Maybe we have an error somewhere that causes this? Out of the box it looks fine, for example:
>> require 'anystyle'
=> true
>> p1 = AnyStyle::Parser.new
=>
#<AnyStyle::Parser:0x000055964bb81ac8
...
>> p2 = AnyStyle::Parser.new
=>
#<AnyStyle::Parser:0x000055964d888ea0
...
>> p1.normalizers[15].skip = true
=> true
>> p1.normalizers[15].skip?
=> true
>> p2.normalizers[15].skip?
=> false
NVM, misread the code, thought the @normalizers= was in class not instance scope.
Author family names are not properly recognized, and get capitalized with dots (as in honorific titles). Example:
Initial bibliography:
ACHIN Catherine et alii, Sexes, genre et politique, Paris, Economica, 2007.
BAJOS Nathalie et BOZON Michel (dir.), Enquête sur la sexualité en France. Pratiques, genre et santé, Paris, La Découverte, 2008.
BANTIGNY Ludivine, BUGNON Fanny et GALLOT Fanny (dir.), « Prolétaires de tous les pays, qui lave vos chaussettes ? ». Le genre de l’engagement dans les années 68, Rennes, PUR, 2017.
BibTeX generated:
@book{achin_sexes_2007,
  address = {Paris},
  title = {Sexes, genre et politique},
  language = {it},
  publisher = {Economica},
  author = {ACHIN, Catherine et alii},
  year = {2007},
}
@book{bozon_michel_enquete_2008,
  address = {Paris},
  title = {Enquête sur la sexualité en {France}. {Pratiques}, genre et santé},
  language = {fr},
  publisher = {La Découverte},
  author = {BOZON Michel, B.A.J.O.S.Nathalie},
  year = {2008},
}
@book{ludivine_proletaires_2017,
  address = {Rennes},
  title = {Prolétaires de tous les pays, qui lave vos chaussettes ? ». {Le} genre de l’engagement dans les années 68},
  language = {fr},
  publisher = {PUR},
  author = {Ludivine, B.A.N.T.I.G.N.Y. and GALLOT Fanny, B.U.G.N.O.N.Fanny},
  year = {2017},
}
I tried tagging the authors again, but it does not seem to work.
Thank you for your help.