inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Problem with author names not recognized properly #174

Open csebban opened 3 years ago

csebban commented 3 years ago

Author family names are not properly recognized: they are turned into capital letters separated by dots (as in honorific titles). Example, from the initial bibliography:

ACHIN Catherine et alii, Sexes, genre et politique, Paris, Economica, 2007.
BAJOS Nathalie et BOZON Michel (dir.), Enquête sur la sexualité en France. Pratiques, genre et santé, Paris, La Découverte, 2008.
BANTIGNY Ludivine, BUGNON Fanny et GALLOT Fanny (dir.), « Prolétaires de tous les pays, qui lave vos chaussettes ? ». Le genre de l’engagement dans les années 68, Rennes, PUR, 2017.

BibTeX generated:

@book{achin_sexes_2007,
  address = {Paris},
  title = {Sexes, genre et politique},
  language = {it},
  publisher = {Economica},
  author = {ACHIN, Catherine et alii},
  year = {2007},
}

@book{bozon_michel_enquete_2008,
  address = {Paris},
  title = {Enquête sur la sexualité en {France}. {Pratiques}, genre et santé},
  language = {fr},
  publisher = {La Découverte},
  author = {BOZON Michel, B.A.J.O.S.Nathalie},
  year = {2008},
}

@book{ludivine_proletaires_2017,
  address = {Rennes},
  title = {Prolétaires de tous les pays, qui lave vos chaussettes ? ». {Le} genre de l’engagement dans les années 68},
  language = {fr},
  publisher = {PUR},
  author = {Ludivine, B.A.N.T.I.G.N.Y. and GALLOT Fanny, B.U.G.N.O.N.Fanny},
  year = {2017},
}

I tried tagging the authors again, but that does not seem to help.

Thank you for your help.

a-fent commented 3 years ago

Thanks for the report. I have run into this problem before with this surname-in-capitals citation style (it comes up in some German bibliographies too).

After the bits of a citation are tagged as particular fields, e.g. "author", AnyStyle does further processing ("normalizing"), e.g. to pick out first names/initials and family names. It assumes that the data being processed are in canonical form, so the capitals are confusing (it is a formatting feature of this citation style, not how the names really are).

A first step would be to switch this off before you parse the strings:

require 'anystyle'

parser = AnyStyle::Parser.new
# Drop the name normalizer so tagged author strings are left as they are
parser.normalizers.delete_if { |norm| norm.kind_of? AnyStyle::Normalizer::Names }
parser.parse(cite_strings)

See how your results look after this. You might need to teach the parser further examples of this style so the correct parts of each citation are labelled as author. The default training set probably assigns a low probability to all-caps strings being an author name.

If from there you actually want to transform the names into a canonical form, you will need to either 1) pre-process your input or 2) look at writing your own name normalizer, so that ACHIN, Catherine → Achin, Catherine. See: https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/normalizer/names.rb
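As a rough illustration of option 1 (pre-processing the input), the sketch below rewrites an all-caps word as a capitalized one when it is directly followed by a capitalized given name. The helper name `decapitalize_surnames` and the regular expression are assumptions made for this example and are not part of AnyStyle. The heuristic is deliberately narrow, so it leaves acronyms such as "PUR" alone, but it will miss hyphenated surnames and may wrongly touch an acronym that happens to precede a capitalized word.

```ruby
# Hypothetical helper, not part of AnyStyle.
# Matches a run of two or more uppercase letters that is not preceded by a
# letter and is followed by whitespace plus a capitalized word (given name).
def decapitalize_surnames(citation)
  citation.gsub(/(?<!\p{L})\p{Lu}{2,}(?=\s+\p{Lu}\p{Ll})/) do |surname|
    surname.capitalize
  end
end

refs = [
  'ACHIN Catherine et alii, Sexes, genre et politique, Paris, Economica, 2007.',
  'BAJOS Nathalie et BOZON Michel (dir.), Enquête sur la sexualité en France. ' \
  'Pratiques, genre et santé, Paris, La Découverte, 2008.'
]

# Clean the strings first, then hand `cleaned` to the parser
cleaned = refs.map { |ref| decapitalize_surnames(ref) }
puts cleaned
```

The cleaned strings could then be passed to the parser in place of the originals. Whether the default model labels them correctly afterwards still has to be checked against real data.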

Incidentally, I have never seen "et alii" written out in full in a citation, is this common in your set?

a-fent commented 3 years ago

Also, is "dir." a common abbreviation in French bibliographies? (I'm assuming it's something like dirigent·e·s - at the moment AnyStyle only knows editeurs.)

inukshuk commented 3 years ago

The conversion of the upper-case names to initials is probably caused by this normalization detail, which was added at some point to support Vancouver-style names. I think we should add an option to the normalizer to turn this off.

All-upper-case surnames are still not ideal for the name parser, but in the worst case the names should at least come out as literals and not be wrongly interpreted as initials.

a-fent commented 3 years ago

Thanks for the pointer @inukshuk . I agree we probably want options to switch this normalizer on and off or adjust how it works.

I wonder if we want to look at how the normalizers are set up so it's a more general solution, because my suggested workaround (parser.normalizers.delete_if { | norm | norm.kind_of? AnyStyle::Normalizer::Names }) is pretty obscure, but it's the kind of thing I've ended up doing.

The constructor for AnyStyle::Parser could use a little love anyway, it has this https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/parser.rb#L21 which breaks if you try to inherit from it.

csebban commented 3 years ago

Thank you both very much for this solution. I will try to run the program again with these corrections and let you know.

Best regards

Cecile Sebban



inukshuk commented 3 years ago

@a-fent how does inheritance break this line? Both Parser and Finder inherit from ParserCore just fine. But I fully agree with the larger point; the idea of the normalizers should be that you can easily customize, re-order, disable or enable them to fit your needs -- and that's definitely not been achieved yet.

Instead of deleting the name normalizer you could also skip it like this:

parser.normalizers.detect { |n| n.kind_of? AnyStyle::Normalizer::Names }.skip = true

We could add dedicated methods to make this a little less verbose, e.g.:

parser.disable_normalizer AnyStyle::Normalizer::Names

Or even:

parser.disable_normalizer 'names'

However, at the time I did not want to rule out that you could add a normalizer more than once to the list; for example, I could imagine adding a normalizer twice with a different set of options or in different places if the order of execution is important.

a-fent commented 3 years ago

> @a-fent how does inheritance break this line? Both Parser and Finder inherit from ParserCore just fine.

If you try to inherit from Parser in your own code (e.g. to tweak the normalizers), it'll throw an obscure error in merge in initialize because the instance variable @defaults isn't defined in your own class. I usually prefer constants over class instance variables for this; is that OK with you here?
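The failure described here can be reproduced without AnyStyle. The sketch below uses stand-in classes (`BaseParser` and `CustomParser` are invented for this example) and assumes the constructor merges the caller's options into a class-level `@defaults`, as the comment above suggests. A class instance variable belongs to one class object only, so a subclass that never assigns it reads `nil`.

```ruby
# Stand-in classes for illustration only; this is not AnyStyle's actual code.
class BaseParser
  @defaults = { model: 'default.mod' }

  class << self
    attr_reader :defaults
  end

  attr_reader :options

  def initialize(options = {})
    # self.class.defaults is nil for a subclass that never set @defaults
    @options = self.class.defaults.merge(options)
  end
end

class CustomParser < BaseParser
end

BaseParser.new(threads: 2) # works: BaseParser has its own @defaults

begin
  CustomParser.new
rescue NoMethodError => e
  # merge was called on nil
  puts "subclass failed: #{e.message}"
end
```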

> We could add dedicated methods to make this a little less verbose, e.g.: ... However, at the time I did not want to rule out that you could add a normalizer more than once to the list; for example, I could imagine adding a normalizer twice with a different set of options or in different places if the order of execution is important.

I really like the way normalizers are designed; it's just the interface that's lacking. I'd suggest the methods prepend_normalizer, append_normalizer and remove_normalizer, which could each take a class, the string name of a class, or an instance. I can implement this if you're in favour.

inukshuk commented 3 years ago

Oh, I see what you mean now. @defaults is a class instance variable and I actually prefer using it over constants, because you can easily use custom defaults in the sub-class this way. We could change the code to use an empty hash as a fallback so you don't have to define the instance variable yourself, but I figured that in most cases you would want to define custom defaults in a sub-class anyway.

So this is a minimal example that should work:

class MyParser < AnyStyle::Parser
  @defaults = {}
end

For the normalizers the intention was that you don't actually need to use your own sub-class, but just adjust the normalizers in your parser instance, like you already do. I agree that the interface is still lacking but at the same time I fail to see the benefit of adding append, remove and similar options instead of just using the array:

parser.append_normalizer Normalizer::Names.new

Is not an improvement over:

parser.normalizers << Normalizer::Names.new

I think this was actually my initial 'problem': I didn't really know how you'd want to adjust the normalizers. I would assume that in most cases you'll want to skip existing ones, change their options, or add custom ones at the end or at a specific position. I couldn't come up with APIs for this that were considerably easier to use than the Enumerable API we already have.
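For reference, the adjustments listed above (skipping, appending, inserting at a position, removing) map onto plain Array methods roughly as follows. The classes `StubNames`, `StubDates` and `StubCustom` are stand-ins invented for this sketch; the only behaviour assumed from the thread is that a normalizer exposes a `skip` flag and that the parser keeps its normalizers in an array.

```ruby
# Stand-ins for illustration; real normalizers live under AnyStyle::Normalizer.
class StubNormalizer
  attr_accessor :skip

  def initialize
    @skip = false
  end
end

class StubNames < StubNormalizer; end
class StubDates < StubNormalizer; end
class StubCustom < StubNormalizer; end # hypothetical user-defined normalizer

normalizers = [StubNames.new, StubDates.new]

# Skip an existing normalizer without removing it
normalizers.find { |n| n.is_a?(StubNames) }.skip = true

# Add a custom normalizer at the end ...
normalizers << StubCustom.new

# ... and a second instance at a specific position, here right before StubDates
position = normalizers.index { |n| n.is_a?(StubDates) }
normalizers.insert(position, StubCustom.new)

# Remove every instance of one class
normalizers.delete_if { |n| n.is_a?(StubDates) }

puts normalizers.map { |n| n.class.name }.inspect
```

Note that the same normalizer class ends up in the list twice, which is the case a name-based remove or disable method would have to define behaviour for.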

a-fent commented 3 years ago

> Oh, I see what you mean now. @defaults is a class instance variable and I actually prefer using it over constants, because you can easily use custom defaults in the sub-class this way.

Using self.class.const_get :FOO in a parent class achieves the same thing transparently without placing any expectations on any user subclass. But mostly I just prefer constants for readability and not needing to define the i.v. readers.
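A minimal sketch of the constant-based variant, using invented class names (`ConstParser`, `PlainSubclass`, `TunedSubclass`) rather than AnyStyle's own. `Module#const_get` starts its lookup at the receiver and continues through its ancestors, so a subclass may define its own `DEFAULTS` but does not have to.

```ruby
# Stand-in classes for illustration only; this is not AnyStyle's actual code.
class ConstParser
  DEFAULTS = { model: 'default.mod' }.freeze

  attr_reader :options

  def initialize(options = {})
    # Looks in the instance's own class first, then in its ancestors
    @options = self.class.const_get(:DEFAULTS).merge(options)
  end
end

# Defines nothing: inherits ConstParser::DEFAULTS through const_get
class PlainSubclass < ConstParser
end

# Overrides the defaults by defining its own constant
class TunedSubclass < ConstParser
  DEFAULTS = { model: 'custom.mod' }.freeze
end

puts PlainSubclass.new.options.inspect
puts TunedSubclass.new.options.inspect
```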

Anyway I wouldn't want to change the code style just for the sake of it, but I do think subclassing Parser is reasonable (e.g. I have quite specialised Parsers with custom features and normalizers embedded in larger apps). So maybe just use Parser#defaults if MyParser#defaults is nil?

> For the normalizers the intention was that you don't actually need to use your own sub-class, but just adjust the normalizers in your parser instance, like you already do. I agree that the interface is still lacking but at the same time I fail to see the benefit of adding append, remove and similar options instead of just using the array:

I'm mostly with you on this and I don't love append_normalizer other than that 1) it signals how the implementation works and 2) as a counterpart to any remove_normalizer method (for which the direct manipulation with delete_if is a bit ugly). But probably better just to document this.

inukshuk commented 3 years ago

Yes, let's add a fallback in order not to place any extra expectations on subclasses. Not having done a lot with Ruby in recent years, I'm not confident that my erstwhile infatuation with class instance variables stands the test of time. I distinctly remember that one of the reasons I liked the pattern was that it allowed overriding the reader methods in particular. In any case, the pattern is used a lot so I'll stick with it for now.
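One possible shape for such a fallback, sketched with invented class names and not taken from the AnyStyle code base: the class-level reader returns its own `@defaults` when set, otherwise asks the superclass, and uses an empty hash as a last resort.

```ruby
# Stand-in classes for illustration only; this is not AnyStyle's actual code.
class FallbackParser
  @defaults = { model: 'default.mod' }

  class << self
    # Own @defaults if set, else the superclass's defaults, else an empty hash
    def defaults
      return @defaults unless @defaults.nil?

      superclass.respond_to?(:defaults) ? superclass.defaults : {}
    end
  end

  attr_reader :options

  def initialize(options = {})
    @options = self.class.defaults.merge(options)
  end
end

# No @defaults here: falls back to FallbackParser's
class BareSubclass < FallbackParser
end

# Custom defaults still work the same way as before
class OverridingSubclass < FallbackParser
  @defaults = { model: 'custom.mod' }
end

puts BareSubclass.new.options.inspect
puts OverridingSubclass.new.options.inspect
```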

How about we add a shorthand method to select specific normalizers easily? That would be useful for adjusting options and also to skip or re-enable a normalizer. Something like #normalizer() or #get_normalizer() accepting either a class or class name as a string or symbol. I know I've selected normalizers via their class in the past and you also seem to have adopted the same approach so I think a shorthand for this could be helpful. Similarly, we could add a delete_normalizer if you prefer this over just flipping the skip flag.
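A sketch of what such a lookup shorthand might look like. Everything here is hypothetical: the `Sketch` module, its stub classes, and the method name `normalizer` only mirror the proposal above and are not an existing AnyStyle API. The method returns the first match, which leaves open how duplicates of the same class should be handled.

```ruby
# Stand-in classes; the `normalizer` method is an assumption based on the
# proposal above, not an existing AnyStyle API.
module Sketch
  module Normalizer
    class Base
      attr_accessor :skip

      def initialize
        @skip = false
      end
    end

    class Names < Base; end
    class Locale < Base; end
  end

  class Parser
    attr_reader :normalizers

    def initialize
      @normalizers = [Normalizer::Names.new, Normalizer::Locale.new]
    end

    # Accepts a class, or a class name given as a String or Symbol.
    # Returns the first matching normalizer, or nil if there is none.
    def normalizer(key)
      case key
      when Class
        normalizers.find { |n| n.is_a?(key) }
      else
        wanted = key.to_s.downcase
        normalizers.find { |n| n.class.name.split('::').last.downcase == wanted }
      end
    end
  end
end

parser = Sketch::Parser.new
parser.normalizer(:names).skip = true           # flip the skip flag by name
locale = parser.normalizer(Sketch::Normalizer::Locale) # or select by class
```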

a-fent commented 3 years ago

Thanks, I like that idea of easier access to a normalizer. The only wrinkle with skip() is that it changes that value for an instance of Normalizer that is actually shared among all instances of Parser. Probably not a problem in most cases, but it could be a trap for apps that run multiple differently configured Parsers in parallel. Anyway, I'll put an hour aside and develop something in a branch that you can look at.

inukshuk commented 3 years ago

The normalizer instances should be bound to a specific parser instance, not shared across all of them. Maybe we have an error somewhere that causes this? Out of the box it looks fine, for example:

>> require 'anystyle'
=> true
>> p1 = AnyStyle::Parser.new
=> 
#<AnyStyle::Parser:0x000055964bb81ac8
...
>> p2 = AnyStyle::Parser.new
=> 
#<AnyStyle::Parser:0x000055964d888ea0
...
>> p1.normalizers[15].skip = true
=> true
>> p1.normalizers[15].skip?
=> true
>> p2.normalizers[15].skip?
=> false

a-fent commented 3 years ago

Never mind, I misread the code: I thought the @normalizers = assignment was in class scope, not instance scope.
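The distinction at stake can be shown with stand-in classes (all names below are invented for this sketch). A list built once at class scope is shared by every instance, so flipping a flag through one parser is visible through the others. A list built in `initialize`, which is what the session above shows for AnyStyle, is private to each parser.

```ruby
# Stand-in classes for illustration only; this is not AnyStyle's actual code.
class Flag
  attr_accessor :skip

  def initialize
    @skip = false
  end
end

class SharedList
  @normalizers = [Flag.new] # built once, at class scope

  class << self
    attr_reader :normalizers
  end

  def normalizers
    self.class.normalizers
  end
end

class OwnList
  attr_reader :normalizers

  def initialize
    @normalizers = [Flag.new] # built for each parser instance
  end
end

a = SharedList.new
b = SharedList.new
a.normalizers.first.skip = true
shared_result = b.normalizers.first.skip # true: both see the same object

c = OwnList.new
d = OwnList.new
c.normalizers.first.skip = true
own_result = d.normalizers.first.skip # false: each has its own normalizer

puts "shared: #{shared_result}, per instance: #{own_result}"
```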