The journal patterns in the DeduplicationService::compareJournals_... methods originally used the alternation "(\b|-|)", and some patterns were anchored to the beginning of the string ("^").
The alternation was pointless: its empty last alternative always matches, so the result is the same as with ".*" alone. Replacing the alternation with "\b" makes more sense, but should be tested.
Anchoring to the beginning of the string may be too strict.
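Why the alternation adds nothing can be checked with a small sketch. It is written in Python rather than in the project's own code, and it rests on assumptions: the helper build_pattern is hypothetical, each abbreviated word is assumed to be followed by ".*", and matching is assumed to be case-insensitive and unanchored.

```python
import re

ALTERNATION = r"(\b|-|)"  # separator as originally used, per the text above


def build_pattern(abbreviation, separator):
    # Hypothetical helper: "Clin Transl Sci" -> Clin.*<sep>Transl.*<sep>Sci.*
    words = [re.escape(word) for word in abbreviation.split()]
    return re.compile((".*" + separator).join(words) + ".*", re.IGNORECASE)


ABBREVIATIONS = ["Clin Transl Sci", "J Ky Med Assoc", "Altern Lab Anim"]
TITLES = [
    "Clinical and Translational Science",
    "Journal of the Kentucky Medical Association",
    "Atla Alternatives To Laboratory Animals",
    "ATLA-Alternatives to Laboratory Animals",
]


def same_results():
    # Compare the pattern with the alternation against the pattern with no separator at all.
    for abbreviation in ABBREVIATIONS:
        with_alternation = build_pattern(abbreviation, ALTERNATION)
        without_separator = build_pattern(abbreviation, "")
        for title in TITLES:
            if bool(with_alternation.search(title)) != bool(without_separator.search(title)):
                return False
    return True


print(same_results())  # True
```

Because the empty alternative can always match, and a hyphen could just as well be consumed by the preceding ".*", both patterns accept exactly the same titles.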
Tests
One case where the alternation leads to fewer False Negatives: ASySD_Depression has 2 fewer False Negatives.
"The Journal of the Kentucky Medical Association.95 (4) ()(pp 145-148) 1997.Date of Publication: Apr 1997." compared as "Journal of the Kentucky Medical Association"
"J.Ky.Med.Assoc." compared as "J Ky Med Assoc"
"Ky" as an abnormal abbreviation for "Kentucky" matches "Journal of the Kentucky Medical Association"
Two cases where the alternation leads to more False Negatives (ASySD_SRSR_Human: 101 --> 112, SRA2_Cytology_screening: 58 --> 60). Examples from ASySD_SRSR_Human are in the table below.
In all other cases there is no difference.
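The Kentucky case listed above can be reproduced with a minimal Python sketch. The two patterns are assumptions about how "J Ky Med Assoc" is turned into a regular expression (".*" after each word, case-insensitive, unanchored search); the real patterns live in DeduplicationService.

```python
import re

title = "Journal of the Kentucky Medical Association"

# Assumed shape of the pattern for "J Ky Med Assoc", with and without the alternation.
with_alternation = re.compile(r"J.*(\b|-|)Ky.*(\b|-|)Med.*(\b|-|)Assoc.*", re.IGNORECASE)
word_break_only = re.compile(r"J.*\bKy.*\bMed.*\bAssoc.*", re.IGNORECASE)

print(bool(with_alternation.search(title)))  # True: "Ky" matches inside "Kentucky"
print(bool(word_break_only.search(title)))   # False: no word in the title starts with "Ky"
```

Under these assumptions the word break rejects the pair because "Ky" only occurs at the end of "Kentucky", not at the start of a word.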
Missed journals in ASySD_SRSR_Human

| Journal 1 | Journal 2 |
| --- | --- |
| ADHD-ATTENTION DEFICIT AND HYPERACTIVITY DISORDERS | Atten Defic Hyperact Disord |
| ATLA-ALTERNATIVES TO LABORATORY ANIMALS | Altern Lab Anim |
| CTS-CLINICAL AND TRANSLATIONAL SCIENCE | Clin Transl Sci |
| MLTJ-MUSCLES LIGAMENTS AND TENDONS JOURNAL | Muscles Ligaments Tendons J |
All these cases combine a starting initialism, all caps, and a hyphen between the initialism and the rest. In Record::addJournals, journals in all caps are capitalized word by word, so the starting initialism is no longer recognizable as such: "Adhd Attention Deficit And Hyperactivity Disorders".
Not anchoring the pattern to the beginning of the string lets the pattern for journal 2 match journal 1: "Atten.*\bDefic.*\bHyperact.*\bDisord.*" matches "Adhd Attention Deficit And Hyperactivity Disorders".
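The effect of anchoring on the first table row can be illustrated with a short Python sketch. It assumes that each abbreviated word in the pattern is followed by ".*", that matching is case-insensitive, and that journal 1 has already been capitalized as described above; none of this is taken from the actual DeduplicationService code.

```python
import re

# Journal 1 after capitalization, as quoted in the text above.
journal_1 = "Adhd Attention Deficit And Hyperactivity Disorders"

# Assumed pattern for journal 2 ("Atten Defic Hyperact Disord"), without and with "^".
unanchored = re.compile(r"Atten.*\bDefic.*\bHyperact.*\bDisord.*", re.IGNORECASE)
anchored = re.compile(r"^Atten.*\bDefic.*\bHyperact.*\bDisord.*", re.IGNORECASE)

print(bool(unanchored.search(journal_1)))  # True: the match starts at "Attention"
print(bool(anchored.search(journal_1)))    # False: the string starts with "Adhd"
```

With the anchor, the leading "Adhd" blocks the match, which is how these pairs end up as False Negatives.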
Decision
Replacing the alternation with a word break only is the better choice: it is faster and misses only one exceptional case.
Leaving out the anchoring to the beginning of the string matches more journals (fewer False Negatives).