Open kousu opened 2 years ago
In our case the matches would also improve if they were done case-insensitively. So, e.g., rewrite difflib.get_close_matches to
--- /tmp/a 2021-10-16 19:36:44.887957974 -0400
+++ /tmp/b 2021-10-16 19:36:54.873261759 -0400
@@ -33,9 +33,9 @@
         raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
     result = []
     s = SequenceMatcher()
-    s.set_seq2(word)
+    s.set_seq2(word.lower())
     for x in possibilities:
-        s.set_seq1(x)
+        s.set_seq1(x.lower())
         if s.real_quick_ratio() >= cutoff and \
            s.quick_ratio() >= cutoff and \
            s.ratio() >= cutoff:
or, if not using get_close_matches() directly but instead, say, using linear_sum_assignment(), at least make sure to call .lower() before computing the ratios.
Another optimization, effective for us but maybe not for every situation, is to write ratio() something like
from difflib import SequenceMatcher

def ratio(a, b):
    if a == b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()
because we have a lot of exact matches and == is faster than SequenceMatcher.
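The case-insensitive matching suggested earlier in this comment does not strictly require patching difflib. Here is a minimal sketch, assuming a hypothetical wrapper named get_close_matches_icase (it is not part of difflib or bibeasy): lower-case both sides, call the stock difflib.get_close_matches(), then map the hits back to their original spellings.

```python
import difflib


def get_close_matches_icase(word, possibilities, n=3, cutoff=0.6):
    """Case-insensitive variant of difflib.get_close_matches (hypothetical helper)."""
    # Map each lower-cased candidate back to every original spelling it came from.
    lowered = {}
    for candidate in possibilities:
        lowered.setdefault(candidate.lower(), []).append(candidate)

    # Run the unmodified stdlib matcher on the lower-cased strings only.
    hits = difflib.get_close_matches(word.lower(), list(lowered), n=n, cutoff=cutoff)

    # Expand each hit back into the original strings, best match first.
    return [original for hit in hits for original in lowered[hit]]


if __name__ == "__main__":
    print(get_close_matches_icase("propseg: automatic", ["PropSeg: Automatic", "zzz"]))
```

One consequence of this design: if several candidates differ only in case, all of them are returned for a single hit, so the result can be longer than n.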
Another optimization might be to clamp pairs below a threshold to 0.0? Maybe that doesn't save any time though.
Another optimization is to install https://github.com/miohtama/python-Levenshtein and use its drop-in-compatible SequenceMatcher. It's written in C so it's much faster at string searching.
id_ccv = df_ccv[df_ccv['Title'].isin(difflib.get_close_matches(row['Title'], df_ccv['Title']))].index.values
EDIT: you need to use
id_ccv = df_ccv[df_ccv['Title'].isin(difflib.get_close_matches(row['Title'],df_ccv_unmatched['Title']))].index.values
and also patch
- id_ccv = np.array(id_ccv_single)
+ id_ccv = np.array([id_ccv_single])
I tried this out. I ran bibeasy -x CCV.xml on master and on my edit, which I'm not posting as a branch because it's just a quick prototype to see how it would affect the output. Here's the diff:
So the main change was that it matched 9 articles it previously didn't recognize as duplicates. I haven't looked at which ones they are yet (and might not bother) but I guess that's a reasonable thing for it to do?
-Results for type: 'article': Found: 141 | Not in CCV: 10 | Duplicate: 0 | Not in Gsheet: 12
+Results for type: 'article': Found: 141 | Not in CCV: 1 | Duplicate: 9 | Not in Gsheet: 12
It was also a lot slower: it took double the time. I think the inner .isin() is killing the runtime, because get_close_matches is an O(n) search and isin() is another O(n) search, and both run once per row, so the whole thing is roughly 2*n^2 operations, i.e. O(n^2).
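One way to drop the second scan is to build a title-to-position lookup once and reuse it for every row. This is only a sketch in plain Python standing in for the DataFrame code; build_title_index and close_match_positions are hypothetical names, not bibeasy functions. It removes the isin() pass but leaves get_close_matches() as an O(n) search per row, so the overall cost stays quadratic.

```python
import difflib
from collections import defaultdict


def build_title_index(titles):
    """Map each distinct title to the list of positions where it occurs (built once)."""
    index = defaultdict(list)
    for position, title in enumerate(titles):
        index[title].append(position)
    return index


def close_match_positions(query, index, cutoff=0.6):
    """Return positions of all titles close to `query`, using dict lookups instead of isin()."""
    hits = difflib.get_close_matches(query, index.keys(), cutoff=cutoff)
    # Each hit is resolved with a constant-time dict lookup rather than a second scan.
    return [position for hit in hits for position in index[hit]]


if __name__ == "__main__":
    titles = ["Spinal cord fMRI", "Coil arrays", "Spinal cord fMRI"]
    index = build_title_index(titles)
    print(close_match_positions("Spinal cord fMRI.", index))
```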
These are great investigations and suggestions, Nick. They exactly address limitations of the implementation that I've been aware of for many years but never took the time to fix.
This should be helpful for feeding the linear-solver:
import difflib

def quick_quick_ratio(a, b, threshold=0.6):
    """
    *Quickly* compute the similarity ratio of two strings.
    The speed comes from using approximations: if the ratio
    is below the given threshold return 0.0 early. This means
    that strings that are obviously different do not need the
    full computation performed.
    Most uses of ratio() are interested in *similar* strings not
    *dissimilar* ones.
    Equivalent to:
        difflib.SequenceMatcher(None, a, b).ratio() \
            if difflib.SequenceMatcher(None, a, b).ratio() >= threshold \
            else 0.0
    """
    if a == b: return 1.0  # TODO: test if having this here is faster or slower
    m = difflib.SequenceMatcher(None, a, b)
    if m.real_quick_ratio() < threshold: return 0.0
    if m.quick_ratio() < threshold: return 0.0
    r = m.ratio()
    if r < threshold: return 0.0
    return r
I'm getting ahead of myself here, but I worked out a prototype. I first did a pure-python version and then I used scipy.optimize.linear_sum_assignment; they get the same results on the current citation database, but linear_sum_assignment is faster and, in theory, more accurate in the case of ambiguous data.
This gets me:
kousu@ail:~/src/neuropoly/bibeasy$ python3 fuzzy.py
precomputing scores: 100%|██████████| 168350/168350 [00:03<00:00, 45922.91it/s]
assign1: 11.565089895999336
assign2: 0.38203748100022494
assign3: 0.20900208400053089
344 137 6
344 137 6
344 137 6
extending assign3 @ 0.8: 0.0023871119992691092
344 137 6
extending assign3 @ 0.7: 0.0019260959998064209
344 137 6
extending assign3 @ 0.6: 0.0018989520012837602
344 137 6
0.9963898916967509
'Spatial correspondence of spinal cord white matter tracts using diffusion tensor imaging, fibre tractography, and atlas-based segmentation' =>
'Spatial correspondence of spinal cord white matter tracts using diffusion tensor imaging, fibre tractography, and atlas-based segmentation.'
0.9961685823754789
'Multivariate combination of magnetization transfer ratio and quantitative T2* to detect subpial demyelination in multiple sclerosis' =>
'Multivariate combination of magnetization transfer ratio and quantitative T2* to detectsubpial demyelination in multiple sclerosis'
0.9960159362549801
'Convolutional neural network based segmentation of the spinal cord and intramedullary injury in acute blunt spinal cord trauma' =>
'Convolutional neural network based segmentation of the spinal cord and intramedullary injury in acute blunt spinal cor trauma'
0.9952153110047847
'Fully automated segmentation of the cervical spinal cord using PropSeg: application to multiple sclerosis' =>
'Fully automated segmentation of the cervical spinal cord using PropSeg:application to multiple sclerosis'
0.9947089947089947
'A simple and robust method for automating analysis of naïve and regenerating peripheral nerves' =>
'A simple and robust method for automating analysis of naïve and regenerating peripheral nerves.'
0.9945945945945946
'Automatic multiclass intramedullary spinal cord tumor segmentation on MRI with deep learning' =>
'Automatic multiclass intramedullary spinal cord tumor segmentation on MRI with deep learning.'
0.9943502824858758
'The R1-weighted connectome: complementing brain networks with a myelin-sensitive measure' =>
'The R1-weighted connectome: complementing brain networks with a myelin-sensitive measure.'
0.9940828402366864
'Translating AxCaliber on a clinical system : 600mT/m versus optimized 80mT/m protocol' =>
'Translating AxCaliber on a clinical system: 600mT/m versus optimized 80mT/m protocol'
0.9935897435897436
'Approches en imagerie par resonance magnétique pour l’étude de l’impact de la rigidité artérielle sur la matière blanche du cerveau chez les personnes âgées' =>
'Approches en imagerie par résonance magnétique pour l’étude de l’impact de la rigidité artérielle sur la matière blanche du cerveau chez les personnes âgées'
0.9935064935064936
'Comparison of cervical cord results from a quantative 3D multi-parameter mapping (MPM) protocol of the whole brain with a dedicated cervical cord protocol' =>
'Comparison of cervical cord results from a quantitive 3D multi-parameter mapping (MPM) protocol of the whole brain with a dedicated cervical cord protocol'
0.9932885906040269
'A unified signal readout improves denoising of multi- modal spinal cord MRI' =>
'A unified signal readout improves denoising of multi-modal spinal cord MRI'
0.9924242424242424
'Validation of a 2D Spinal Cord probabilistic atlas: Application to FA measurement and VBM study of the GM atrophy occurring with age' =>
'Validation of a 2D Spinal Cord probabilistic atlas. Application to FA measurement and VBM study of the GM atrophy occurring with age'
0.9916666666666667
'A 24-channel shim array for real-time shimming of the human spinal cord: Characterization and proof-of-concept experiment' =>
'A 24-channel shimarray for real-time shimming of the human spinal cord: Characterization and proof-of-conceptexperiment'
0.98989898989899
'Quantitative magnetic resonance imaging of spinal cordmicrostructure in adults with cerebral palsy' =>
'Quantitative magnetic resonance imaging of spinal cord microstructure in adults with cerebral palsy.'
0.975
'Fully-\xadintegrated T1\u200b, T2\u200b, T2\u200b*, white and gray matter atlases of the spinal cord' =>
'Fully-\xadintegrated T1,T2, T2*, white and gray matter atlases of the spinal cord'
0.966542750929368
'Spinal Cord Morphology in Degenerative Myelopathy Patients; Assessing Key Morphological Characteristics Using Machine Vision Tools' =>
'Spinal Cord Morphology in Degenerative Cervical Myelopathy Patients; Assessing Key Morphological Characteristics Using Machine Vision Tools'
0.964824120603015
'PropSeg: automatic spinal cord segmentation method for MR images using propagated deformation models' =>
'PropSeg: automatic spinal cord segmentation method for MR images using propagated deformable models'
0.9608938547486033
'Atlas-based Quantification of DTI measures in Typically Developing Pediatric Spinal Cord' =>
'Atlas-Based Quantification of DTI Measures in a Typically Developing Pediatric Spinal Cord.'
0.9285714285714286
'TouchMe - Distribution and Management of Medical Videos in Clinical Routine' =>
'Distribution and Management of Medical Videos in Clinical Routine'
0.9117647058823529
'Diffusion MRI reveals tract-specific microstructural correlates of electrophysiological impairments in non-myelopathic and myelopathic spinal cord compression' =>
'Diffusion magnetic resonance imaging reveals tract-specific microstructural correlates of electrophysiological impairments in non-myelopathic and myelopathic spinal cord compression.'
0.9101796407185628
'Quantitative 7-Tesla imaging of cortical myelin changes in early multiple sclerosis' =>
'Quantitative 7-Tesla Imaging of Cortical Myelin Changes in Early Multiple Sclerosis.'
A unmatched: '7T MRI of cortical and spinal cord pathology in MS'
A unmatched: '7T MRI of the cerebral cortex'
A unmatched: '7T MRI of the healthy and pathological cerebral cortex'
A unmatched: '7T MRI of the spinal cord'
A unmatched: 'A Robust Methodology for T1 Mapping'
A unmatched: 'A comprehensive structural characterization of the Sapap3 knockout mouse for repetitive behaviours'
A unmatched: 'AI Helps Doctors Detect MS In the Spinal Cord'
A unmatched: 'Accelerated Diffusion Spectrum Imaging with Compressed Sensing using Adaptive Dictionaries'
A unmatched: 'Acquisition and image processing methods'
A unmatched: 'Advanced Techniques in Imaging Specific to Degenerative Myelopathy'
A unmatched: 'Advanced Techniques in Imaging specific to degenerative myelopathy'
A unmatched: 'Advanced Tools for Spinal Cord Imaging in MS'
A unmatched: 'Advanced image approaches to assess the spinal cord'
A unmatched: 'Advanced spinal cord imaging'
A unmatched: 'Advances in acquisition and analysis of neuro MRI: Quantifying microstructure in the spinal cord'
A unmatched: 'Advances in acquisition and analysis of neuro MRI: Special focus to quantify microstructure in the spinal cord'
A unmatched: 'Application of the general linear model to hemodynamic response estimation in diffuse optical imaging'
A unmatched: 'Artificial Intelligence for Multiple Sclerosis'
A unmatched: 'Association between cortical demyelination and structural connectomics in early multiple sclerosis'
A unmatched: 'Automatic segmentation of spinal multiple sclerosis lesions: How to generalize across MRI contrasts?'
A unmatched: 'Bound Pool Fractions Complement Diffusion Measures in Characterizing White Matter Micro and Macrostructure'
A unmatched: 'Challenges & Solutions for spinal cord imaging'
A unmatched: 'Challenges of spinal cord imaging, and superior analysis techniques using the Spinal Cord Toolbox.'
A unmatched: 'Coil arrays'
A unmatched: 'Comparison of DTI and Q-Ball imaging metrics in a cat model of spinal cord injury'
A unmatched: 'Connecting MRI physics and A.I. to advance neuroimaging'
A unmatched: 'Connecting MRI physics and A.I. to advance neuroimaging'
A unmatched: 'Connecting MRI physics and A.I. to advance neuroimaging'
A unmatched: 'Connecting physics and deep learning to generalize medical image analysis tasks'
A unmatched: 'Cortical surface and depth analysis of T2* in the human brain'
A unmatched: 'DW-MRI and fMRI of the spinal cord'
A unmatched: 'Deep Active Learning for Myelin Segmentation on Histology Data'
A unmatched: 'Detection of multiple pathways in the spinal cord white matter using q-ball imaging'
A unmatched: 'Diffusion & functional MRI of the spinal cord'
A unmatched: 'Diffusion MRI of the spinal cord '
A unmatched: 'Diffusion Tensor Imaging and tractography of the spinal cord in animals and humans'
A unmatched: 'Diffusion Tensor Imaging: Principles and Applications'
A unmatched: 'Diffusion and functional MRI of the spinal cord'
A unmatched: 'Diffusion-weighted imaging'
A unmatched: 'Early Detection of Neurological Disorders with Imaging Biomarkers'
A unmatched: 'Effectiveness of regional diffusion MRI measures in distinguishing multiple sclerosis abnormalities within the cervical spinal cord'
A unmatched: 'Evaluation of distortion correction methods in diffusion MRI of the spinal cord'
A unmatched: 'FMRI and DTI of the spinal cord: Methodological issues'
A unmatched: 'FMRI of the spinal cord'
A unmatched: 'Frontier Neuroscientific applications of spinal MRI'
A unmatched: 'Functional Magnetic Resonance Imaging'
A unmatched: 'Giant leaps forward in spinal cord imaging'
A unmatched: 'High resolution diffusion MRI'
A unmatched: 'High-Resolution DWI in Brain and Spinal Cord with syngo RESOLVE'
A unmatched: 'High-resolution spinal cord fMRI & DWI'
A unmatched: 'High-resolution spinal cord fMRI & DWI'
A unmatched: 'How to correct susceptibility artifacts in fMRI and DTI?'
A unmatched: 'How to detect BOLD responses in spinal cord fMRI? Acquisition, pre-processing and statistical issues'
A unmatched: 'How to fix magnetic field inhomogeneities in MRI?'
A unmatched: 'IRM multi-paramétrique du système nerveux central.'
A unmatched: 'IRM quantitative cérébrospinale à 3T et 7T'
A unmatched: 'Imaging Spinal Cord Microstructure with MRI'
A unmatched: 'Imaging spinal cord injury and white matter damage'
A unmatched: 'Impact of realignment on spinal functional MRI time series'
A unmatched: 'Improving HARDI Acquisition'
A unmatched: 'Improving the Accuracy of Cross-relaxation Imaging'
A unmatched: 'In vivo histology of the human spinal cord using MRI'
A unmatched: 'In vivo histology of the spinal cord with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI'
A unmatched: 'In vivo histology with MRI and validation techniques: Application to create large-scale atlases of microstructure in the central nervous system'
A unmatched: 'In vivo histology with MRI: Special focus on the spinal cord'
A unmatched: 'In vivo histology with ultra-high field MRI'
A unmatched: 'Intracortical laminar pathology in the motor cortex is associated with proximal underlying white matter injury in multiple sclerosis: a multimodal 7 T and 3 T MRI study'
A unmatched: 'Intracortical laminar pathology in the motor cortex is associated with tractographically connected white matter injury in multiple sclerosis: a multimodal 7T and 3T MRI study'
A unmatched: 'Knowledge modeling in image guided neurosurgery: application in understanding intra-operative brain shift'
A unmatched: 'Large-scale atlases of microstructure in the central nervous system: methodology and application to the spinal cord'
A unmatched: 'Les capacités d’apprentissage du système nerveux'
A unmatched: 'MRI biomarkers for the spinal cord'
A unmatched: 'MRI coil apparatus and method'
A unmatched: 'MRI of the spinal cord: from white matter organization to neuronal activity'
A unmatched: 'MS: Distribution of Cervical Spine Lesion and Clinical Status'
A unmatched: 'Magnetic Resonance Imaging of the Injured Spinal Cord: The Present and the Future'
A unmatched: 'Measuring within the Voxel: Brain Tissue Volume in Individual Subjects.'
A unmatched: 'Methodology for MR diffusion tensor imaging of the cat spinal cord'
A unmatched: 'Modélisation des connaissances en neurochirurgie guidée par l’image : application à l’étude des déformations anatomiques intra-opératoires'
A unmatched: 'Multi-Parametric Cervical Spinal Cord MRI Provides an Accurate Diagnostic Tool for Detecting Clinical Myelopathy'
A unmatched: 'Multi-parametric MRI of the spinal cord'
A unmatched: 'Multi-parametric MRI of the spinal cord'
A unmatched: 'Multi-parametric MRI of the spinal cord'
A unmatched: 'Multiclass Spinal Cord Tumor Segmentation on MRI with Deep Learning'
A unmatched: 'Neuroimaging and AI: What do we need, what is out there, how can we do better'
A unmatched: 'Neuroimaging and AI: What do we need, what is out there, how can we do better'
A unmatched: 'Neuroimaging and AI: What do we need, what is out there, how can we do better'
A unmatched: 'New Imaging Techniques for Spinal Cord Microstructure'
A unmatched: 'New Imaging Techniques for Spine and Plexus'
A unmatched: 'New advances in DTI of the spinal cord'
A unmatched: 'On the Accuracy of T1 Mapping: Searching for Common Ground.'
A unmatched: 'Overview of myelin mapping and validation techniques'
A unmatched: 'Platforms, neuroinformatics and data solutions'
A unmatched: 'Practical Clinical Applications of MR relaxometry'
A unmatched: 'Q-ball imaging of the spinal cord'
A unmatched: 'Real time shimming with hybrid AC/DC coil technology'
A unmatched: 'Reproducibility and Evolution of Diffusion MRI Measurements within the Cervical Spinal Cord in Multiple Sclerosis'
A unmatched: 'Sex differences in corpus callosum fractional anisotropy in schizophrenia patients: A pilot tractography study using Diffusion Tensor Imaging'
A unmatched: 'Spinal Cord Imaging: Diffusion & Ultra-High Field'
A unmatched: 'Spinal Cord Toolbox for MRI: application to spinal cord injury'
A unmatched: 'Spinal Cord Toolbox. '
A unmatched: 'Spinal MR: what multiparametric MR can add'
A unmatched: 'Spinal cord fMRI'
A unmatched: 'Spinal cord imaging for ALS, MS, and spinal cord injury'
A unmatched: 'Spinal cord imaging: some investigations'
A unmatched: 'Spine intervertebral disc labeling using a fully convolutional redundant counting model'
A unmatched: 'Standardization of acquisition and data processing in spinal cord MRI: Application in degenerative cervical myelopathy.'
A unmatched: 'Standardizing acquisition and processing of spinal cord MRI data'
A unmatched: 'Standardizing acquisition and processing of spinal cord MRI data'
A unmatched: 'Standardizing acquisition and processing of spinal cord MRI data'
A unmatched: 'Standardizing acquisition and processing of \u2028spinal cord quantitative MRI data'
A unmatched: 'State of the art for spinal cord images acquisition and processing'
A unmatched: 'Steady-state MRI: Methods for Neuroimaging.'
A unmatched: 'Straightening the spinal cord using fiber tractography'
A unmatched: 'Susceptibility artifacts in DTI of the spinal cord'
A unmatched: 'T2* mapping and B0 orientation-dependence of the in vivo human cortex at 7T'
A unmatched: 'Template-based analysis of multi-parametric MRI data using the Spinal Cord Toolbox'
A unmatched: 'The Role of Diffusion Tensor Imaging in Assessment of the Spinal Cord Injury'
A unmatched: 'Translating State-Of-The-Art Spinal Cord MRI Techniques To Clinical Use: A Systematic Review Of Clinical Studies Utilizing DTI, MT, MWF, MRS, and fMRI'
A unmatched: 'Validation of Microstructural Modeling in Spinal Cord Imaging'
A unmatched: 'Validation of a cord atrophy measurements method in motor neuron diseases'
A unmatched: 'Vertebral labeling on MRI using deep learning techniques'
A unmatched: 'Visualizing Spinal Cord Damage using MRI'
A unmatched: 'What are strong magnets and strong gradients good for?'
A unmatched: 'What are strong magnets and strong gradients good for?'
A unmatched: "Why isn't my 7T MRI showing 7T everywhere and why is that a problem? Real-time shimming with hybrid AC/DC coil technology"
A unmatched: 'ivadomed: A Medical Imaging Deep Learning Toolbox'
B unmatched: "A Cross-Sectional Study on the Impact of Arterial Stiffness on the Corpus Callosum, a Key White Matter Tract Implicated in Alzheimer's Disease"
B unmatched: 'H, Descoteaux M, Deriche R, Benali H, Rossignol S. Comparison of DTI and Q-Ball imaging metrics in a cat model of spinal cord injury.'
B unmatched: 'Injury volume extracted from MRI predicts neurologic outcome in acute spinal cord injury: A prospective TRACK-SCI pilot study'
B unmatched: 'Quantitative MRI of the Spinal Cord'
B unmatched: "Quel est le seuil de risque de la rigidité artérielle associé à l'intégrité de la substance blanche du cerveau des personnes âgées?"
B unmatched: 'Tract-specific diffusion MRI relates to the predictors of myelopathy in degenerative cervical spinal cord compression'
344 137 6
It's already pretty accurate. It only missed 6, and of those I'm pretty sure 2 are genuine misses (and can be found by turning the threshold down from .9 to .7) and the rest are genuine mismatches.
When I work in the idea of a multi-field score() I'm sure it will be even more reliable; maybe not any more accurate though, since I think it's already at 100% accuracy for this dataset.
I went down a rabbit hole and want to pin this idea in case someone wants to take it up later: optimal fuzzy matches between the gsheet and CCV databases (or potentially any other database formats we support in the future).
find_matching_ref() does a basic JOIN then uses a simple heuristic to handle rare conflicts: https://github.com/jcohenadad/bibeasy/blob/637635b5e14ccc30bf71a9af491a1e8484393864/bibeasy/utils.py#L163-L177
replace_ref_in_text() goes its own way and uses isin() to give a degree of toleration: https://github.com/jcohenadad/bibeasy/blob/637635b5e14ccc30bf71a9af491a1e8484393864/bibeasy/utils.py#L403-L405
Both work well enough because the data is fairly accurate, and find_matching_ref warns about any data it doesn't understand so they can be manually corrected. As the database grows and the time to maintain it shortens, though, this gets harder.
There's a generic solution to this in the literature: use difflib.SequenceMatcher.ratio() (or another kind of fuzzy matcher; there are other metrics available in fuzzywuzzy) to get a score for every pair of titles and pick the best. For example, you could directly replace the line in find_matching_ref with:
which would then be able to tolerate misspellings, changes in capitalization, abbreviations, and punctuation variations. The current code will warn that all of these situations are "MISSED".
We can even match on multiple fields using this trick by averaging their scores:
sum(ratio(df_ccv[row,field], row[field]) for field in fields) / len(fields)
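To make the averaging concrete, here is a minimal sketch that scores two records field by field. It assumes records are plain dicts rather than DataFrame rows; record_score and the weights mapping are hypothetical names introduced for illustration. Equal weights reproduce the plain average above; unequal weights give a weighted average.

```python
import difflib


def ratio(a, b):
    """Case-insensitive similarity in [0.0, 1.0], with a fast path for exact matches."""
    a, b = str(a).lower(), str(b).lower()
    if a == b:
        return 1.0
    return difflib.SequenceMatcher(None, a, b).ratio()


def record_score(rec_a, rec_b, weights):
    """Weighted average of per-field ratios; `weights` maps field name -> weight."""
    total = sum(weights.values())
    weighted = sum(
        weight * ratio(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in weights.items()
    )
    return weighted / total


if __name__ == "__main__":
    a = {"Title": "Coil arrays", "Year": 2020}
    b = {"Title": "Coil Arrays.", "Year": 2020}
    print(record_score(a, b, {"Title": 1.0, "Year": 0.75}))
```

Note that a field missing from both records compares as an exact match in this sketch; whether that is desirable depends on the data.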
(You could also play with using a weighted average, so that, say, the Publisher is worth half as much as the Year, which is three-quarters as much as the Title. But to get that feature you need to rewrite difflib.get_close_matches().)
And we can do even better than that. So far what I've described allows for collisions; they are rare, but they can happen if the same article was published in multiple venues with small title variations. Optimally, we want a best guess for the complete database of what goes to what, with no redundant matchings. This is the Assignment Problem, and there's a solution available as scipy.optimize.linear_sum_assignment().
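As a rough illustration of the assignment idea, the sketch below scores every pair of titles, lets scipy.optimize.linear_sum_assignment() pick a one-to-one matching that maximizes the total score, and then drops pairs below a threshold. The function name assign_titles and the 0.9 default are assumptions made for this example, not the prototype's actual code. It also assumes both lists are non-empty and a SciPy version whose linear_sum_assignment() accepts maximize=True.

```python
import difflib

import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_titles(titles_a, titles_b, threshold=0.9):
    """Return (index_a, index_b, score) triples for a one-to-one title matching."""
    # Score matrix: rows are titles_a, columns are titles_b, compared case-insensitively.
    scores = np.array([
        [difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() for b in titles_b]
        for a in titles_a
    ])

    # Solve the assignment problem; no row or column is used more than once.
    rows, cols = linear_sum_assignment(scores, maximize=True)

    # The solver pairs up as many rows as it can, so weak pairs are filtered afterwards.
    return [
        (int(i), int(j), float(scores[i, j]))
        for i, j in zip(rows, cols)
        if scores[i, j] >= threshold
    ]


if __name__ == "__main__":
    a = ["Coil arrays", "Spinal cord fMRI"]
    b = ["Spinal cord fMRI.", "Unrelated talk", "coil arrays"]
    for i, j, score in assign_titles(a, b):
        print(repr(a[i]), "=>", repr(b[j]), round(score, 4))
```

Filtering after solving keeps the solver input simple; clamping low scores to 0.0 before solving is another option and may change which pairs are chosen.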