Inconsistent title processing by `catalog.merge.merge_marc.build_titles()`

hornc commented 5 years ago

openlibrary.catalog.merge.merge_marc.build_titles() inconsistent results:

The final period reduces the number of results and combinations. Expected result: the period should be stripped, and the title combinations should be effectively the same for matching purposes.

Also, the duplicate titles in B (version without final period) should be removed.

    a = 'A test full title : subtitle (parens).'
    b = 'A test full title : subtitle (parens)'
A: 
build_titles(a)
{
'normalized_title': 'a test full title subtitle (parens)', 
'full_title': 'A test full title : subtitle (parens).', 

'titles': [
'A test full title : subtitle (parens).',
'a test full title subtitle (parens)',
'test full title : subtitle (parens).',
'test full title subtitle (parens)']

, 'short_title': 'a test full title subtitl'}

B: 
build_titles(b)
{'normalized_title': 'a test full title subtitle (parens)', 'full_title': 'A test full title : subtitle (parens)', 

'titles': [
'A test full title : subtitle (parens)', 
'a test full title subtitle (parens)', 
'test full title : subtitle (parens)', 
'test full title subtitle (parens)', 
'A test full title : subtitle', 
'a test full title subtitle', 
'a test full title subtitle', 
'a test full title subtitle', 
'test full title : subtitle', 
'test full title subtitle', 
'test full title subtitle', 
'test full title subtitle'

], 'short_title': 'a test full title subtitl'}

xayhewalo commented 4 years ago

Assigning @hornc per slack discussion since MARC issues are his perview

hornc commented 1 year ago

Stalled on this because it's not clear what the impact is, if any. OL has bigger import problems, this was just something I noticed that seemed inconsistent while reviewing code.

I'm not sure if this is likely to falsely determine two titles are not the same (a missed match), of if it's simply processing duplicate titles in a way that is inefficient. Maybe it's a sign there are duplicate methods trying to do the same thing in slightly different ways?

First step: write some tests and review the expected results.

The final period reduces the number of results and combinations. Expected result: the period should be stripped, and the title combinations should be effectively the same for matching purposes.

that I wrote in the report sounds pretty straightforward as an expectation...

[ ] formalize that expectation in some test cases and make the code pass

tfmorris commented 1 year ago

The associated PR says that this will be used with non-MARC records which is a significant issue because there's really no telling how a title from Amazon (or god forbid BWB) will be formatted whereas metadata from a MARC record has a high probability of having the various title, subtitle, series, etc elements correctly identified. An Amazon record could be : : <subtitle> or almost any other combination.</p> </div> </div> <div class="page-bar-simple"> </div> <div class="footer"> <ul class="body"> <li>© <script> document.write(new Date().getFullYear()) </script> Githubissues.</li> <li>Githubissues is a development platform for aggregating issues.</li> </ul> </div> <script src="https://cdn.jsdelivr.net/npm/jquery@3.5.1/dist/jquery.min.js"></script> <script src="/githubissues/assets/js.js"></script> <script src="/githubissues/assets/markdown.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/highlight.min.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/languages/go.min.js"></script> <script> hljs.highlightAll(); </script> </body> </html>

internetarchive / openlibrary

Inconsistent title processing by `catalog.merge.merge_marc.build_titles()` #2410