internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.11k stars 1.34k forks source link

Inconsistent title processing by `catalog.merge.merge_marc.build_titles()` #2410

Closed hornc closed 9 months ago

hornc commented 5 years ago

openlibrary.catalog.merge.merge_marc.build_titles() inconsistent results:

The final period reduces the number of results and combinations. Expected result: the period should be stripped, and the title combinations should be effectively the same for matching purposes.

Also, the duplicate titles in B (version without final period) should be removed.

    a = 'A test full title : subtitle (parens).'
    b = 'A test full title : subtitle (parens)'
A: 
build_titles(a)
{
'normalized_title': 'a test full title subtitle (parens)', 
'full_title': 'A test full title : subtitle (parens).', 

'titles': [
'A test full title : subtitle (parens).',
'a test full title subtitle (parens)',
'test full title : subtitle (parens).',
'test full title subtitle (parens)']

, 'short_title': 'a test full title subtitl'}

B: 
build_titles(b)
{'normalized_title': 'a test full title subtitle (parens)', 'full_title': 'A test full title : subtitle (parens)', 

'titles': [
'A test full title : subtitle (parens)', 
'a test full title subtitle (parens)', 
'test full title : subtitle (parens)', 
'test full title subtitle (parens)', 
'A test full title : subtitle', 
'a test full title subtitle', 
'a test full title subtitle', 
'a test full title subtitle', 
'test full title : subtitle', 
'test full title subtitle', 
'test full title subtitle', 
'test full title subtitle'

], 'short_title': 'a test full title subtitl'}
xayhewalo commented 4 years ago

Assigning @hornc per slack discussion since MARC issues are his perview

hornc commented 1 year ago

Stalled on this because it's not clear what the impact is, if any. OL has bigger import problems, this was just something I noticed that seemed inconsistent while reviewing code.

I'm not sure if this is likely to falsely determine two titles are not the same (a missed match), of if it's simply processing duplicate titles in a way that is inefficient. Maybe it's a sign there are duplicate methods trying to do the same thing in slightly different ways?

First step: write some tests and review the expected results.

The final period reduces the number of results and combinations. Expected result: the period should be stripped, and the title combinations should be effectively the same for matching purposes.

that I wrote in the report sounds pretty straightforward as an expectation...

tfmorris commented 1 year ago

The associated PR says that this will be used with non-MARC records which is a significant issue because there's really no telling how a title from Amazon (or god forbid BWB) will be formatted whereas metadata from a MARC record has a high probability of having the various title, subtitle, series, etc elements correctly identified. An Amazon record could be : : <subtitle> or almost any other combination.</p> </div> </div> <div class="page-bar-simple"> </div> <div class="footer"> <ul class="body"> <li>© <script> document.write(new Date().getFullYear()) </script> Githubissues.</li> <li>Githubissues is a development platform for aggregating issues.</li> </ul> </div> <script src="https://cdn.jsdelivr.net/npm/jquery@3.5.1/dist/jquery.min.js"></script> <script src="/githubissues/assets/js.js"></script> <script src="/githubissues/assets/markdown.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/highlight.min.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/languages/go.min.js"></script> <script> hljs.highlightAll(); </script> </body> </html>