inspirehep / refextract

Extract bibliographic references from (High-Energy Physics) articles.
GNU General Public License v2.0

extract_references_from_file returns inconsistent data #104

Open Hu1buerger opened 1 year ago

Hu1buerger commented 1 year ago

Given this document Kotti et al. - 2023 - Machine Learning for Software Engineering A Terti.pdf

Expectation

I would expect:

  1. At most one reference is found per line number.
  2. If several reference objects are returned for the same line, they should be equal: for any r1, r2 in refs with the same lineno, r1 == r2 should hold, and in particular r1['title'] == r2['title'].
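The two expectations can be written as a small check over the returned list. This is only a sketch: it assumes each reference is a dict whose 'linemarker' entry is a list with the line marker first (the way the test in this issue reads it). The helper name `conflicting_lines` and the sample data are made up for illustration and are not refextract output.

```python
from collections import defaultdict


def conflicting_lines(refs):
    """Return the line markers whose references disagree (hypothetical helper)."""
    by_line = defaultdict(list)
    for ref in refs:
        # Assumption: 'linemarker' is a list whose first item identifies the line.
        by_line[ref['linemarker'][0]].append(ref)

    # A line conflicts when any reference on it differs from the first one.
    return sorted(
        lineno
        for lineno, group in by_line.items()
        if any(ref != group[0] for ref in group[1:])
    )


# Made-up sample data, not actual refextract output.
sample = [
    {'linemarker': ['1'], 'title': 'A'},
    {'linemarker': ['1'], 'title': 'B'},
    {'linemarker': ['2'], 'title': 'C'},
    {'linemarker': ['2'], 'title': 'C'},
]
print(conflicting_lines(sample))
```

Under expectation 2, the returned list would be empty for any document; under expectation 1, every group would also have length one.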

Actual

The refs found contain multiple contradictory results for the same line number.

E.g. (screenshot: Screenshot 2023-05-10 at 15 06 09)

Replicate me

Install pytest-subtests and call the test with the document attached above.

from refextract import extract_references_from_file


# `subtests` is the fixture from pytest-subtests; `path` must point at the PDF.
def test_reference_consistency(path, subtests):
    """
    Ensure that for each line in the file, there are no inconsistent duplicate references.

    For any two references r1 and r2 in the result with the same line number, r1 == r2 must hold.
    """
    refs = extract_references_from_file(path)

    # Group the references by line number
    lines = {}
    for ref in refs:
        lineno = ref['linemarker'][0]
        if lineno in lines:
            lines[lineno].append(ref)
        else:
            lines[lineno] = [ref]

    # Check for inconsistent duplicate references on each line
    for lineno, line_refs in lines.items():
        if len(line_refs) == 1:
            continue

        with subtests.test('line', lineno=lineno, refs=line_refs):
            # Compare each adjacent pair; equality is transitive, so this
            # covers every pair of references on the line.
            for ref1, ref2 in zip(line_refs, line_refs[1:]):
                assert ref1 == ref2, f"Found inconsistent references: {ref1} and {ref2}"
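When the assertion fires, the full dicts in the message can be hard to compare by eye. A possible follow-up, sketched here under the assumption that references are plain dicts, is to report only the keys whose values differ. The helper name `differing_fields` and the two sample references are hypothetical and not taken from the attached document.

```python
def differing_fields(ref_a, ref_b):
    """Map each key whose values differ to the pair of values (hypothetical helper)."""
    keys = set(ref_a) | set(ref_b)
    return {
        key: (ref_a.get(key), ref_b.get(key))
        for key in sorted(keys)
        # Note: a missing key and an explicit None are treated as equal here.
        if ref_a.get(key) != ref_b.get(key)
    }


# Made-up references sharing a line marker but disagreeing on the title.
ref_a = {'linemarker': ['7'], 'title': 'First title', 'year': '2020'}
ref_b = {'linemarker': ['7'], 'title': 'Second title', 'year': '2020'}
print(differing_fields(ref_a, ref_b))
```

The result could be used in the assertion message in place of the two full reference dicts.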