Starting page 1 is not always useful for deduplication:
can be the starting page number within an article number. All such articles start with page number 1
can be the starting page of the page range of a book / report. All books / reports start with page 1, so this is not discriminating. See issue #2 for the use of the ending page
BTW: Leaving out the starting page number for books / reports forces higher JWS thresholds, which is good for reports (e.g. Natl.Toxicol.Program.Tech.Rep.Ser.).
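As an illustration only, the wiping of starting page 1 could be sketched as below. This assumes the pages field is a free-text string such as "1-12". The function name and the parsing rule are hypothetical and are not taken from the actual code.

```python
import re


def wipe_starting_page_one(pages: str) -> str:
    """Return the starting page as a string, or '' when it is page 1.

    Hypothetical helper: a starting page of 1 is treated as
    non-discriminating and blanked before records are compared.
    """
    # Assumption: the first run of digits is the starting page,
    # e.g. "1-12" -> "1", "e1234" -> "1234".
    match = re.search(r"\d+", pages or "")
    if match is None:
        return ""
    # Drop leading zeros so "01" is handled like "1".
    start = match.group(0).lstrip("0")
    return "" if start == "1" else start
```

With this sketch, a record with pages "1-12" is compared without a starting page, while "123-130" still contributes "123".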
Test
Wiping starting page 1 gives mixed results:
some test files have slightly fewer False Negatives
some test files have slightly more False Positives; only ASySD_Depression has about 50% fewer FPs (15 instead of 32).
Given the awkward format of the ASySD_Depression file and the (slightly) higher FPs in other test files: NOT IMPLEMENTED