Detect duplicate or similar publications in a database. This project aims to reduce the size of the database by listing pairs of suspected duplicates, which makes citation easier and cleaner.
Export the database as a CSV file without a header row, with these fields:
For example, if your exported CSV is named publications.csv, run:
python3 report.py publications.csv
Running it on example_input.csv gives this result:
| score | id-1 | id-2 |
|---------|------------------------|------------------------|
| 100.00% | cross-publisher-2 | cross-publisher-3 |
| 100.00% | cross-publisher-1 | cross-publisher-3 |
| 100.00% | cross-publisher-1 | cross-publisher-2 |
| 100.00% | arXiv-v3 | arXiv-v4 |
| 100.00% | arXiv-v1 | arXiv-v2 |
| 80.00% | arXiv-v2 | arXiv-v4 |
| 80.00% | arXiv-v2 | arXiv-v3 |
| 80.00% | arXiv-v1 | arXiv-v4 |
| 80.00% | arXiv-v1 | arXiv-v3 |
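
To give a feel for how a pairwise report like this can be produced, here is a minimal sketch. It is not the implementation in report.py: the two-column layout (id, title), the `similar_pairs` helper, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, and the sample rows are made up.

```python
import csv
import io
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(rows, threshold=0.8):
    """Return (score, id_a, id_b) for every pair scoring at or above threshold.

    `rows` is a list of (record_id, text) tuples. This layout is hypothetical
    and is not necessarily the field list the project expects.
    """
    results = []
    # Compare every unordered pair of records exactly once.
    for (id_a, text_a), (id_b, text_b) in combinations(rows, 2):
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        score = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
        if score >= threshold:
            results.append((score, id_a, id_b))
    # Highest scores first, matching the ordering of the table above.
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Headerless CSV sample with made-up records (id, title).
sample = io.StringIO(
    "a1,Deep Learning for Citation Matching\n"
    "a2,Deep learning for citation matching\n"
    "b1,A Survey of Graph Databases\n"
)
rows = [(r[0], r[1]) for r in csv.reader(sample)]

for score, id_a, id_b in similar_pairs(rows):
    print(f"| {score:.2%} | {id_a} | {id_b} |")
```

Comparing every pair is quadratic in the number of records, which is fine for a small export but may need blocking or indexing for a large database.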