typst / biblatex

A Rust crate for parsing and writing BibTeX and BibLaTeX files.
Apache License 2.0
120 stars 15 forks source link

Repeated keys in input .bib files #18

Closed ariroffe closed 2 years ago

ariroffe commented 2 years ago

I was running some tests and found the following.

I manually counted that tests/gral.bib has 85 entries. But after running parse the resulting Bibliography has 83 entries. I tracked down 2 entries that have repeated keys in the file, these are:

@inproceedings{kim2009,
  title={Sewersnort: A drifting sensor for in-situ sewer gas monitoring},
  author={Kim, Jihyoung and Lim, Jung Soo and Friedman, Jonathan and Lee, Uichin and Vieira, Luiz and Rosso, Diego and Gerla, Mario and Srivastava, Mani B},
  booktitle={2009 6th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks},
  pages={1--9},
  year={2009},
  organization={IEEE}
}

@INPROCEEDINGS{kim2009,
  author={J. {Kim} and J. S. {Lim} and J. {Friedman} and U. {Lee} and L. {Vieira} and D. {Rosso} and M. {Gerla} and M. B. {Srivastava}},
  booktitle={2009 6th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks},
  title={SewerSnort: A Drifting Sensor for In-situ Sewer Gas Monitoring},
  year={2009},
  volume={},
  number={},
  pages={1-9}
}

and:

@article{ishihara2012,
  title={Active node selection in flowing wireless sensor networks},
  author={Ishihara, Susumu and Sato, Daisuke},
  journal={Proc. 6th ICMU},
  pages={8--15},
  year={2012}
}

@inproceedings{ishihara2012,
  title={Active node selection in flowing wireless sensor networks},
  author={Ishihara, Susumu and Sato, Daisuke},
  booktitle={Proceedings of the 6th International Conference on Mobile Computing and Ubiquitous Networking (IMCU)},
  pages={8--15},
  year={2012}
}

The parse method just keeps the second one. I guess this is because of how the insert method works in a HashMap / BTreeMap.

Is this intended behavior? I don't think parse should assume that the latter one is the correct entry, perhaps the user wishes to keep the first one if asked, or to change one key and keep both. I see two possible solutions:

1- parse errors out in this case

2- keep both entries, but automatically put a (1) after the key of the second one (or (2) if there are already 2 of the same, etc.). This solution would also play nicely with my library, since handling this type of repetitions is precisely what it is designed to do. If we go this route, I have a similar function for doing that (see fn get_new_citation_key in line 173). I could adapt it and make a pull request here.

reknih commented 2 years ago

As per your PR, you chose the first option of returning an error, which might be a sensible pick.

However, that very much interferes with how you try to use biblatex in your project. I'm not a fan of the second option of arbitrarily messing with the user input. The alternative I'd like to propose is this:

parse and from_raw do not return errors for duplicate entries. All well-formed entries in the file will be pushed into the Bibliography::entries vector, but the record in the Bibliography::keys map will always point to the first occurrence. This way, methods like Bibliography::get will always return the first entry but the parsing process will not fail with duplicates. Library consumers can iterate over Bibliography::entries (including duplicates) by calling Bibliography::iter to provide their own duplicate handling.

This way, biblatex makes no assumptions about the consumer's requirements for dupe handling and the like.

To facilitate this usage pattern, I can add into_iter and into_vec methods on Bibliography.

Would these changes solve your issue?

ariroffe commented 2 years ago

I'm actually happy with the changes. I got a bit carried away when I opened the issue, but I don't think biblatex should evolve thinking about how I, in particular, intend to use it.

Cite keys should be unique (according to the bibtex specification). A bib file with a duplicate key is malformed. If we are returning an error and failing to parse when a file is malformed due to an incorrect field, then we should treat other flaws in the same way, keeping the library uniform. Hence, it should error out on duplicate keys. Would you agree? (If so, feel free to close this issue)

P.S. please remember to bump up the version and update in crates.io so that I can use the library with these changes :)