refactor parsing to accommodate single line bib docs

jeanetteclark commented 1 month ago

This is in response to #56 and is a pretty significant refactor in the parsing. I definitely did not think things would go as far as they did, but I needed a solution to my problem so here we are.

I've essentially changed things so that each entry is pushed onto a single line, and then key-value pairs are parsed according to a list of allowed keys (the standard BibTeX fields plus others I've been seeing in the wild). I think overall this approach might be more robust, but I'm really not sure how flexible we should be about the allowed keys. All of the tests pass, at least. I don't expect this PR to be merged as is: at a minimum we should add a message when a non-allowed key is found, reporting which one it is, and in general that section could probably be handled more robustly.
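To make the idea concrete, here is a minimal base-R sketch of the approach described above. It is not the code in this PR: `parse_entry()` and the short `allowed_keys` vector are hypothetical names used only for illustration, and brace handling is deliberately naive.

```r
# Hypothetical illustration only; not the parsing code from this PR.
# Assumed, deliberately short list of allowed keys.
allowed_keys <- c("author", "title", "journal", "year", "doi")

parse_entry <- function(lines, keys = allowed_keys) {
  # Push the whole entry onto a single line and collapse runs of whitespace.
  entry <- gsub("\\s+", " ", paste(lines, collapse = " "), perl = TRUE)
  # Drop the brace that closes the entry itself.
  entry <- sub("\\}\\s*$", "", entry, perl = TRUE)

  # Find every "key =" where key is one of the allowed keys.
  pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\s*=\\s*")
  g <- gregexpr(pattern, entry, ignore.case = TRUE, perl = TRUE)
  m <- g[[1]]
  if (m[1] == -1) return(list())

  starts <- as.integer(m)
  value_starts <- starts + attr(m, "match.length")
  # Each value runs up to the next allowed key, or to the end of the entry.
  # A key that is not allowed is therefore absorbed into the previous value.
  value_ends <- c(starts[-1] - 1L, nchar(entry))

  found <- tolower(sub("\\s*=\\s*$", "", regmatches(entry, g)[[1]], perl = TRUE))
  values <- substring(entry, value_starts, value_ends)
  # Strip trailing commas and spaces, then one pair of outer braces or quotes.
  values <- sub("[,\\s]+$", "", values, perl = TRUE)
  values <- sub("^[{\"](.*)[}\"]$", "\\1", values, perl = TRUE)

  setNames(as.list(values), found)
}

# The same entry parses identically whether it spans one line or several.
one_line <- "@article{smith2020, author = {Smith, J.}, title = {A Title}, year = 2020}"
str(parse_entry(one_line))
```

A sketch like this would misread a value that itself contains text such as `year =`, which is one reason the allowed-key section probably needs more robust handling.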

I opened the PR to see if there was interest in pursuing this change; I will not be offended if it is closed outright. Like I said, I just need to get my project working again, and I could potentially clean this up to merge if there is interest.


giabaio commented 1 month ago

Thank you @jeanetteclark --- I will need to think about it, but certainly won't just close the PR and dismiss it! :wink:

I think perhaps a better way to do this may be to create an ancillary function (say, fix_bib, or something) that takes a non-standard bib file and reformats it in the way you suggest? That way, the user would be encouraged to use a standardised bib file (with entries separated by carriage returns after the customary comma), but we would support a wider set of formats through the utility function... What do you think?
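As a rough illustration of what such a helper could do, the base-R sketch below reflows a single-line entry so that each field sits on its own line. The name `fix_bib_entry()` is hypothetical (no such function is confirmed to exist in bib2df), and the sketch assumes field values are delimited by braces rather than double quotes.

```r
# Hypothetical helper, for illustration only; not part of bib2df.
fix_bib_entry <- function(entry) {
  chars <- strsplit(entry, "")[[1]]
  depth <- 0L                      # current brace nesting level
  out <- character(length(chars))
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "{") depth <- depth + 1L
    if (ch == "}") depth <- depth - 1L
    # A comma at depth 1 separates fields; commas nested deeper
    # (e.g. inside "{Smith, J.}") belong to a value and are kept as-is.
    out[i] <- if (ch == "," && depth == 1L) ",\n" else ch
  }
  # Indent each field consistently after the inserted line breaks.
  gsub("\n\\s*", "\n  ", paste(out, collapse = ""))
}

cat(fix_bib_entry("@article{a, author = {Smith, J.}, title = {Y}}"), "\n")
```

Values wrapped in double quotes that contain commas would be split incorrectly by this sketch, so a real implementation would need to track quotes as well.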

[UPDATE]: after browsing through your code, I think some of it is very elegant and indeed perhaps more efficient than the original one... (I thought that was pretty clever... though I should say I did not write it and only came into this project quite late as a make-shift maintainer, after I too made a change to fix a bug in how it was working for my own project...).

What about allowing the user an option to add fields that then get added to the other_allowed_fields variable? For example, could bib2df have an extra option, say extra_fields, which the user can specify as a vector of strings (the names of the extra fields they want to include)?

Say, I include

`bib2df(..., extra_fields = c("project", "scopus"))`

(assume that some of my bib files have a field project, where I specify the project to which a given paper is related and a field scopus, with the URL of my Scopus page, for CV building, or something...).

Now, if the helpers added these to the variable other_allowed_fields, then all should work OK? I think the issue is making sure that the main function and the helpers communicate, so that other_allowed_fields is updated with the user-selected ones.
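One possible way to wire this up is sketched below in base R. Everything here is an assumption made for illustration: `build_allowed_fields()` is a hypothetical helper, and the contents of `standard_fields` and `other_allowed_fields` are placeholders rather than the lists used in the package. The point is only that the main function computes the full set of keys once and passes it to the helpers explicitly.

```r
# Hypothetical sketch; names and field lists are placeholders, not bib2df internals.
standard_fields <- c("author", "title", "journal", "year")
other_allowed_fields <- c("doi", "url", "abstract")

build_allowed_fields <- function(extra_fields = NULL) {
  if (!is.null(extra_fields) && !is.character(extra_fields)) {
    stop("`extra_fields` must be a character vector of field names.")
  }
  # Merge the user-selected fields with the built-in ones,
  # compare case-insensitively, and drop duplicates.
  unique(tolower(c(standard_fields, other_allowed_fields, extra_fields)))
}

# The main function would build the list once and hand it to each helper,
# so that the helpers never rely on a global that might be out of date.
allowed <- build_allowed_fields(extra_fields = c("project", "scopus"))
print(allowed)
```

Passing the merged vector as an argument, rather than modifying other_allowed_fields in place, keeps the user's choice local to a single call.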

I've created a new branch devel, where I've forked your own PR. If you make further changes, can you push to that branch, so we can test without breaking main?

Thanks!

jeanetteclark commented 1 month ago

Thanks for having a look! Your solution sounds very nice, actually. I think it gives us a nice middle ground between being able to parse documents with strange formatting and allowing any field as long as the user provides it. Thanks for moving us over to develop - very happy to switch to a new branch given the scope of the changes.

ropensci / bib2df

refactor parsing to accommodate single line bib docs #62