Open baggiponte opened 3 years ago
Good ideas, thank you! Could you contrast that with how other citation tools implement the advanced search for citation insertion/paper exploration?
I have not properly understood how the extension works, if it's similar to BetterBibTex (which I don't really know a lot about) and/or if it dumps the citations to a json/sqlite which then is queried. I guess this just sends a request to Zotero via the API?
The entire collection gets downloaded locally and stored in IStateDB and synced when needed (or requested); this is handled by ZoteroClient, which implements the IReferenceProvider interface (so we can have other providers in the future too). Roughly these lines are relevant:
It speaks CSL JSON as defined in https://raw.githubusercontent.com/citation-style-language/schema/master/schemas/input/csl-citation.json and https://raw.githubusercontent.com/citation-style-language/schema/master/schemas/input/csl-data, which means that parsing dates is... challenging. I think there is some normalization to make it more palatable elsewhere in the codebase.
The JSON is then filtered and sorted in various selectors which implement the Selector.IModel interface (the same approach is used for bibliography styles):
O stands for option, M for Match.
The default model currently does simple filtering based on title, year, authors, and sorting based on the three + number of citations in the current document to break ties:
This needs writing some unit tests.
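For illustration, the filtering and sorting described above can be sketched as follows. This is not the extension's actual code (which is TypeScript and is not reproduced in this thread); the function names, the substring matching, and the `citation_counts` mapping are assumptions made for the example. Only the CSL-JSON field names (`title`, `author`, `issued`, `date-parts`) come from the schema.

```python
from typing import Any


def year_of(item: dict[str, Any]) -> str:
    """Return the first part of the `issued` date as a string, or '' if absent."""
    parts = item.get("issued", {}).get("date-parts") or [[]]
    return str(parts[0][0]) if parts[0] else ""


def score(item: dict[str, Any], query: str) -> int:
    """Count how many of title, year and authors contain the query text."""
    q = query.lower()
    authors = " ".join(
        f"{a.get('given', '')} {a.get('family', '')}" for a in item.get("author", [])
    )
    fields = [item.get("title", ""), year_of(item), authors]
    return sum(q in field.lower() for field in fields)


def select(
    items: list[dict[str, Any]], query: str, citation_counts: dict[str, int]
) -> list[dict[str, Any]]:
    """Keep matching items; best score first, ties broken by citation count."""
    scored = [(score(item, query), item) for item in items]
    matches = [(s, item) for s, item in scored if s > 0]
    matches.sort(key=lambda m: (-m[0], -citation_counts.get(m[1]["id"], 0)))
    return [item for _, item in matches]
```

Here `citation_counts` stands in for the number of times each record is already cited in the current document, keyed by the record's `id`.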
Hi Mike, sorry for the late reply - I will investigate the other reference managers after the 15th (cob). Thank you for the explanation - this is really fascinating! Is the IStateDB a database format for Jupyter Notebooks? Perhaps @retorquere can tell us something more about betterbibtex?
Also, could the references inserted into a notebook be dumped to a .bib file? This would make the citation manager perfect to use in combination with jupyter book!
I'm pretty sure I can, what do you want to know?
Hi @retorquere, thank you for the prompt answer! Full disclosure: I am a newbie in reference managers - I use this jupyterlab-citation-manager, which currently supports Zotero, and some time ago I played a bit with {rbbt}.
Here's the thing: jupyterlab-citation-manager is sick, but as of now we can only look up references by their title and with fuzzy search. I do not recall if with bbt you can also look up by other tags, say author, and I was wondering how/if you implemented that. Also, @krassowski has underlined how non-trivial it might be to parse dates with CSL JSON: did you have to deal with it?
As a side note, I was also curious to know how you store data: as @krassowski explained above, jupyterlab-citation-manager dumps the whole Zotero collection to a local IStateDB. I ask this because of another, unrelated thing - which deserves another issue/feature request on its own, but I wanted to wait before opening another one. I was wondering if bbt supported the option of dumping all citations in a local file, like references.bib. The jupyter-book project supports building a bibliography from a local .bib file and jupyterlab-citation-manager could go hand in hand with it; however, as far as I understand, the bibliography can only be appended at the end of a Notebook.
Thank you!
EDIT: there's also this interesting thread on gitter and should/might be related with #15 and #8 I guess
Here's the thing: jupyterlab-citation-manager is sick, but as of now we can only look up references by their title and with fuzzy search. I do not recall if with bbt you can also lookup by other tags, say author, and I was wondering how/if you implemented that.
It's probably using BBTs JSON-RPC search endpoint, and that passes the work to Zotero quicksearch, which should search on all fields & tags. I'm not sure what differentiates search on "all fields and tags" and "everything" on Zotero, but I'd guess that "everything" includes attachment content.
Also, @krassowski has underlined how non-trivial it might be to parse dates with CSL JSON: did you have to deal with it?
I don't do CSL-JSON date parsing, but I do produce CSL-JSON dates, and they appear to me to be very well-defined structured objects - there really isn't anything to parse in CSL dates AFAICT. Do you have a sample of a hard-to-parse CSL date, @krassowski?
Parsing free-form dates into CSL is another matter. The BBT date parser is a few hundred lines of code on top of two pretty large EDTF-parsing libraries.
As a side note, I was also curious to know how you store data: as @krassowski explained above, jupyterlab-citation-manager dumps the whole Zotero collection to a local IStateDB.
I have an sqlite db in the Zotero data directory for most BBT data, and a bunch of JSON files for the caches. These can only be read when Zotero is not running: Zotero locks its sqlite databases while it is running, and BBT reads and then deletes the caches, so that an error that prevents saving the cache cannot lead to stale caches being read on the next startup. It's better to start with an empty cache (which is a self-repairing situation) than with a stale cache. The caches are written back out when Zotero shuts down.
I ask this because of another, unrelated thing - which deserves another issue/feature request on its own, but I wanted to wait before opening another one. I was wondering if bbt supported the option of dumping all citations in a local file, like references.bib.
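Several ways in fact:
- You can of course manually export to bibtex
- You can set up an auto-export which will keep the exported bib file in sync with the source library/collection you used to create it
- You can download a library/collection from a web endpoint the BBT makes available ("pull export")
- There is a JSON-RPC endpoint that a program can call to do one-off exports or set up an autoexport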
EDIT: there's also this interesting thread on gitter and should/might be related with #15 and #8 I guess
I don't know what the topic under discussion is there.
Thank you for your prompt and exhaustive reply!
I have an sqlite db in the Zotero data directory for most BBT data, and a bunch of JSON files for the caches. These can only be read when Zotero is not running: Zotero locks its sqlite databases while it is running, and BBT reads and then deletes the caches, so that an error that prevents saving the cache cannot lead to stale caches being read on the next startup. It's better to start with an empty cache (which is a self-repairing situation) than with a stale cache. The caches are written back out when Zotero shuts down.
That I remember, I had a look inside my ~/Zotero and found the sqlite files. I guess the choice to use IStateDB depends on Jupyter.
Several ways in fact:
- You can of course manually export to bibtex
From Zotero, right?
- You can set up an auto-export which will keep the exported bib file in sync with the source library/collection you used to create it
- You can download a library/collection from a web endpoint the BBT makes available ("pull export")
- There is a JSON-RPC endpoint that a program can call to do one-off exports or set up an autoexport
Do these pull-export the whole collection or just the files inside the article/publication?
I don't know what the topic under discussion is there.
Oops, sorry: this is more related to exporting to MyST markdown formats. I am drifting off-topic, I guess I should move this discussion somewhere else. In the meantime, thank you for the answers, let's wait and see if Mike has something to comment on.
I don't do CSL-JSON date parsing, but I do produce CSL-JSON dates, and they appear to me to be very well-defined structured objects - there really isn't anything to parse in CSL dates AFAICT. Do you have a sample of a hard-to-parse CSL date, @krassowski?
Parsing free-form dates into CSL is another matter. The BBT date parser is a few hundred lines of code on top of two pretty large EDTF-parsing libraries.
EDTF strings are valid entries of date variables in CSL-JSON schema as is the "structured" form which may take anything between one and three parts which may be strings or numbers and for which the meaning is not very well documented; then you have the extra fields like circa, season, etc and multiple date fields (some records publication date, creation date, etc.; one of those is mandated by CSL if I recall correctly but the thing is it often only contains the year part and all the details are in the other fields and how they are populated appears random to me after looking at a large sample of records from Zotero).
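To make the variety concrete, here is a minimal sketch of a normalizer that pulls just the year out of the date shapes mentioned above. It is an illustration, not the extension's or BBT's parser: the function name `csl_year` and the "first four-digit run" fallback are assumptions, and a real EDTF string should go through a proper EDTF library. The keys `date-parts`, `raw` and `literal` are taken from the CSL-JSON schema.

```python
import re
from typing import Any, Optional

# A (possibly negative) four-digit year at the start of an EDTF-like string.
LEADING_YEAR = re.compile(r"-?\d{4}")


def csl_year(date: Any) -> Optional[int]:
    """Best-effort year extraction from a CSL-JSON date variable."""
    if isinstance(date, str):
        # String form: only the leading year is read; the rest is ignored.
        match = LEADING_YEAR.match(date.strip())
        return int(match.group()) if match else None
    if not isinstance(date, dict):
        return None
    # Structured form: parts may be numbers or strings.
    parts = date.get("date-parts")
    if parts and parts[0]:
        try:
            return int(parts[0][0])
        except (TypeError, ValueError):
            pass  # fall through to the free-text fields
    # Free-text fallbacks: take the first four-digit run, if any.
    for key in ("raw", "literal"):
        text = date.get(key)
        if isinstance(text, str):
            match = re.search(r"\d{4}", text)
            if match:
                return int(match.group())
    return None
```

A record whose date carries only `season` or `circa` yields `None` here, which is one reason a sort key based on the year alone can look arbitrary for such records.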
The jupyter-book projects supports building a bibliography from a local .bib file and jupyterlab-citation-manager could go hand in hand with it; however, as far as I understand, the bibliography can only be appended at the end of a Notebook.
This is tracked in the other issues you mentioned - let's keep this issue focused on the search capabilities ;)
Thank you for the explanation - this is really fascinating! Is the IStateDB a database format for Jupyter Notebooks?
No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.
It's probably using BBTs JSON-RPC search endpoint, and that passes the work to Zotero quicksearch, which should search on all fields & tags. I'm not sure what differentiates search on "all fields and tags" and "everything" on Zotero, but I'd guess that "everything" includes attachment content.
I can't figure out the UI for this: do you just write something like author:'Author1 Author2' and then Zotero quick search searches inside of tags? What about BBT, does this translate into a SELECT * WHERE author IN ('author1', 'author2')?
No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.
But you basically query from it, right? So entries have fields and the matter here is just to find a UI query-style consistent with what other citation managers offer, am I correct?
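Thank you @retorquere for your generous advice and explaining how BBT works. From that, I gather it relies on a local Zotero installation and access to its local API to pass on the search tasks; @baggiponte, as also discussed in #37, this is really not a feasible path for this extension for several reasons:
* JupyterHub and other setups will often have no access to the local Zotero installation as it is on a different computer
* even on the same computer Zotero does not allow to access its API from the browser easily
* [...] localhost or say hpc.myuni.ac.uk so we cannot even limit the access request to a single domain!)
* touching the Zotero database on disk directly is not a maintainable implementation for me (it is not a public API in the first place AFAIK) and this extension should not do that so we can in the future install it as an isolated package (say flatpak) with restricted or no access to the disk (which should really be the norm now)
* [...] IReferenceProvider interface (but it will only work for a subset of users and likely confuse newcomers)
* [...] ISearchProvider interface to allow using the Zotero Web API to search references but this should be optional and the core functionality has to operate on standard CSL-JSON records directly
No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.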
But you basically query from it, right? So entries have fields and the matter here is just to find a UI query-style consistent with what other citation managers offer, am I correct?
Yes, the entries follow the csl-citation.json schema (which should be read together with the csl-data.json schema).
EDTF strings are valid entries of date variables in CSL-JSON schema
Oh yeah that's complex so I don't bother doing it myself, I outsource that to a library.
as is the "structured" form which may take anything between one and three parts which may be strings or numbers and for which the meaning is not very well documented; then you have the extra fields like circa, season, etc and multiple date fields (some records publication date, creation date, etc.;
I find these not too hard to process, but TBH I don't support all possible combinations. What I can sensibly output is constrained by the target format (bibtex and biblatex) and since biblatex supports edtf, I just forward whatever is deemed (by said library) to be valid EDTF.
one of those is mandated by CSL if I recall correctly but the thing is it often only contains the year part and all the details are in the other fields and how they are populated appears random to me after looking at a large sample of records from Zotero).
I don't use the Zotero date parser; BBT's date parser differs significantly from Zotero's.
- You can of course manually export to bibtex
From Zotero, right?
Correct. BBT is only available in the Zotero client.
Do these pull-export the whole collection or just the files inside the article/publication?
With files you mean attachments? There's not yet a JSON-RPC endpoint for that. You can pull down bibtex or biblatex from the endpoint.
I can't figure out the UI for this: do you just write something like author:'Author1 Author2' and then Zotero quick search searches inside of tags? What about BBT, does this translate in a SELECT * WHERE author IN ('author1', 'author2')?
I don't translate it at all; I just pass the text on to the same code that handles the quick search above the item list in Zotero, and return the results.
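For readers unfamiliar with JSON-RPC, a request that forwards the search text untouched could be built roughly as in the sketch below. Only the JSON-RPC 2.0 envelope is standard; the method name `item.search` and the helper `build_search_request` are assumptions for illustration and should be checked against the BBT documentation. The sketch only builds the request body and does not contact Zotero.

```python
import json


def build_search_request(terms: str, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 call that passes the search text through as-is."""
    payload = {
        "jsonrpc": "2.0",
        "method": "item.search",  # assumed method name, not confirmed in this thread
        "params": [terms],        # the raw quick-search text, not parsed or translated
        "id": request_id,
    }
    return json.dumps(payload)
```

Because the text is handed over verbatim, any field syntax a user types is interpreted (or ignored) by Zotero's quick search rather than by the caller.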
Thank you @retorquere for your generous advice and explaining how BBT works. From that, I gather it relies on a local Zotero installation and access to its local API to pass on the search tasks;
correct
* JupyterHub and other setups will often have no access to the local Zotero installation as it is on a different computer
There's ways around that, but they're not convenient. I have a branch where I work on a BBT that doesn't need the client, but I have no ETA on that beyond "not soon".
* even on the same computer Zotero does not allow to access its API from the browser easily
correct.
* touching the Zotero database on disk directly is not a maintainable implementation for me (it is not a public API in the first place AFAIK)
Only pain lies that way. You most certainly never want to write to the DB directly.
and this extension should not do that so we can in the future install it as an isolated package (say flatpak) with restricted or no access to the disk (which should really be the norm now)
Not a great fan of flatpak et al., but I see the appeal.
The extension only allows (fuzzy) searching by the title of the pages; it would be interesting to see tag/field/attribute search, like in Gmail or GitHub issues:
- date:YYYY-MM-DD to filter papers of a certain period
- author:Surname,Name
And so on. Also, it would be interesting to have a way to disable fuzzy searching (say, by typing \ at the start of a line).
I don't know how difficult this would be to implement, as I have not properly understood how the extension works, if it's similar to BetterBibTex (which I don't really know a lot about) and/or if it dumps the citations to a json/sqlite which then is queried. I guess this just sends a request to Zotero via the API?
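As a rough idea of what the proposed syntax involves, a query string could be split into field filters and free text along these lines. This is only a sketch of the suggestion, not existing extension code: the `Query` class, `parse_query`, and the set of recognised field names are all made up for the example.

```python
from dataclasses import dataclass, field

# Field names recognised as filters; anything else is treated as free text.
KNOWN_FIELDS = {"author", "date", "title"}


@dataclass
class Query:
    fuzzy: bool = True                                     # False when the query starts with "\"
    filters: dict[str, str] = field(default_factory=dict)  # e.g. {"author": "Surname,Name"}
    text: str = ""                                         # remaining free-text terms


def parse_query(raw: str) -> Query:
    """Split a search string into `field:value` filters and free text."""
    query = Query()
    raw = raw.strip()
    if raw.startswith("\\"):
        # A leading backslash switches fuzzy matching off.
        query.fuzzy = False
        raw = raw[1:]
    free_terms = []
    for token in raw.split():
        name, sep, value = token.partition(":")
        if sep and value and name.lower() in KNOWN_FIELDS:
            query.filters[name.lower()] = value
        else:
            free_terms.append(token)
    query.text = " ".join(free_terms)
    return query
```

For example, `parse_query("\\author:Surname,Name date:2021-05-01 citation manager")` gives exact matching, two filters, and the free text `citation manager`. Values containing spaces would need quoting rules, which this sketch leaves out.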