
krassowski / jupyterlab-citation-manager

Citation Manager for JupyterLab using Zotero Web API

BSD 3-Clause "New" or "Revised" License


Attributes/tags search #36

Open baggiponte opened 3 years ago

baggiponte commented 3 years ago

The extension only allows (fuzzy) searching by title; it would be interesting to see tag/field/attribute search, like in Gmail or GitHub issues:

date:YYYY-MM-DD to filter papers of a certain period
author:Surname,Name

And so on. Also, it could be interesting if there could be a way to disable fuzzy searching (say, by typing \ at the start of a line).
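
A minimal sketch of how such a query could be split into field filters and free text. The `field:value` syntax and the leading `\` switch come from the proposal above; the names `ParsedQuery` and `parseQuery` are hypothetical and not part of the extension, and quoted values containing spaces are not handled.

```typescript
// Hypothetical result of parsing a search box entry.
interface ParsedQuery {
  fuzzy: boolean; // false when the query starts with a backslash
  fields: Record<string, string>; // e.g. { author: "Surname,Name" }
  text: string; // remaining free-text terms
}

function parseQuery(input: string): ParsedQuery {
  let rest = input.trim();
  let fuzzy = true;
  // A leading backslash disables fuzzy matching, as proposed above.
  if (rest.charAt(0) === "\\") {
    fuzzy = false;
    rest = rest.slice(1).trim();
  }
  const fields: Record<string, string> = {};
  const words: string[] = [];
  // Tokens are split on whitespace only.
  for (const token of rest.split(/\s+/)) {
    if (token === "") {
      continue;
    }
    const sep = token.indexOf(":");
    if (sep > 0 && sep < token.length - 1) {
      // "author:Surname,Name" becomes fields.author = "Surname,Name"
      fields[token.slice(0, sep).toLowerCase()] = token.slice(sep + 1);
    } else {
      words.push(token);
    }
  }
  return { fuzzy, fields, text: words.join(" ") };
}
```

With this sketch, `parseQuery("\\author:Surname,Name date:2020-01-15 some title")` yields `fuzzy: false`, two field filters, and the free text `some title`.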

I don't understand how difficult this would be to implement, as I have not properly understood how the extension works, if it's similar to BetterBibTex (which I don't really know a lot about) and/or if it dumps the citations to a json/sqlite which then is queried. I guess this just sends a request to Zotero via the API?

krassowski commented 3 years ago

Good ideas, thank you! Could you contrast that with how other citation tools implement the advanced search for citation insertion/paper exploration?

I have not properly understood how the extension works, if it's similar to BetterBibTex (which I don't really know a lot about) and/or if it dumps the citations to a json/sqlite which then is queried. I guess this just sends a request to Zotero via the API?

The entire collection gets downloaded locally and stored in IStateDB and synced when needed (or requested); this is handled by ZoteroClient which implements IReferenceProvider interface (so we can have other providers in the future too), roughly these lines are relevant:

https://github.com/krassowski/jupyterlab-citation-manager/blob/bc366a8f7ce0a9f9340c33c982da125d815da7b5/src/zotero.ts#L122-L273

It speaks CSL JSON as defined in https://raw.githubusercontent.com/citation-style-language/schema/master/schemas/input/csl-citation.json and https://raw.githubusercontent.com/citation-style-language/schema/master/schemas/input/csl-data which means that parsing dates is... challenging. I think there is some normalization to make it more palatable elsewhere in the codebase.

The JSON is then filtered and sorted in various selectors which implement Selector.IModel interface (the same approach is used for bibliography styles):

https://github.com/krassowski/jupyterlab-citation-manager/blob/bc366a8f7ce0a9f9340c33c982da125d815da7b5/src/components/selector.tsx#L28-L34

O stands for option, M for Match.

The default model currently does simple filtering based on title, year, authors, and sorting based on the three + number of citations in the current document to break ties:

https://github.com/krassowski/jupyterlab-citation-manager/blob/bc366a8f7ce0a9f9340c33c982da125d815da7b5/src/components/citationSelector.tsx#L136-L213

This needs writing some unit tests.
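
The filtering and tie-breaking described above can be sketched roughly as follows. This is not the code from `citationSelector.tsx`: the `Option` and `Match` shapes, the function names and the weights (3 for title, 2 for author, 1 for year) are assumptions chosen only to illustrate the idea.

```typescript
// Hypothetical, simplified shapes; the real Selector.IModel types differ.
interface Option {
  title: string;
  year?: number;
  authors: string[];
  citationsInDocument: number; // how often the current document cites it
}

interface Match {
  option: Option;
  score: number;
}

// Score one option against the whitespace-separated query terms.
function scoreOption(option: Option, query: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(t => t.length > 0);
  const title = option.title.toLowerCase();
  let score = 0;
  for (const term of terms) {
    if (title.indexOf(term) !== -1) {
      score += 3;
    }
    if (option.authors.some(a => a.toLowerCase().indexOf(term) !== -1)) {
      score += 2;
    }
    if (option.year !== undefined && String(option.year) === term) {
      score += 1;
    }
  }
  return score;
}

// Keep matching options, best score first; ties go to the reference
// that is already cited more often in the current document.
function filterAndSort(options: Option[], query: string): Match[] {
  return options
    .map(option => ({ option, score: scoreOption(option, query) }))
    .filter(match => match.score > 0)
    .sort(
      (a, b) =>
        b.score - a.score ||
        b.option.citationsInDocument - a.option.citationsInDocument
    );
}
```

Pure functions like these are also straightforward to cover with unit tests.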

baggiponte commented 2 years ago

Hi Mike, sorry for the late reply - I will investigate the other reference managers after the 15th (cob). Thank you for the explanation - this is really fascinating! Is the IStateDB a database format for Jupyter Notebooks? Perhaps @retorquere can tell us something more about betterbibtex?

Also, could the references inserted into a notebook be dumped to a .bib file? This would make the citation manager perfect to use in combination with jupyter book!

retorquere commented 2 years ago

I'm pretty sure I can, what do you want to know?

baggiponte commented 2 years ago

Hi @retorquere, thank you for the prompt answer! Full disclosure: I am a newbie in reference managers - I use this jupyterlab-citation-manager which currently supports Zotero and some time ago I played a bit with {rbbt}.

Here's the thing: jupyterlab-citation-manager is sick, but as of now we can only look up references by their title and with fuzzy search. I do not recall if with bbt you can also look up by other tags, say author, and I was wondering how/if you implemented that. Also, @krassowski has underlined how non-trivial it might be to parse dates with CSL JSON: did you have to deal with it?

As a side note, I was also curious to know how you store data: as @krassowski explained above, jupyterlab-citation-manager dumps the whole Zotero collection to a local IStateDB. I ask this because of another, unrelated thing - which deserves another issue/feature request on its own, but I wanted to wait before opening another one. I was wondering if bbt supported the option of dumping all citations in a local file, like references.bib. The jupyter-book project supports building a bibliography from a local .bib file and jupyterlab-citation-manager could go hand in hand with it; however, as far as I understand, the bibliography can only be appended at the end of a Notebook.

Thank you!

EDIT: there's also this interesting thread on gitter and should/might be related with #15 and #8 I guess

retorquere commented 2 years ago

Here's the thing: jupyterlab-citation-manager is sick, but as of now we can only look up references by their title and with fuzzy search. I do not recall if with bbt you can also look up by other tags, say author, and I was wondering how/if you implemented that.

It's probably using BBT's JSON-RPC search endpoint, and that passes the work to Zotero quicksearch, which should search on all fields & tags. I'm not sure what differentiates search on "all fields and tags" and "everything" on Zotero, but I'd guess that "everything" includes attachment content.

Also, @krassowski has underlined how non-trivial it might be to parse dates with CSL JSON: did you have to deal with it?

I don't do CSL-JSON date parsing, but I do produce CSL-JSON dates, and they appear to me to be very well-defined structured objects - there really isn't anything to parse in CSL dates AFAICT. Do you have a sample of a hard-to-parse CSL date, @krassowski?

Parsing free-form dates into CSL is another matter. The BBT date parser is a few hundred lines of code on top of two pretty large EDTF-parsing libraries.

As a side note, I was also curious to know how you store data: as @krassowski explained above, jupyterlab-citation-manager dumps the whole Zotero collection to a local IStateDB.

I have an sqlite db in the Zotero data directory for most BBT data, and a bunch of JSON files for the caches. These can only be read when Zotero is not running: Zotero locks sqlite databases while it is running, and BBT reads-and-deletes the caches so that an error that prevents saving the cache cannot lead to stale caches being read on next startup; it's better to start with an empty cache (which is a self-repairing situation) than a stale cache. The caches are written back out when Zotero shuts down.
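
The read-and-delete idea can be illustrated with a small sketch. This is not BBT's code: `KeyValueStore` and `ReadOnceCache` are hypothetical names, and the storage is abstracted away so that the pattern itself is visible.

```typescript
// Hypothetical storage abstraction (a file, a browser store, ...).
interface KeyValueStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
  delete(key: string): void;
}

class ReadOnceCache {
  private store: KeyValueStore;
  private key: string;
  private entries: Record<string, string> = {};

  constructor(store: KeyValueStore, key: string) {
    this.store = store;
    this.key = key;
  }

  // Read the persisted cache and delete it immediately. If the process
  // dies before save() runs, the next start finds no cache and rebuilds
  // from scratch instead of trusting stale data.
  load(): void {
    const raw = this.store.get(this.key);
    this.store.delete(this.key);
    try {
      this.entries = raw === undefined ? {} : JSON.parse(raw);
    } catch (error) {
      this.entries = {}; // unreadable cache: start empty
    }
  }

  get(name: string): string | undefined {
    const own = Object.prototype.hasOwnProperty.call(this.entries, name);
    return own ? this.entries[name] : undefined;
  }

  put(name: string, value: string): void {
    this.entries[name] = value;
  }

  // Called on clean shutdown only.
  save(): void {
    this.store.set(this.key, JSON.stringify(this.entries));
  }
}
```

The trade-off is a slower start after a crash in exchange for never reading a cache whose last write may have failed.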

I ask this because of another, unrelated thing - which deserves another issue/feature request on its own, but I wanted to wait before opening another one. I was wondering if bbt supported the option of dumping all citations in a local file, like references.bib.

Several ways in fact:

* You can of course manually export to bibtex
* You can set up an auto-export which will keep the exported bib file in sync with the source library/collection you used to create it
* You can download a library/collection from a web endpoint that BBT makes available ("pull export")
* There is a JSON-RPC endpoint that a program can call to do one-off exports or set up an autoexport

EDIT: there's also this interesting thread on gitter and should/might be related with #15 and #8 I guess

I don't know what the topic under discussion is there.

baggiponte commented 2 years ago

Thank you for your prompt and exhaustive reply!

I have an sqlite db in the Zotero data directory for most BBT data, and a bunch of JSON files for the caches. These can only be read when Zotero is not running: Zotero locks sqlite databases while it is running, and BBT reads-and-deletes the caches so that an error that prevents saving the cache cannot lead to stale caches being read on next startup; it's better to start with an empty cache (which is a self-repairing situation) than a stale cache. The caches are written back out when Zotero shuts down.

That I remember, I had a look inside my ~/Zotero and found the sqlite files. I guess the choice to use IStateDB depends on Jupyter.

Several ways in fact:

You can of course manually export to bibtex

From Zotero, right?

You can set up an auto-export which will keep the exported bib file in sync with the source library/collection you used to create it

You can download a library/collection from a web endpoint the BBT makes available ("pull export")

There is a JSON-RPC endpoint that a program can call to do one-off exports or set up an autoexport

Do these pull-export the whole collection or just the files inside the article/publication?

I don't know what the topic under discussion is there.

Oops, sorry: this is more related to exporting to MyST markdown formats. I am drifting off-topic, I guess I should move this discussion somewhere else. In the meantime, thank you for the answers, let's wait and see if Mike has something to comment upon.

krassowski commented 2 years ago

I don't do CSL-JSON date parsing, but I do produce CSL-JSON dates, and they appear to me to be very well-defined structured objects - there really isn't anything to parse in CSL dates AFAICT. Do you have a sample of a hard-to-parse CSL date, @krassowski?

Parsing free-form dates into CSL is another matter. The BBT date parser is a few hundred lines of code on top of two pretty large EDTF-parsing libraries.

EDTF strings are valid entries of date variables in CSL-JSON schema, as is the "structured" form, which may take anything between one and three parts; these may be strings or numbers, and their meaning is not very well documented. Then you have the extra fields like circa, season, etc., and multiple date fields (some records have publication date, creation date, etc.). One of those is mandated by CSL if I recall correctly, but it often only contains the year part while all the details are in the other fields, and how they are populated appears random to me after looking at a large sample of records from Zotero.
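
To make the variety concrete, here is a minimal sketch that pulls a year out of the forms mentioned above: an EDTF-style string, a structured `date-parts` array with string or number parts, and the `raw` / `literal` fallbacks. The `CslDate` type is a simplified reading of the CSL-JSON date variable, and `extractYear` / `yearFromText` are hypothetical helpers, not the normalization used in the extension. It looks for the first standalone four-digit number, so negative or longer EDTF years are not handled.

```typescript
type DatePart = string | number;

// Simplified view of a CSL-JSON date variable.
interface CslDate {
  "date-parts"?: DatePart[][];
  season?: string | number;
  circa?: string | number | boolean;
  literal?: string;
  raw?: string;
}

type CslDateVariable = CslDate | string;

// First standalone four-digit number in free text, if any.
function yearFromText(text: string | undefined): number | undefined {
  if (text === undefined) {
    return undefined;
  }
  const match = /\b\d{4}\b/.exec(text);
  return match ? Number(match[0]) : undefined;
}

function extractYear(date: CslDateVariable | undefined): number | undefined {
  if (date === undefined) {
    return undefined;
  }
  if (typeof date === "string") {
    return yearFromText(date); // EDTF-style string such as "2018-03"
  }
  const parts = date["date-parts"];
  if (parts !== undefined && parts.length > 0) {
    const first = parts[0];
    if (first !== undefined && first.length > 0) {
      const text = String(first[0]).trim();
      const year = Number(text);
      if (text !== "" && isFinite(year)) {
        return year; // parts may be numbers or numeric strings
      }
    }
  }
  // Fall back to the unstructured fields.
  const fromRaw = yearFromText(date.raw);
  return fromRaw !== undefined ? fromRaw : yearFromText(date.literal);
}
```

A date-based filter would then compare `extractYear(item.issued)` against the requested period, where `issued` is one of several date fields a record may carry.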

krassowski commented 2 years ago

The jupyter-book project supports building a bibliography from a local .bib file and jupyterlab-citation-manager could go hand in hand with it; however, as far as I understand, the bibliography can only be appended at the end of a Notebook.

This is tracked in the other issues you mentioned - let's keep this issue focused on the search capabilities ;)

krassowski commented 2 years ago

Thank you for the explanation - this is really fascinating! Is the IStateDB a database format for Jupyter Notebooks?

No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.

baggiponte commented 2 years ago

It's probably using BBT's JSON-RPC search endpoint, and that passes the work to Zotero quicksearch, which should search on all fields & tags. I'm not sure what differentiates search on "all fields and tags" and "everything" on Zotero, but I'd guess that "everything" includes attachment content.

I can't figure out the UI for this: do you just write something like author:'Author1 Author2' and then Zotero quick search searches inside of tags? What about BBT, does this translate into a SELECT * WHERE author IN ('author1', 'author2')?

No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.

But you basically query from it, right? So entries have fields and the matter here is just to find a UI query-style consistent with what other citation managers offer, am I correct?

krassowski commented 2 years ago

Thank you @retorquere for your generous advice and explaining how BBT works. From that, I gather it relies on a local Zotero installation and access to its local API to pass on the search tasks; @baggiponte, as also discussed in #37, this is really not a feasible path for this extension for several reasons:

* JupyterHub and other setups will often have no access to the local Zotero installation as it is on a different computer
* even on the same computer Zotero does not easily allow access to its API from the browser due to the security implications; the block is implemented at the CORS level and can only be circumvented by:
  - developing a browser extension (as the Zotero Google Docs integration does), which is a tremendous amount of work and gives subpar UX (now the user has to have the Jupyter server extension, Jupyter frontend extension AND browser extension installed AND give it access to the contents of the websites they open - and we don't know if they will use Jupyter on localhost or say hpc.myuni.ac.uk so we cannot even limit the access request to a single domain!)
  - developing a proxy in the Jupyter server extension, which I discussed in #37; this remains a possibility for an alternative implementation of our IReferenceProvider interface (but it will only work for a subset of users and likely confuse newcomers)
* by design, this extension is intended to interface with multiple citation providers, subject to API availability, so it should not rely on any Zotero-specific features; we could create an ISearchProvider interface to allow using the Zotero Web API to search references, but this should be optional and the core functionality has to operate on standard CSL-JSON records directly
* touching the Zotero database on disk directly is not a maintainable implementation for me (it is not a public API in the first place AFAIK) and this extension should not do that, so that we can in the future install it as an isolated package (say flatpak) with restricted or no access to the disk (which should really be the norm now)
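
One way the optional ISearchProvider idea could look, as a sketch only. No such interface exists in the extension at the time of this discussion; the method signature, `CslItem` (a small subset of a CSL-JSON record), `LocalSearchProvider` and `searchWithFallback` are all hypothetical. The point is that a remote, provider-specific search stays optional while a local search over CSL-JSON records always works.

```typescript
// Small subset of a CSL-JSON record, enough for this sketch.
interface CslItem {
  id: string;
  title?: string;
  author?: { family?: string; given?: string }[];
}

// Hypothetical optional interface; resolves to ids of matching items.
interface ISearchProvider {
  search(query: string): Promise<string[]>;
}

// True when every query term occurs in the title or an author name.
function matchesQuery(item: CslItem, query: string): boolean {
  const names = (item.author || []).map(
    a => (a.family || "") + " " + (a.given || "")
  );
  const haystack = [item.title || ""].concat(names).join(" ").toLowerCase();
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter(term => term !== "")
    .every(term => haystack.indexOf(term) !== -1);
}

// Core behaviour: search the locally stored CSL-JSON records.
class LocalSearchProvider implements ISearchProvider {
  private items: CslItem[];

  constructor(items: CslItem[]) {
    this.items = items;
  }

  search(query: string): Promise<string[]> {
    const ids = this.items
      .filter(item => matchesQuery(item, query))
      .map(item => item.id);
    return Promise.resolve(ids);
  }
}

// Use the remote provider when one is configured; fall back to the
// local search if it is missing or fails (for example when offline).
function searchWithFallback(
  remote: ISearchProvider | undefined,
  local: ISearchProvider,
  query: string
): Promise<string[]> {
  if (remote === undefined) {
    return local.search(query);
  }
  return remote.search(query).catch(() => local.search(query));
}
```

A provider backed by the Zotero Web API would implement the same interface and be passed in as `remote`.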

krassowski commented 2 years ago

No, it is more of an in-browser cache for JupyterLab; it is not codified into Jupyter at all. And it might change at any point.

But you basically query from it, right? So entries have fields and the matter here is just to find a UI query-style consistent with what other citation managers offer, am I correct?

Yes, the entries follow csl-citation.json schema (which should be read together with csl-data.json schema).

retorquere commented 2 years ago

EDTF strings are valid entries of date variables in CSL-JSON schema

Oh yeah that's complex so I don't bother doing it myself, I outsource that to a library.

as is the "structured" form which may take anything between one and three parts which may be strings or numbers and for which the meaning is not very well documented; then you have the extra fields like circa, season, etc and multiple date fields (some records publication date, creation date, etc.;

I find these not too hard to process, but TBH I don't support all possible combinations. What I can sensibly output is constrained by the target format (bibtex and biblatex) and since biblatex supports edtf, I just forward whatever is deemed (by said library) to be valid EDTF.

one of those is mandated by CSL if I recall correctly but the thing is it often only contains the year part and all the details are in the other fields and how they are populated appears random to me after looking at a large sample of records from Zotero).

I don't use the Zotero date parser, BBT's date parser differs significantly from Zotero's.

You can of course manually export to bibtex

From Zotero, right?

Correct. BBT is only available in the Zotero client.

Do these pull-export the whole collection or just the files inside the article/publication?

With files you mean attachments? There's not yet a JSON-RPC endpoint for that. You can pull down bibtex or biblatex from the endpoint.

I can't figure out the UI for this: do you just write something like author:'Author1 Author2' and then Zotero quick search searches inside of tags? What about BBT, does this translate into a SELECT * WHERE author IN ('author1', 'author2')?

I don't translate it at all; I just pass the text on to the same code that handles the quick search above the item list in Zotero, and return the results.
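
From a client's point of view, "passing the text on" means the search terms travel as a single untranslated string. The sketch below only builds a JSON-RPC 2.0 request body for that. The method name `item.search` and the endpoint URL are assumptions that should be checked against the BBT documentation; they do not come from this thread. `JsonRpcRequest` and `buildSearchRequest` are hypothetical names.

```typescript
// Generic JSON-RPC 2.0 request envelope.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  method: string;
  params: unknown[];
  id: number;
}

// Assumed local endpoint; only reachable where Zotero with BBT runs.
const BBT_JSON_RPC_URL = "http://localhost:23119/better-bibtex/json-rpc";

// The query text is forwarded as-is: no field parsing on the client.
function buildSearchRequest(terms: string, id: number = 1): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    method: "item.search", // assumed method name
    params: [terms],
    id
  };
}

// Body that a program would POST to BBT_JSON_RPC_URL as application/json.
const exampleBody: string = JSON.stringify(buildSearchRequest("author title"));
```

A program running next to Zotero would send `exampleBody` to that URL; a browser-based JupyterLab frontend generally cannot reach a local endpoint like this directly.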

Thank you @retorquere for your generous advice and explaining how BBT works. From that, I gather it relies on a local Zotero installation and access to its local API to pass on the search tasks;

correct

* JupyterHub and other setups will often have no access to the local Zotero installation as it is on a different computer

There's ways around that, but they're not convenient. I have a branch where I work on a BBT that doesn't need the client, but I have no ETA on that beyond "not soon".

* even on the same computer Zotero does not easily allow access to its API from the browser

correct.

* touching the Zotero database on disk directly is not a maintainable implementation for me (it is not a public API in the first place AFAIK)

Only pain lies that way. You most certainly never want to write to the DB directly.

and this extension should not do that so we can in the future install it as an isolated package (say flatpak) with restricted or no access to the disk (which should really be the norm now)

Not a great fan of flatpak et al, but I see the appeal