jrollins / xapers

Personal journal article management and indexing system
https://gitlab.com/jrollins/xapers

Userstory: Get DOI (or other external public ID) of my local document #10

Open pfalcon opened 3 years ago

pfalcon commented 3 years ago

As a keeper of a local personal collection (read: minimal, universally useful metadata, like only the titles of documents), sometimes I may want to share a reference to a particular document with the world. I'd like to use a more formal document identifier than just a title. So I'd need a way to look up a DOI, arXiv ID, or similar identifier for a doc, as well as the URL form of that ID, which other people can visit.

pfalcon commented 3 years ago

Looking at the current source code, this functionality is not available.

jrollins commented 3 years ago

This is available in the curses UI. xapers show id:..., then Alt-U to "yank" the document source URL.

We should provide a way to print doc source URLs from the CLI as well.

pfalcon commented 3 years ago

Thanks for the response.

Interesting. But I don't see how that would work, given the contents of https://gitlab.com/jrollins/xapers/-/blob/master/xapers/sources/doi.py . There's no function like query_possible_ids_by_title(). Again, I have just 1960 Recursive Programming - Dijkstra.pdf, and I'd like a tool which would be able to find a DOI (etc.) by just that. (Oh, and of course, there's no DOI in the document text ;-) ).

jrollins commented 3 years ago

Source identifiers (e.g. DOIs) have to be added to individual documents. See the add command. If your document is id:1, then you could do something like:

$ xapers add --source=doi:... id:1

Usually when I add a new document I do so by adding the PDF and the DOI at the same time:

$ xapers add --source=doi:... --file=/path/to/pdf

I've been trying to streamline the interface, and it will be improved in the next release.

jrollins commented 3 years ago

Interesting. But I don't see how that would work, given the contents of https://gitlab.com/jrollins/xapers/-/blob/master/xapers/sources/doi.py . There's no function like query_possible_ids_by_title().

This is not the appropriate place to look. That file just describes how to interact with a remote source. When a document is indexed the source metadata is indexed as well, and searches are done through the internal xapian database.

pfalcon commented 3 years ago

Source identifiers (e.g. DOIs) have to be added to individual documents.

Umm, no... ;-). Not as far as this user story is concerned. It explicitly says "user doesn't add that boring metadata, instead software automates adding it" (a user can supervise it, actually, that's implied - I for one don't want some stupid AI to contaminate handcrafted metadata of my 1K docs collection).

Usually when I add a new document I do so by adding the PDF and the DOI at the same time:

Umm, and I don't. And that's where the conceptual difference between my papersman and xapers lies: my software is "local-first". It's intended to be run by a mere human for their mere-human needs. It's intended to be run by humans who have no idea what a DOI is, and couldn't care less. But when, years later, they possibly learn what the heck a DOI is, and think that they need some, the software should help to get them (not require humans to enter them manually).

jrollins commented 3 years ago

Hrm, sorry, I was just describing how xapers works, not what you personally should or should not do. xapers needs to know the source of the paper to retrieve its metadata, and there's no good way to figure that out other than by asking the user to supply it. If you have some other suggestion about how xapers could learn the metadata other than through a bunch of fragile ad hoc heuristics, I would be thrilled to learn.

I feel like you may be jumping to conclusions about what xapers is intended to be. It is absolutely a personal paper management system, intended for "mere humans". But for it to be really useful, metadata is needed, and xapers has to learn about it somehow. What I absolutely did not want was for xapers to require the user to enter all the metadata manually. That would be a non-starter. Most journals support DOI, which contains all the metadata in a structured format. So that is by far the easiest way to get the metadata into xapers. It's just a single URL that is clearly provided in most articles. Other source identifiers are supported as well (such as arXiv).

I suggest trying the interactive add option, which scans documents for source identifiers and presents the user with suggestions for which source ID might be appropriate.

I fully acknowledge that your papers might not have a DOI or other sources that are supported by xapers. The source support is modular, so users can easily add their own source modules, and I would be happy to include new ones in xapers.

pfalcon commented 3 years ago

xapers needs to know the source of the paper to retrieve its metadata, and there's no good way to figure that out other than by asking the user to supply it

Yeah, I know ;-). That's why my cute system doesn't do that ("retrieve metadata"), and I'm looking for an alternative which might do that ("without asking the user") before jumping to implement it myself.

If you have some other suggestion about how xapers could learn the metadata other than through a bunch of fragile ad hoc heuristics, I would be thrilled to learn.

Fragility depends on the particular case. In my collection, for example, all papers have the full title and publication year (both as part of the filename; I renamed them manually, and that's as much as I'm willing to do manually). Then, from 15 minutes of research yesterday, https://www.crossref.org/ appears to be a service allowing title -> DOI mapping. They have an API: https://api.crossref.org/works?query=1960%20Recursive%20Programming%20-%20Dijkstra&filter=until-pub-date:1960 . The bad news for that link is that it returns a whole bunch of results. The good news is that result #0 is exactly what's needed. That's my plan for tackling the problem so far.
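A rough sketch of that plan in Python, using only the standard library. The function names are made up for illustration, the filename pattern ("<year> <title>.pdf") is an assumption based on the naming scheme described above, and the response layout (a "message" object holding an "items" list, each item with a "DOI" and a "title" list) is what the Crossref works endpoint is understood to return. It is not xapers or papersman code.

```python
import json
import re
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def parse_filename(name):
    """Split '<year> <title>.pdf' into (year, title); year is None if absent."""
    stem = re.sub(r"\.pdf$", "", name, flags=re.IGNORECASE)
    match = re.match(r"(\d{4})\s+(.*)", stem)
    if match:
        return int(match.group(1)), match.group(2)
    return None, stem


def build_query_url(title, year=None, rows=5):
    """Build a Crossref works query; 'rows' caps the number of candidates."""
    params = {"query": title, "rows": rows}
    if year is not None:
        # Same filter as in the link above: nothing published after 'year'.
        params["filter"] = f"until-pub-date:{year}"
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def candidates(response):
    """Extract (DOI, title) pairs from a decoded Crossref JSON response."""
    items = response.get("message", {}).get("items", [])
    return [
        (item["DOI"], (item.get("title") or [""])[0])
        for item in items
        if "DOI" in item
    ]


def lookup(name):
    """Query Crossref for a local filename and return candidate DOIs.

    Performs a network request; results are ranked by Crossref's own
    relevance score, so the first entry is only a best guess.
    """
    year, title = parse_filename(name)
    with urllib.request.urlopen(build_query_url(title, year)) as resp:
        return candidates(json.load(resp))
```

Calling lookup("1960 Recursive Programming - Dijkstra.pdf") would return a short list of candidates rather than a single answer, so the user can still review the match before any DOI is attached to a document.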

I feel like you may be jumping to conclusions about what xapers is intended to be.

I'm definitely trying to build a conceptual model of xapers, and have a bunch of hypotheses. I try to avoid jumping to conclusions, and instead present my use cases, discuss, query, suggest...

Most journals support DOI

Even if that's true, what about those which don't? My collection now has 1000+ papers, and 500 of them not having an embedded DOI doesn't sit well with me. But I doubt "most" is even remotely true. Most of my papers are definitely author preprints and by definition don't have a DOI, which gets assigned by the publisher when an article is published.

I suggest trying the interactive add option

But I'm not starting my papers collection with xapers. I already have a papers collection, and it has grown to such a size that I need tool(s) to help me manage it. In particular, I don't need a tool which will try to "own" my data. And that's another big conceptual difference between my papersman and any other similar tool I've spotted so far (including xapers) - they try to create some opaque "database" behind the user's back and "host" the user's data there. That's Google syndrome - trying to hoard users' data behind their backs (and we know what they use that data for - to then spy on users). Such an approach has so discredited itself (user control/migration/error recovery/longevity) that some people are wary of any attempt to put their data into a database without them explicitly asking for it ;-).

pfalcon commented 3 years ago

Btw, I decided to bite the bullet and give a try to another tool I had long had in my queue - Zotero. It waited very long in my queue because I knew that my minimalist aspirations were unlikely to be satisfied by its big GUI bloat, but... we need to fish out those DOIs somehow, right?

As expected, it's pretty cool in its GUI-ness. It's also hilariously ad hoc in places. But I notice that even it has got it semi-right: Stored Files and Linked Files. I.e. it will definitely own your metadata (but it will own it in an SQLite database, which is not a bad choice at all if you want to take it back), but at least it has curbed its appetite for owning user data - you can tell it "hands off my data files, link to them, don't try to own them".

Oh, and of course, it retrieves metadata (like DOI) automatically whenever you throw a PDF at it ;-).

All in all, I'm glad I finally gave it a try. It finally answers the question of why there's not just no clear leader, but hardly any active projects at all in this "paper management" area: there is a clear leader, Zotero. And everyone who's serious about this area has apparently been on it for years already (the current major version of Zotero, 5, was initially released in 2017).

So, I guess any tool which wants to work in this area (I mean for nuts like me who can't just use only Zotero) should seriously consider interoperability with Zotero.

pfalcon commented 3 years ago

For reference, I posted on Reddit asking what people use to organize their personal libraries: https://old.reddit.com/r/ProgrammingLanguages/comments/lbxblp/meta_my_proglangcompilertheoryproganalysis/ . The hypothesis was that the majority would reply "Zotero". And indeed, that was the response posted immediately, but in the end, it can't be said to be a de facto standard.

Today I also stumbled upon https://github.com/neuml/paperai , "AI-powered literature discovery and review engine for medical/scientific papers". That's also exactly the kind of buzzwords I was looking for in regard to DOI/other metadata acquisition (Everyone understands it will be of subpar quality, but that's exactly the reason to spend own time on that and reuse others' toys) ;-).