Query Filters - Githubissues

sneakers-the-rat commented 10 months ago

Stemming from: https://github.com/sneakers-the-rat/journal-rss/issues/4

@thesamovar and El (not sure of username!) both expressed wanting to filter papers out of a feed by keyword.

I don't think we should jump to full text indexing yet, but adding query parameters onto feed generation could be useful as a v2 milestone.

Example syntax: /feeds/{issn}/rss?not=tag:{tagname}&not=author:{authorname}
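A minimal sketch of how the example syntax above could be read, assuming repeated `not=field:value` parameters and papers represented as plain dicts with `tag` and `author` lists. The function names and the paper shape are hypothetical, not anything in the codebase, and matching is assumed to be case-insensitive:

```python
from urllib.parse import parse_qs, urlsplit


def parse_exclusions(url):
    """Collect repeated `not=field:value` params into {field: {values}}."""
    exclusions = {}
    for raw in parse_qs(urlsplit(url).query).get("not", []):
        field, sep, value = raw.partition(":")
        if not (sep and field and value):
            continue  # skip malformed filters such as `not=tag`
        # Assumption: filters match case-insensitively.
        exclusions.setdefault(field, set()).add(value.lower())
    return exclusions


def keep(paper, exclusions):
    """Return False if any excluded value appears in the matching field."""
    for field, banned in exclusions.items():
        values = paper.get(field, [])
        if isinstance(values, str):
            values = [values]
        if any(v.lower() in banned for v in values):
            return False
    return True


# Hypothetical papers; real entries would come from the feed source.
papers = [
    {"title": "A", "tag": ["review"], "author": ["X"]},
    {"title": "B", "tag": ["methods"], "author": ["Jane Doe"]},
    {"title": "C", "tag": ["methods"], "author": ["Y"]},
]
url = "/feeds/1234-5678/rss?not=tag:review&not=author:Jane%20Doe"
exclusions = parse_exclusions(url)
filtered = [p["title"] for p in papers if keep(p, exclusions)]
```

This only covers exclusion ("not") filters, which is the simple end of the scope question; Boolean combinations would need more than a flat dict.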

Questions

[ ] interface: how to present query params? Do we want to build a whole query builder thing, or do we just document how to modify the URLs to add filters?
[ ] scope: which filters? Removing content from a feed is one thing, but implementing Boolean logic is quite another. One might imagine being able to make pretty complex feeds with multiple joined keywords, and that seems cool if it's free but also a lot of work if not. Which fields and which operators should we support?
[ ] caching: how will this work with caching? Not expecting a million subscribers per instance, but it does cost something to generate a feed. Can we cache query feeds along with regular feeds in an effective way?

Let's shoot for an example implementation that doesn't really care about perf/caching and see how expensive it is, but it should be possible.
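On the caching question, one possible approach (a sketch only, not a decided design) is to normalize the query string before using it as a cache key, so that the same filters in a different order hit the same cached feed. `cache_key` is a hypothetical helper, and lowercasing values assumes filters are case-insensitive:

```python
from urllib.parse import parse_qsl, urlencode


def cache_key(issn, query_string):
    """Build an order-independent cache key for a filtered feed."""
    # Deduplicate and sort so equivalent queries map to one key.
    params = sorted({(k, v.strip().lower()) for k, v in parse_qsl(query_string)})
    if not params:
        return issn  # unfiltered feeds keep the plain per-journal key
    return f"{issn}?{urlencode(params)}"


a = cache_key("1234-5678", "not=tag:review&not=author:Doe")
b = cache_key("1234-5678", "not=author:doe&not=tag:review")
```

With this, `a` and `b` are the same key, and a feed with no filters shares its key with the regular feed for that journal.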

mdingemanse commented 10 months ago

Regarding scope, elsewhere I've seen mention of journal and keywords as primary or privileged parameters, but TBH I rarely look for content in a specific journal and mostly find myself filtering based on title, author and abstract.

The idea of being able to subscribe to a feed that is filtered by a set of authors (like a list in fediverse) regardless of journals really fires me up. After all, I don't know where say the next Abeba Birhane, Lucy Suchman or N.J. Enfield paper will appear but I do know I want to read it.

sneakers-the-rat commented 10 months ago

YES. The goal for me is increasingly to make this arbitrary across parameters, so let's call journal and keywords "low hanging fruit" on the way to that end. As we go, since it's such a lightweight problem, I think we should be looking for ways to abstract the basic logic for making feeds: for v1 we may be making routes for specific params, for v2 we might shoot for code generation for all params! Very much intended to be hackable; scope limitations are more for ordering work than hard limits on what we should consider :)

Edit: there are tricky questions about making feeds for particular people that we would want to be mindful of. Part of the design intention here is that someone can run this on their own, so making an author-specific feed doesn't necessarily have to raise the equivalent of a "@name@bird.makeup" kind of question; one could also privately tailor feeds for oneself when that gray area becomes apparent.

roaldarbol commented 10 months ago

Does Crossref include ORCID IDs for authors? If that's the case, then that might be a good place to start... I know not everyone has an ORCID, but it might be easier than trying to follow e.g. "Marc Johnson".

sneakers-the-rat commented 10 months ago

Crossref has them sometimes - we could add ORCID as a data source and make it possible to make feeds by ORCID though :)

roaldarbol commented 10 months ago

Sounds good! Especially if we have a drop down for the search bar - I'll make that its own issue. But it does complicate the entire search process, as the tables have to change their layout. For v1, let's stick to just journal, and then for v2 we can think about how an implementation of this could look.

EDIT: Search bar in #21.

mdingemanse commented 10 months ago

@roaldarbol @sneakers-the-rat yes I agree it would be good to start with ORCID, also because that represents the closest we currently have to an opt-in, author-approved mechanism for authoritative per-author publication lists. A fulltext search on author is likely to be much noisier as I don't think Crossref does author name disambiguation (but correct me if I'm wrong).

Makes a lot of sense to postpone this for a v2 release though. Thanks for responding!

roaldarbol commented 10 months ago

Worth noting that OpenAlex just launched their web interface, and it's pretty great I think! There, it seems they have a unique identifier for each author as well as an ORCID field, so it might be easier to do if/when we implement support for OpenAlex.

sneakers-the-rat commented 10 months ago

Hell yes. I've been working on the AP side (long way round, making a graph db ORM first) but I think that'll take too long to wait on, so after I finish some deadline obligations this week I'm going to merge the PR for the first OpenAlex stuff, do the ORCID feeds, and then let's deploy an instance at feeds.neuromatch.social. I've never done the "CD" part of "CI/CD" so it might be a nice learning opportunity too :)

sneakers-the-rat / paper-feeds

Query Filters #5