Hello, and thank you for reaching out!
A quick disclaimer first, to set expectations: since I'm working on stuff based on my needs/interests and in my free time, I am not sure I will actually implement any of these any time soon. Also, I am reluctant to add features without multiple strong use cases, since aside from the initial development, each new feature increases the maintenance burden.
If you're ok with that, keep reading :)
The way you listed the use cases is spot-on, since while they all have to do with entry user data, they are separate features (they have different requirements and would be implemented quite differently).
I added my thoughts on each of them below, along with some possible alternatives and some clarifying questions. Please let me know if the alternatives help, and which of the features are more important to you (so I can prioritize).
> Actions that should be possible with user text fields:
It should be possible to implement this using the existing entry search infrastructure.
Currently, each entry has one or more rows like the following in the search index (columns that identify which entry they belong to omitted):
| title | feed | content | _path |
|---|---|---|---|
| Entry title | Feed title | Short summary | `.summary` |
| Entry title | Feed title | Longer content | `.content[0].value` |
To allow limiting the search to notes from within the query itself, notes should go in a separate column:
| title | feed | content | note | _path |
|---|---|---|---|---|
| Entry title | Feed title | Short summary | null | `.summary` |
| Entry title | Feed title | Longer content | null | `.content[0].value` |
| Entry title | Feed title | null | User note | `.user_notes[0].value` |
Then, a query like `hello note: great` would mean entries that match "hello" anywhere, with at least one of their notes matching "great".
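For concreteness, here's a minimal sketch of how a dedicated note column could support such queries. This is not reader's actual schema; it only illustrates the SQLite FTS5 column-filter mechanism (reader's search is built on FTS5) with a hypothetical `note` column:

```python
# A minimal sketch, NOT reader's actual schema; it only shows how an FTS5
# column filter ("note: great") limits a term to the hypothetical note column.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE entries_search "
    "USING fts5(title, feed, content, note, _path UNINDEXED)"
)
db.execute(
    "INSERT INTO entries_search VALUES "
    "('Hello world', 'Feed title', 'Short summary', NULL, '.summary')"
)
db.execute(
    "INSERT INTO entries_search VALUES "
    "('Hello world', 'Feed title', NULL, 'a great note', '.user_notes[0].value')"
)

# "hello" may match any column; "great" must appear in the note column.
rows = db.execute(
    "SELECT title, _path FROM entries_search "
    "WHERE entries_search MATCH 'hello AND note: great'"
).fetchall()
print(rows)  # [('Hello world', '.user_notes[0].value')]
```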
It would also be useful for the Search interface to allow more granular updates. Currently, it can only "update the search index for everything that needs updating"; it would be nice to let the user "update the search index for this specific entry I just added a note to".
TBD: What metadata do we store for notes (date added etc.)? Are notes conceptually an array / set, or a dict with a user-provided key (like feed metadata is)?
This would be similar to the existing feed tags (for consistency).
> Actions that should be possible with entry tags:
To add this, I'd like to collect at least a few new use cases for individual tags.
`read` and `important` can be thought of as tags as well; they're separate things now for performance reasons and to simplify the implementation (very early on, they were actually implemented as tags). Currently, reader does not associate any semantics with `read` and `important` (that is, you can use them for anything, and present them in the UI in any way). `read` is obviously useful for what the name implies. I use `important` in the web app to mean something like important / favorite / star / saved (when I added `important`, I wasn't sure if these are different from one another, so I just added a single flag).

Can you use `important` for the "saved item" use case you describe? If not, why? (Do you plan to use it for something else? If yes, what is the difference between "important" and "saved"?)
This would be similar to the existing feed metadata (for consistency).
> Actions that should be possible with entry metadata:
To add this, I'd like to collect at least a few new use cases.
Here are my thoughts for your specific "download info" use case:
I mainly see this use case as a sort of cache / preloader for enclosures.
Since the file itself is stored externally, it makes sense to store metadata about it externally as well. (If the file gets deleted, is that metadata still valid/useful?)
Path: I would use a standard way of generating a path to the file, something derived from the unique identifiers of the feed / entry / enclosure, for example `<MD5 hash of feed URL>/<MD5 hash of entry id>/<MD5 hash of enclosure URL>`. Alternatively, using the `<MD5 hash of enclosure URL>` alone might be enough – if two entries have the same enclosure, do you need to download the file twice? In either case, you'd need to find a scheme that gets around OS files-per-directory limitations (schemes like `a0/2d/a02dab1856badd8c01d18047ae58cd46` come to mind). For increased readability, it may be useful to add some fragment of said id after the hash (sanitized because of OS character and length limits), e.g. `<MD5 hash of enclosure URL>-original-name.mp3`.
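A rough sketch of the single-hash variant, with directory fan-out and a sanitized name fragment (all names here are illustrative, not part of reader):

```python
# Illustrative only: derive a storage path from the enclosure URL alone.
import hashlib
import re

def enclosure_path(enclosure_url: str) -> str:
    digest = hashlib.md5(enclosure_url.encode('utf-8')).hexdigest()
    # sanitize and truncate the original name for OS character/length limits
    name = re.sub(r'[^A-Za-z0-9._-]', '_', enclosure_url.rsplit('/', 1)[-1])[:50]
    # a0/2d/... fan-out works around files-per-directory limitations
    return f"{digest[:2]}/{digest[2:4]}/{digest}-{name}"

print(enclosure_path('https://example.com/files/episode.mp3'))
# e.g. 'a0/2d/a02dab1856badd8c01d18047ae58cd46-episode.mp3'
```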
Download success: the pattern browsers and some torrent clients use might be acceptable: download the file to `filename.ext.part`; after the download finishes successfully, rename it to `filename.ext`.
How many times the enclosure was downloaded (and other rarely-used arbitrary metadata about enclosures): a `filename.ext.meta` JSON file might be enough for this.
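Sketched together, the `.part` rename and the `.meta` sidecar might look like this (a hedged example; `download_enclosure` and the meta format are made up for illustration):

```python
# Illustrative only: download to filename.ext.part, rename on success,
# and track the download count in a filename.ext.meta JSON sidecar.
import json
import os
import urllib.request

def download_enclosure(url: str, path: str) -> None:
    part_path = path + '.part'
    urllib.request.urlretrieve(url, part_path)  # a failed download leaves only .part
    os.replace(part_path, path)                 # the rename marks success
    meta_path = path + '.meta'
    meta = {}
    if os.path.exists(meta_path):
        with open(meta_path) as file:
            meta = json.load(file)
    meta['downloads'] = meta.get('downloads', 0) + 1
    with open(meta_path, 'w') as file:
        json.dump(meta, file)
```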
Some downsides of the approach described above are that it's hard / inefficient to:
Are there valid/useful use cases for these two things?
I've linked the comment above from the dev notes page, so I don't forget about it.
I'll be closing this for now.
@balki, please re-open if you have any comments (and again, thank you for requesting this).
For anyone reading this in the future, if you need any of the features described above, either cut another issue or re-open this one.
Apologies for the late reply. First, some brief background: I am writing a podcast application, and was investigating whether this library could serve as its data layer. It needed other 'feed'-like features, like 'marking as read', 'getting new entries', etc., but one thing that was missing was that I have to save the local path where I download the mp3 file. Even if I keep that information in a different database, I will need to store some linking information here (like a foreign key). E.g., from this library I can get all the unread entries, get their foreign keys, get the file paths from the other db, and then start playing them. I can also do the reverse, like take the guid from the feed entry and store it in the other db as a unique key.
But if I could completely avoid the other db and store all the extra information here in a 'user data' field (serialized as json), it would make management easier. If I am going to have another db anyway, I am more inclined to do the feed maintenance myself as well, instead of maintaining two different databases.
> since while they all have to do with entry user data, they are separate features (they have different requirements and would be implemented quite differently).
IMHO, in a lot of cases the application will want to save some data for each entry. It is impossible to predict all use cases and support them at the library level. This is a common pattern in software libraries; e.g., when registering a callback with a library, it usually provides a field called 'client_data' (a `void *` in a C library), which is arbitrary data specific to the application.
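(In Python terms, the pattern described above might look like the sketch below; this is a generic illustration, not reader's API.)

```python
# Generic sketch of the "client data" pattern: the library stores an opaque
# value and passes it back to the callback unchanged, like `void *` in C.
class Library:
    def __init__(self):
        self._callbacks = []

    def register_callback(self, callback, client_data=None):
        # client_data is opaque to the library
        self._callbacks.append((callback, client_data))

    def fire(self, event):
        for callback, client_data in self._callbacks:
            callback(event, client_data)

lib = Library()
lib.register_callback(lambda event, data: print(event, data), {'app': 'state'})
lib.fire('new-entry')  # prints: new-entry {'app': 'state'}
```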
One of my other rss scripts does this: it gets entries from rss and sends the links to a telegram channel. It does all the 'feed'-like things: fetch rss, mark processed ones as read, update entries. And then it saves how many 👍, 👎, 😁 reactions each post gets. If I had to use this library for that, I wouldn't have anywhere to save that info. If I saved a json info file for each post, it would just be too much file opening/closing/reading/writing on every click.
If you decide to implement this, I suggest the following: `extract_search_text`, a function which takes user data and returns the text that should be added to the search index.

Thank you for explaining your use case more, and which features you need most.
The C lib "user data" pointer comparison sold me on the arbitrary metadata bit; it also helps that it's probably easiest to implement.
For symmetry with the feed API, it'll look exactly like it (it makes it easier to learn/use):
```python
iter_entry_metadata(entry, *, key=None) -> [(key, value), ...]
get_entry_metadata(entry, key, default=no value) -> value or default
set_entry_metadata(entry, key, value)
delete_entry_metadata(entry, key)
```
This is similar to what you proposed above, but instead of exactly one piece of user data, there can be many.
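To make this concrete, here's how the podcast use case from above might look with the proposed API (hypothetical; nothing below is implemented yet):

```python
# Hypothetical usage of the proposed API (none of these methods exist yet).
# `reader` and `entry` come from the existing API; the 'download' key, its
# value, and play() belong to the application, not to reader.
reader.set_entry_metadata(entry, 'download', {'path': 'files/episode.mp3', 'done': True})

info = reader.get_entry_metadata(entry, 'download', None)
if info and info['done']:
    play(info['path'])

for key, value in reader.iter_entry_metadata(entry):
    print(key, value)
```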
Can you confirm this covers your use case? (If yes, I'll cut a new issue for it.)
As mentioned before, I can make no promises about when this will happen (sometime in the next 6 months?). I am willing to accept pull requests and/or guide you through implementing this, though (it should be relatively easy, since it's similar to the existing feeds feature). Please let me know.
I was also thinking of reusing metadata for the searchable text fields, and having a way of signalling "this field of this metadata should be indexed".
I have to think a bit more about how to implement this, though.
From what I understand about your use case, this is lower priority than the entry metadata (and depends on it in any case).
(The discussion below is more to clarify my thoughts.)
The arbitrary callable may add some issues due to how things are indexed (that is, storage and search are separate components by design, and the way they communicate is a bit convoluted). This is both about performance and consistency.
For example: what happens if you have 2 Reader instances where extract_search_text returns different fields, and you use them alternately? They may return stale search results (maybe acceptable), or worse, we may forget to delete some metadata search index values forever. Also, because the search index can also be updated via the CLI (`python -m reader search update`), passing extract_search_text gets much harder (you'd have to write/use a plugin).
A more static way of marking specific fields for indexing might be easier to implement (and it would also make it possible to get all the stuff that needs to be indexed from a single query). Something like this metadata value (YAML for convenience):
```yaml
somefield: 1
sometextfield: my text
..search: sometextfield
```
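For illustration only (the convention above is not final), consuming such a value might look like:

```python
# Illustrative sketch; the `..search` convention is a draft, not reader's API.
def extract_search_fields(value):
    """Given a metadata value (a dict), yield (field, text) pairs to index."""
    if not isinstance(value, dict):
        return
    for field in value.get('..search', '').split():
        text = value.get(field)
        if isinstance(text, str):
            yield field, text

metadata = {'somefield': 1, 'sometextfield': 'my text', '..search': 'sometextfield'}
print(list(extract_search_fields(metadata)))  # [('sometextfield', 'my text')]
```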
As I said, I need to think more about it.
Thank you for reconsidering! :)
> Can you confirm this covers your use case? (If yes, I'll cut a new issue for it.)
Works well. But I don't understand how the `key` argument in `iter_entry_metadata` is supposed to be used. Does this mean there can be more than one value for a `key`? If so, which one will be returned by `get_entry_metadata`? If not, how is that different from `get_entry_metadata` (other than the output format)?
> As mentioned before, I can make no promises of when this will happen (sometime in the next 6 months?)
No problem. As of now, I use a crude custom implementation based on xmltodict for parsing and tinydb for storage in my script. When I get to adding more features / rewriting, I will reconsider the options. I will put up a PR if I get to work on these features.
> [Searchable text fields] From what I understand about your use case, this is lower priority than the entry metadata
Yes. I haven't checked or thought about the cases you mention. I will check back when I need to use the feature. Thank you for the detailed response!
Regarding the `key` argument: no, there's only one metadata value per key; it's just a different format. Like the docs mention, `get_feed_metadata(feed, key, default)` is the same as `next(iter(iter_feed_metadata(feed, key=key)), default)`, but with a custom exception instead of `StopIteration`.
`get_entries(..., entry=...)` and `get_feeds(..., feed=...)` are similar; this is because it's sometimes more convenient to work with an iterable – say, if you're getting filter arguments in a web app:
```python
# compare
if 'entry' in request.args:
    entry = reader.get_entry(request.args['entry'], None)
    entries = [entry] if entry else []
else:
    kwargs = dict(
        feed=request.args.get('feed'),
        read=as_bool(request.args.get('read')),
        important=as_bool(request.args.get('important')),
    )
    entries = reader.get_entries(**kwargs)

# with
kwargs = dict(
    feed=request.args.get('feed'),
    read=as_bool(request.args.get('read')),
    important=as_bool(request.args.get('important')),
    entry=request.args.get('entry'),
)
entries = reader.get_entries(**kwargs)

# ... and then use entries in a template ...
```
An update on searchable text fields:
In 1.17, I reserved specific metadata keys (by default, those starting with `.reader.`) for special use. I'm still not sure what the value will look like, but the special key for telling reader to index metadata fields will be called `.reader.search`:
```yaml
sometextfield: my text
.reader.search: sometextfield
```
(Prior to 1.17, there wasn't a standard, consistent way of naming this kind of stuff.)
Some early decisions:
Entry tag/metadata will be in new tables, based on the existing `feed_*` ones (we can just copy-paste the functions).
We'll refactor the tag/metadata Storage methods to accept both feeds and entries. Example (the elided body is sketched right after these decisions):

```python
_metadata_schema_info = {
    1: ('feed_', ('feed',)),
    2: ('entry_', ('feed', 'id')),
}

def set_metadata(self, object_id: Tuple[str, ...], key, value):
    table_prefix, fk_columns = self._metadata_schema_info[len(object_id)]
    ...
```
We'll postpone searchable text fields for now.
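For illustration, the body elided in the `set_metadata` example above might continue roughly like this (hypothetical; it assumes `self.db` is a sqlite3 connection and per-prefix `feed_metadata` / `entry_metadata` tables; the real implementation may differ):

```python
# Hypothetical continuation of the set_metadata sketch above.
from typing import Tuple

def set_metadata(self, object_id: Tuple[str, ...], key, value):
    table_prefix, fk_columns = self._metadata_schema_info[len(object_id)]
    columns = ', '.join(fk_columns)
    placeholders = ', '.join('?' for _ in fk_columns)
    # e.g. INSERT OR REPLACE INTO entry_metadata (feed, id, key, value) ...
    self.db.execute(
        f"INSERT OR REPLACE INTO {table_prefix}metadata "
        f"({columns}, key, value) VALUES ({placeholders}, ?, ?)",
        (*object_id, key, value),
    )
```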
To do:

- rename `iter_feed_metadata` to `get_feed_metadata`, so we can have just `get_entry_metadata` (#183)
- entry metadata (likely starting from `test_feed_metadata`)
- entry tags (likely starting from `test_tags_basic`)
- `{get,search}_{entries,entry_counts}(tags=...)`
- tests for `{get,search}_{entries,entry_counts}(tags=...)` (likely starting from `test_filtering_tags`)

2022 update: Development for this feature continues in #272, using the generic API in #266.
Renaming `iter_feed_metadata` to `get_all_feed_metadata` does not look nice, and results in `get_all_feed_metadata_counts`, which is particularly bad (renaming is needed for #183/#185).

Alternatives (possibly worse):

- `iter_feed_metadata` to `get_feed_metadata`, and `get/set/delete_feed_metadata` to `get/set/delete_feed_metadata_item`; lots of changes, but fits with the mapping dunder methods at least
- `iter_feed_metadata` to `get_feed_metadatas` ಠ_ಠ
- `iter_feed_metadata` to `get_feed_metadata_pairs`; breaks the `len(get_feed_${plural}(...)) == get_feed_${singular}_counts(...)` symmetry

Closing this, since I personally don't have a use case for it at the moment, nor enough bandwidth to implement/maintain it.
If anyone wants to take on this work (a detailed roadmap exists above), or is willing to pay for it (I can offer commercial support), please re-open the issue. Thank you!
New use case for entry metadata: caching entry read time (59c57b372464104c04d12846644c0b5ac8f9eeaa).
Hi @balki, starting with reader 2.10 (released today), it is possible to store arbitrary data on entries using resource tags.
Currently, it is still not possible to filter entries by entry tags (second use case in the issue description), but that will be added at some point in the future.
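A quick example of storing and reading back entry data with resource tags (see the reader docs for the exact signatures; the `'download'` key and its value are application-defined):

```python
# Storing arbitrary data on an entry with resource tags (reader >= 2.10).
from reader import make_reader

reader = make_reader('db.sqlite')
reader.add_feed('https://example.com/feed.xml')
reader.update_feeds()

entry = next(iter(reader.get_entries()))
# tag values can be any JSON-serializable data
reader.set_tag(entry, 'download', {'path': 'files/episode.mp3', 'done': True})
print(reader.get_tag(entry, 'download'))
```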
Currently, two bits of user data can be added to a feed entry (`mark_as_read`, `mark_as_important`).
Possible use cases:
Optionally include user_data in search (as an argument to `make_reader`).