Hello, and thank you for reaching out!
A quick disclaimer first, to set expectations: since I'm working on stuff based on my needs/interests and in my free time, I am not sure I will actually implement any of these any time soon. Also, I am reluctant to add features without multiple strong use cases, since aside from the initial development, each new feature increases the maintenance burden.
If you're ok with that, keep reading :)
The way you listed the use cases is spot-on, since while they all have to do with entry user data, they are separate features (they have different requirements and would be implemented quite differently).
I added my thoughts on each of them below, along with some possible alternatives and some clarifying questions. Please let me know if the alternatives help, and which of the features are more important to you (so I can prioritize).
> Actions that should be possible with user text fields:
It should be possible to implement this using the existing entry search infrastructure.
Currently, each entry has one or more rows like the following in the search index (columns that identify which entry they belong to omitted):
| title | feed | content | _path |
|---|---|---|---|
| Entry title | Feed title | Short summary | `.summary` |
| Entry title | Feed title | Longer content | `.content[0].value` |
To allow limiting the search to notes from within the query itself, notes should go in a separate column:
| title | feed | content | note | _path |
|---|---|---|---|---|
| Entry title | Feed title | Short summary | null | `.summary` |
| Entry title | Feed title | Longer content | null | `.content[0].value` |
| Entry title | Feed title | null | User note | `.user_notes[0].value` |
Then, a query like `hello note: great` would mean entries that match "hello" anywhere, with at least one of their notes matching "great".
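For concreteness, here's a minimal sketch of how a dedicated note column could support such queries. This is not reader's actual schema; it only illustrates the SQLite FTS5 column-filter mechanism (reader's search is built on FTS5) with a hypothetical `note` column:

```python
# A minimal sketch, NOT reader's actual schema; it only shows how an FTS5
# column filter ("note: great") limits a term to the hypothetical note column.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE entries_search "
    "USING fts5(title, feed, content, note, _path UNINDEXED)"
)
db.execute(
    "INSERT INTO entries_search VALUES "
    "('Hello world', 'Feed title', 'Short summary', NULL, '.summary')"
)
db.execute(
    "INSERT INTO entries_search VALUES "
    "('Hello world', 'Feed title', NULL, 'a great note', '.user_notes[0].value')"
)

# "hello" may match any column; "great" must appear in the note column.
rows = db.execute(
    "SELECT title, _path FROM entries_search "
    "WHERE entries_search MATCH 'hello AND note: great'"
).fetchall()
print(rows)  # [('Hello world', '.user_notes[0].value')]
```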
It would also be useful for the Search interface to allow more granular updates. Currently, it can only "update the search index for everything that needs updating"; it would be nice to let the user "update the search index for this specific entry I just added a note to".
TBD: What metadata do we store for notes (date added etc.)? Are notes conceptually an array / set, or a dict with a user-provided key (like feed metadata is)?
This would be similar to the existing feed tags (for consistency).
> Actions that should be possible with entry tags:
To add this, I'd like to collect at least a few new use cases for individual tags.
`read` and `important` can be thought of as tags as well; they're separate things now for performance reasons and to simplify the implementation (very early on, they were actually implemented as tags). Currently, reader does not associate any semantics with `read` and `important` (that is, you can use them for anything, and present them in the UI in any way). `read` is obviously useful for what the name implies. I use `important` in the web app to mean something like important / favorite / star / saved (when I added `important`, I wasn't sure if these are different from one another, so I just added a single flag).

Can you use `important` for the "saved item" use case you describe? If not, why? (Do you plan to use it for something else? If yes, what is the difference between "important" and "saved"?)
This would be similar to the existing feed metadata (for consistency).
> Actions that should be possible with entry metadata:
To add this, I'd like to collect at least a few new use cases.
Here are my thoughts for your specific "download info" use case:
I mainly see this use case as a sort of cache / preloader for enclosures.
Since the file itself is stored externally, it makes sense to store metadata about it externally as well. (If the file gets deleted, is that metadata still valid/useful?)
Path: I would use a standard way of generating a path to the file, something derived from the unique identifiers of the feed / entry / enclosure, for example `<MD5 hash of feed URL>/<MD5 hash of entry id>/<MD5 hash of enclosure URL>`. Alternatively, using the `<MD5 hash of enclosure URL>` alone might be enough – if two entries have the same enclosure, do you need to download the file twice? In either case, you'd need to find a scheme that gets around OS files-per-directory limitations (schemes like `a0/2d/a02dab1856badd8c01d18047ae58cd46` come to mind). For increased readability, it may be useful to add some fragment of said id after the hash (sanitized because of OS character and length limits), e.g. `<MD5 hash of enclosure URL>-original-name.mp3`.
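A rough sketch of the single-hash variant, with directory fan-out and a sanitized name fragment (all names here are illustrative, not part of reader):

```python
# Illustrative only: derive a storage path from the enclosure URL alone.
import hashlib
import re

def enclosure_path(enclosure_url: str) -> str:
    digest = hashlib.md5(enclosure_url.encode('utf-8')).hexdigest()
    # sanitize and truncate the original name for OS character/length limits
    name = re.sub(r'[^A-Za-z0-9._-]', '_', enclosure_url.rsplit('/', 1)[-1])[:50]
    # a0/2d/... fan-out works around files-per-directory limitations
    return f"{digest[:2]}/{digest[2:4]}/{digest}-{name}"

print(enclosure_path('https://example.com/files/episode.mp3'))
# e.g. 'a0/2d/a02dab1856badd8c01d18047ae58cd46-episode.mp3'
```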
Download success: the pattern browsers and some torrent clients use might be acceptable: download the file to `filename.ext.part`; after the download finishes successfully, rename it to `filename.ext`.
How many times the enclosure was downloaded (and other rarely-used arbitrary metadata about enclosures): a `filename.ext.meta` JSON file might be enough for this.
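Sketched together, the `.part` rename and the `.meta` sidecar might look like this (a hedged example; `download_enclosure` and the meta format are made up for illustration):

```python
# Illustrative only: download to filename.ext.part, rename on success,
# and track the download count in a filename.ext.meta JSON sidecar.
import json
import os
import urllib.request

def download_enclosure(url: str, path: str) -> None:
    part_path = path + '.part'
    urllib.request.urlretrieve(url, part_path)  # a failed download leaves only .part
    os.replace(part_path, path)                 # the rename marks success
    meta_path = path + '.meta'
    meta = {}
    if os.path.exists(meta_path):
        with open(meta_path) as file:
            meta = json.load(file)
    meta['downloads'] = meta.get('downloads', 0) + 1
    with open(meta_path, 'w') as file:
        json.dump(meta, file)
```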
Some downsides of the approach described above are that it's hard / inefficient to:
Are there valid/useful use cases for these two things?
I've linked the comment above from the dev notes page, so I don't forget about it.
I'll be closing this for now.
@balki, please re-open if you have any comments (and again, thank you for requesting this).
For anyone reading this in the future, if you need any of the features described above, either cut another issue or re-open this one.
Apologies for the late reply. First, some brief background: I am writing a podcast application, and was investigating whether this library could serve as its data layer. It needed other 'feed'-like features, like 'marking as read', 'getting new entries', etc., but one thing that was missing was that I have to save the local path where I download the mp3 file. Even if I keep that information in a different database, I will need to store some linking information here (like a foreign key). E.g., from this library I can get all the unread entries, get their foreign keys, get the file paths from the other db, and then start playing them. I can also do the reverse, like take the guid from the feed entry and store it in the other db as a unique key.
But if I could completely avoid the other db and store all the extra information here in a 'user data' field (serialized as json), it would make management easier. If I am going to have another db anyway, I am more inclined to do the feed maintenance myself as well, instead of maintaining two different databases.
> since while they all have to do with entry user data, they are separate features (they have different requirements and would be implemented quite differently).
IMHO, in a lot of cases the application will want to save some data for each entry. It is impossible to predict all use cases and support them at the library level. This is a common pattern in software libraries; e.g., when registering a callback with a library, it usually provides a field called 'client_data' (a `void *` in a C library), which is arbitrary data specific to the application.
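(In Python terms, the pattern described above might look like the sketch below; this is a generic illustration, not reader's API.)

```python
# Generic sketch of the "client data" pattern: the library stores an opaque
# value and passes it back to the callback unchanged, like `void *` in C.
class Library:
    def __init__(self):
        self._callbacks = []

    def register_callback(self, callback, client_data=None):
        # client_data is opaque to the library
        self._callbacks.append((callback, client_data))

    def fire(self, event):
        for callback, client_data in self._callbacks:
            callback(event, client_data)

lib = Library()
lib.register_callback(lambda event, data: print(event, data), {'app': 'state'})
lib.fire('new-entry')  # prints: new-entry {'app': 'state'}
```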
One of my other rss scripts does this: it gets entries from rss and sends the links to a telegram channel. It does all the 'feed'-like things: fetch rss, mark processed ones as read, update entries. And then it saves how many 👍, 👎, 😁 reactions each post gets. If I had to use this library for that, I wouldn't have anywhere to save that info. If I saved a json info file for each post, it would just be too much file opening/closing/reading/writing on every click.
If you decide to implement this, I suggest the following: `extract_search_text`, a function which takes user data and returns the text that should be added to the search index.

Thank you for explaining your use case more, and which features you need most.
The C lib "user data" pointer comparison sold me on the arbitrary metadata bit; it also helps that it's probably easiest to implement.
For symmetry with the feed API, it'll look exactly like it (it makes it easier to learn/use):
```python
iter_entry_metadata(entry, *, key=None) -> [(key, value), ...]
get_entry_metadata(entry, key, default=no value) -> value or default
set_entry_metadata(entry, key, value)
delete_entry_metadata(entry, key)
```
This is similar to what you proposed above, but instead of exactly one piece of user data, there can be many.
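To make this concrete, here's how the podcast use case from above might look with the proposed API (hypothetical; nothing below is implemented yet):

```python
# Hypothetical usage of the proposed API (none of these methods exist yet).
# `reader` and `entry` come from the existing API; the 'download' key, its
# value, and play() belong to the application, not to reader.
reader.set_entry_metadata(entry, 'download', {'path': 'files/episode.mp3', 'done': True})

info = reader.get_entry_metadata(entry, 'download', None)
if info and info['done']:
    play(info['path'])

for key, value in reader.iter_entry_metadata(entry):
    print(key, value)
```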
Can you confirm this covers your use case? (If yes, I'll cut a new issue for it.)
As mentioned before, I can make no promises about when this will happen (sometime in the next 6 months?). I am willing to accept pull requests and/or guide you through implementing this, though (it should be relatively easy, since it's similar to the existing feeds feature). Please let me know.
I was also thinking of reusing metadata for the searchable text fields, and having a way of signalling "this field of this metadata should be indexed".
I have to think a bit more about how to implement this, though.
From what I understand about your use case, this is lower priority than the entry metadata (and depends on it in any case).
(The discussion below is more to clarify my thoughts.)
The arbitrary callable may add some issues due to how things are indexed (that is, storage and search are separate components by design, and the way they communicate is a bit convoluted). This is both about performance and consistency.
For example: what happens if you have 2 Reader instances where extract_search_text returns different fields, and you use them alternately? They may return stale search results (maybe acceptable), or worse, we may forget to delete some metadata search index values forever. Also, because the search index can also be updated via the CLI (`python -m reader search update`), passing extract_search_text gets much harder (you'd have to write/use a plugin).
A more static way of marking specific fields for indexing might be easier to implement (and it would also make it possible to get all the stuff that needs to be indexed from a single query). Something like this metadata value (YAML for convenience):
```yaml
somefield: 1
sometextfield: my text
..search: sometextfield
```
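For illustration only (the convention above is not final), consuming such a value might look like:

```python
# Illustrative sketch; the `..search` convention is a draft, not reader's API.
def extract_search_fields(value):
    """Given a metadata value (a dict), yield (field, text) pairs to index."""
    if not isinstance(value, dict):
        return
    for field in value.get('..search', '').split():
        text = value.get(field)
        if isinstance(text, str):
            yield field, text

metadata = {'somefield': 1, 'sometextfield': 'my text', '..search': 'sometextfield'}
print(list(extract_search_fields(metadata)))  # [('sometextfield', 'my text')]
```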
As I said, I need to think more about it.
Thank you for reconsidering! :)
> Can you confirm this covers your use case? (If yes, I'll cut a new issue for it.)
Works well. But I don't understand how the `key` argument in `iter_entry_metadata` is supposed to be used. Does this mean there can be more than one value for a `key`? If so, which one will be returned by `get_entry_metadata`? If not, how is that different from `get_entry_metadata` (other than the output format)?
> As mentioned before, I can make no promises of when this will happen (sometime in the next 6 months?)
No problem. As of now, I use a crude custom implementation based on xmltodict for parsing and tinydb for storage in my script. When I get to adding more features / rewriting, I will reconsider the options. I will put up a PR if I get to work on these features.
> [Searchable text fields] From what I understand about your use case, this is lower priority than the entry metadata
Yes. I haven't checked or thought about the cases you mention. I will check back when I need to use the feature. Thank you for the detailed response!
Regarding the `key` argument: no, there's only one metadata value per key; it's just a different format. Like the docs mention, `get_feed_metadata(feed, key, default)` is the same as `next(iter(iter_feed_metadata(feed, key=key)), default)`, but with a custom exception instead of `StopIteration`.
`get_entries(..., entry=...)` and `get_feeds(..., feed=...)` are similar; this is because it's sometimes more convenient to work with an iterable – say, if you're getting filter arguments in a web app:
```python
# compare
if 'entry' in request.args:
    entry = reader.get_entry(request.args['entry'], None)
    entries = [entry] if entry else []
else:
    kwargs = dict(
        feed=request.args.get('feed'),
        read=as_bool(request.args.get('read')),
        important=as_bool(request.args.get('important')),
    )
    entries = reader.get_entries(**kwargs)

# with
kwargs = dict(
    feed=request.args.get('feed'),
    read=as_bool(request.args.get('read')),
    important=as_bool(request.args.get('important')),
    entry=request.args.get('entry'),
)
entries = reader.get_entries(**kwargs)

# ... and then use entries in a template ...
```
An update on searchable text fields:
In 1.17, I reserved specific metadata keys (by default, those starting with `.reader.`) for special use. I'm still not sure what the value will look like, but the special key for telling reader to index metadata fields will be called `.reader.search`:
```yaml
sometextfield: my text
.reader.search: sometextfield
```
(Prior to 1.17, there wasn't a standard, consistent way of naming this kind of stuff.)
Some early decisions:
Entry tag/metadata will be in new tables, based on the existing `feed_*` ones (we can just copy-paste the functions).
We'll refactor the tag/metadata Storage methods to accept both feeds and entries. Example (the elided body is sketched right after these decisions):

```python
_metadata_schema_info = {
    1: ('feed_', ('feed',)),
    2: ('entry_', ('feed', 'id')),
}

def set_metadata(self, object_id: Tuple[str, ...], key, value):
    table_prefix, fk_columns = self._metadata_schema_info[len(object_id)]
    ...
```
We'll postpone searchable text fields for now.
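For illustration, the body elided in the `set_metadata` example above might continue roughly like this (hypothetical; it assumes `self.db` is a sqlite3 connection and per-prefix `feed_metadata` / `entry_metadata` tables; the real implementation may differ):

```python
# Hypothetical continuation of the set_metadata sketch above.
from typing import Tuple

def set_metadata(self, object_id: Tuple[str, ...], key, value):
    table_prefix, fk_columns = self._metadata_schema_info[len(object_id)]
    columns = ', '.join(fk_columns)
    placeholders = ', '.join('?' for _ in fk_columns)
    # e.g. INSERT OR REPLACE INTO entry_metadata (feed, id, key, value) ...
    self.db.execute(
        f"INSERT OR REPLACE INTO {table_prefix}metadata "
        f"({columns}, key, value) VALUES ({placeholders}, ?, ?)",
        (*object_id, key, value),
    )
```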
To do:

- rename `iter_feed_metadata` to `get_feed_metadata`, so we can have just `get_entry_metadata` (#183)
- entry metadata (likely starting from `test_feed_metadata`)
- entry tags (likely starting from `test_tags_basic`)
- `{get,search}_{entries,entry_counts}(tags=...)`
- tests for `{get,search}_{entries,entry_counts}(tags=...)` (likely starting from `test_filtering_tags`)

2022 update: Development for this feature continues in #272, using the generic API in #266.
Renaming `iter_feed_metadata` to `get_all_feed_metadata` does not look nice, and results in `get_all_feed_metadata_counts`, which is particularly bad (renaming is needed for #183/#185).

Alternatives (possibly worse):

- `iter_feed_metadata` to `get_feed_metadata`, and `get/set/delete_feed_metadata` to `get/set/delete_feed_metadata_item`; lots of changes, but fits with the mapping dunder methods at least
- `iter_feed_metadata` to `get_feed_metadatas` ಠ_ಠ
- `iter_feed_metadata` to `get_feed_metadata_pairs`; breaks the `len(get_feed_${plural}(...)) == get_feed_${singular}_counts(...)` symmetry

Closing this, since I personally don't have a use case for it at the moment, nor enough bandwidth to implement/maintain it.
If anyone wants to take on this work (a detailed roadmap exists above), or is willing to pay for it (I can offer commercial support), please re-open the issue. Thank you!
New use case for entry metadata: caching entry read time (59c57b372464104c04d12846644c0b5ac8f9eeaa).
Hi @balki, starting with reader 2.10 (released today), it is possible to store arbitrary data on entries using resource tags.
Currently, it is still not possible to filter entries by entry tags (second use case in the issue description), but that will be added at some point in the future.
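A quick example of storing and reading back entry data with resource tags (see the reader docs for the exact signatures; the `'download'` key and its value are application-defined):

```python
# Storing arbitrary data on an entry with resource tags (reader >= 2.10).
from reader import make_reader

reader = make_reader('db.sqlite')
reader.add_feed('https://example.com/feed.xml')
reader.update_feeds()

entry = next(iter(reader.get_entries()))
# tag values can be any JSON-serializable data
reader.set_tag(entry, 'download', {'path': 'files/episode.mp3', 'done': True})
print(reader.get_tag(entry, 'download'))
```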
Currently, two bits of user data can be added to a feed entry (`mark_as_read`, `mark_as_important`).
Possible use cases:
Optionally include user_data in search (as an argument to `make_reader`).