More metadata - Githubissues

MaienM commented 2 years ago

When archiving with --include-meta and --include-episode-meta some of the metadata of the podcast and its episodes are saved. The set that is currently saved is only a subset of all the available metadata though, and I'd like to increase this.

It would be trivial to just include all metadata (probably excluding the items listing for the podcast metadata), and I'd totally be happy with that as a solution, but I wanted to try and make it a bit more flexible.

This draft adds two new flags, --include-meta-fields and --include-episode-meta-fields. As their names suggest these can be used to indicate which extra fields should be included. They can be used multiple times, have simple globbing support (foo*), and support negation (!foobar).

This does not handle filtering on nested keys yet. I didn't realize that properties like itunes:subtitle resulted in a nested structure when making my original design. I can fix this, but I wanted to get some feedback on this before investing more time into it.

A few questions from my side:

Would you be interested in functionality to include more metadata?
If so, are you interested in the approach I have chosen (flags to specify what to include), or would you want to go for a simpler include-all, or something else entirely?
If so, do you have any feelings regarding how to handle nested keys? I'm leaning towards the following:
- When encountering an object, don't apply the filter to the object itself but instead process it's contents. If any of its contents are to be included the (subset) of the object will be included, and otherwise it will be left out entirely.
- When checking the filter for nested keys combine the keys with those of their parent(s). For example, itunes: { subtitle: "foo" } would result in a key of itunes.subtitle which would then be checked against the filter.
- Change * to match anything but ..
- Add ** that doest match anything, including ..
This way one could easily include everything by passing **, or only non-object properties by passing *, or perhaps all itunes properties by passing itunes.**, etc.

MaienM commented 2 years ago

I like the idea of putting the old behaviours into the defaults for this flag. This would even allow people to exclude the current defaults if they so desire (with a simple !*), which is pretty neat, and I'm always for more flexibility. There is one minor problem with this though: the descriptionText field in the current episode metadata comes from the contentSnippet field from the feed. All other fields in the metadata match the names from the feed itself, but that one does not.

We could fix this by adding some remapping functionality, so that a rule like contentSnippet as descriptionText or something similar would include the field under a different name. I wouldn't mind adding this, but I worry about the added complexity. Wouldn't want to turn this into something noone uses! Then again, being able to get exactly what you want in the structure you want it in does sound quite appealing. And it'd be an optional thing, so it wouldn't complicate the "normal" rules at all.

Aside from that, how do you envision the options working? Just adding some default rules wouldn't take away the need for the checks, since you could still add rules without enabling the core include-meta feature. Would you want to merge the options? Have specifying globs implicitly turn on the include-meta feature? Something else?

lightpohl commented 2 years ago

There is one minor problem with this though: the descriptionText field in the current episode metadata comes from the contentSnippet field from the feed. All other fields in the metadata match the names from the feed itself, but that one does not.

Oh! Good call. I'm totally up for switching it over to contentSnippet. I'll just make sure the mark the release correctly — I think it's worth it.

Aside from that, how do you envision the options working? Just adding some default rules wouldn't take away the need for the checks, since you could still add rules without enabling the core include-meta feature. Would you want to merge the options? Have specifying globs implicitly turn on the include-meta feature? Something else?

Similar to --archive, we can have --include-meta and --include-episode-meta optionally include arguments. If passed in without an argument, use the default selectors for the fields as before. And if passed in with field selectors, use those for grabbing the meta fields. What do you think? 😄

MaienM commented 2 years ago

Oh! Good call. I'm totally up for switching it over to contentSnippet. I'll just make sure the mark the release correctly — I think it's worth it.

:+1:

Similar to --archive, we can have --include-meta and --include-episode-meta optionally include arguments. If passed in without an argument, use the default selectors for the fields as before. And if passed in with field selectors, use those for grabbing the meta fields. What do you think? smile

Sure, that works. If you pass a field selector, would you want to add to the defaults or override them? That is, should --include-episode-meta foo result in ['title', 'contentSnippet', 'pubDate', 'creator', 'foo'] or ['foo']? (Mind that if we go for the former you could still use --include-episode-meta '!*' --include-episode-meta foo to only get foo if that is what you want.)

lightpohl commented 2 years ago

I think option 2 makes the most sense to me: --include-episode-meta foo resulting in ['foo'].

Getting more and more excited to have this feature added. 💪

MaienM commented 2 years ago

Should be functional now! No documentation yet, but I'll let you play around with it and see if you like it/manage to break it before we worry about that.

MaienM commented 2 years ago

While preparing examples for the readme I noticed that the module used to parse the feed doesn't actually include all fields. Furthermore, for some of the fields that it does include it does so with a transformation of some kind (e.g. itunes:category results in itunes.categories and itunes.categoriesWithSubs).

This means that more work is needed to actually include all metadata. It also means that if we were to merge this as it currently is we're exposing these transformations the underlying module makes, which would make changing to another way of acquiring the full metadata a breaking change.

I have some ideas for potential solutions that I'll look into. I'll change this back to a draft for now until we have a solution of some sort.

MaienM commented 2 years ago

I did some more work on this. I hooked into the parser to get access to the raw data from the feed while still allowing us to be mostly ignorant about the type of feed (Atom/RSS 0.9/RSS 1.0/RSS 2.0). This gets us access to the full data, but this is unfortunately not without issues.

The first problem is that the mapping from XML to JSON is... icky. It's an unfortunate but unavoidable artifact of mapping XML to JSON; XML allows a single tag to have attributes, child tags and contents, plus it allows repeating a tag (meaning anything is potentially an array), whereas in JSON each entry in an object can only be a single thing, and arrays are explicit even when they only have one entry. To capture all this complexity in JSON results in a rather verbose structure:

XML

``` xml 8-4 Play no ```

JSON

``` json { "title": [ "8-4 Play" ], "itunes:category": [ { "$": { "text": "Leisure" }, "itunes:category": [ { "$": { "text": "Video Games" } } ] }, { "$": { "text": "Technology" } } ], "podcast:locked": [ { "_": "no", "$": { "owner": "info@8-4.jp" } } ] } ```

To aleviate this I've added an option to output to XML instead of JSON. This uses the same library that the RSS parsing library uses, so it round trips properly and it's not really a new dependency.

Another issue is that this exposes details of the underlying feed type to the metadata. This is unavoidable once you get into the full details, but it is a shame to lose out on the simple and uniform metadata we had previously.

The way I've handled this is by storing the raw data under the raw key in the feed/episode metadata. This way you can get access to it if you want to/need something that isn't in the normalized metadata without losing out on the benefits of a uniform format.

These changes add a bit more complexity when compared to the last version, but I've done my best to minimize the complexity while still offering a large degree of flexibility. Overall I'm relatively happy with the result. What do you think?

MaienM commented 2 years ago

I think a lot of people are in the same camp as you and won't be interested in raw. It's not part of the default filters and it wouldn't even be included by a * rule, but we could certainly put it behind a flag.

We could even go a bit further and split this off into a separate metadata file. This could also somewhat alleviate the JSON mapping issue; we could stick to JSON for the normal metadata and use XML for the raw export. This would allow people like me who would like to have a dump of the raw metadata without burdening the JSON output. We could just keep this as a simple all-or-nothing flag without rules support.

Separately I think the customFields thing is worth exploring. I had seen this option but had disregarded it thus far since it didn't fit my use-case of wanting everything, but it would be good as an option for those that want specific extra bits of metadata. I'll create a basic implementation of this and then we can see how we like it and further refine it that way.

One concern I have which is worsened by the addition of customFields is that this is turning into a (potentially) unwieldy amount of CLI arguments. It might be worth considering an option to specify these rules/mappings in a separate file that can be referred to. This should probably be a separate PR, but I am curious what you think. Something similar to mocha.opts? A config file in JSON format?

lightpohl commented 2 years ago

Sorry for the long delay — I was AFK for quite a bit!

I think in an effort to not add too much complexity to the project, adding support for XML and additional parsing might best be supported as an additional script someone could run separately. A separate script that fetches the RSS feed and parses out the metadata one might want per episode would be much simpler and easier to adapt over time. I plan on adding a --include-feed param that will save the raw file into the output directly which then can be targeted by additional scripts.

lightpohl / podcast-dl

More metadata #38