kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Issues with the Media-RSS implementation #195

Open azmeuk opened 5 years ago

azmeuk commented 5 years ago

Hello, I noticed some issues with the media-rss implementation. Before trying to fix them, I would like to discuss it here.

<media:group> is ignored

According to the Media-RSS specification, the <media:group> tag groups several links or representations of the same media. However, my understanding is that feedparser simply ignores this tag and treats every <media:content> as a new media.

> It allows grouping of elements that are effectively the same content, yet different representations. For instance: the same song recorded in both the WAV and MP3 format. It's an optional element that must only be used for this purpose.

https://github.com/kurtmckee/feedparser/blob/d12d3bdd075bca71885ccb02e9b08ac04fcb8514/feedparser/namespaces/mediarss.py#L64-L66 https://github.com/kurtmckee/feedparser/blob/d12d3bdd075bca71885ccb02e9b08ac04fcb8514/feedparser/namespaces/mediarss.py#L119-L122
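To make the expected grouping concrete, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` rather than feedparser itself. The sample feed, the example.com URLs and the `group_media` helper are all hypothetical; the only thing taken from the specification is the Media-RSS namespace URI. Each media ends up as a list of representations, and a standalone <media:content> becomes a media with a single representation.

```python
import xml.etree.ElementTree as ET

MRSS = "http://search.yahoo.com/mrss/"  # Media-RSS namespace URI

# Hypothetical feed: one song in two formats, plus an unrelated image.
SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Song</title>
      <media:group>
        <media:content url="http://example.com/song.wav" type="audio/wav"/>
        <media:content url="http://example.com/song.mp3" type="audio/mpeg"/>
      </media:group>
      <media:content url="http://example.com/cover.jpg" type="image/jpeg"/>
    </item>
  </channel>
</rss>"""


def group_media(item):
    """Return a list of media; each media is a list of attribute dicts."""
    medias = []
    for child in item:
        if child.tag == f"{{{MRSS}}}group":
            # All <media:content> children are representations of one media.
            medias.append(
                [dict(c.attrib) for c in child.findall(f"{{{MRSS}}}content")]
            )
        elif child.tag == f"{{{MRSS}}}content":
            # A standalone <media:content> is a media on its own.
            medias.append([dict(child.attrib)])
    return medias


item = ET.fromstring(SAMPLE).find("channel/item")
medias = group_media(item)
for media in medias:
    print([representation["type"] for representation in media])
```

With this structure a consumer can pick the representation it prefers for each media instead of treating the WAV and MP3 files as two unrelated items.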

The description is set on the feed entry

The <media:description> tag belongs to the media, but feedparser updates the feed entry description.

https://github.com/kurtmckee/feedparser/blob/d12d3bdd075bca71885ccb02e9b08ac04fcb8514/feedparser/namespaces/mediarss.py#L91-L95
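As an illustration of the scoping being asked for, the sketch below keeps the entry description and the media description apart. It again relies on `xml.etree.ElementTree` instead of feedparser, and the sample feed and the shape of the `entry` dictionary are assumptions made for the example, not feedparser's output format.

```python
import xml.etree.ElementTree as ET

NS = {"media": "http://search.yahoo.com/mrss/"}  # Media-RSS namespace URI

# Hypothetical item with both an entry description and a media description.
SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Entry title</title>
      <description>Entry description</description>
      <media:content url="http://example.com/movie.mov" type="video/quicktime">
        <media:description>Description of the movie</media:description>
      </media:content>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(SAMPLE).find("channel/item")

# The entry keeps its own description; each media carries its own.
entry = {"description": item.findtext("description"), "media_content": []}
for content in item.findall("media:content", NS):
    media = dict(content.attrib)
    description = content.findtext("media:description", namespaces=NS)
    if description is not None:
        media["description"] = description
    entry["media_content"].append(media)

print(entry["description"])
print(entry["media_content"][0]["description"])
```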

Some tags are missing

For instance, the <media:subtitle> tag is not handled by feedparser.

Attributes are ignored

When tags are handled, many of the attributes defined in the Media-RSS specification are simply ignored. For instance, <media:description> can be either plain text or HTML, but feedparser does not distinguish between the two.
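A minimal sketch of what honoring that attribute could look like, written with `xml.etree.ElementTree` rather than feedparser. The `read_description` helper and its return shape are hypothetical, and the fallback to "plain" when `type` is absent reflects my reading of the specification, so treat it as an assumption.

```python
import xml.etree.ElementTree as ET

MRSS = "http://search.yahoo.com/mrss/"  # Media-RSS namespace URI


def read_description(element):
    """Return the text of a <media:description> with its declared type."""
    return {
        # Assumption: a missing type attribute means plain text.
        "type": element.get("type", "plain"),
        "value": (element.text or "").strip(),
    }


plain = ET.fromstring(
    f'<media:description xmlns:media="{MRSS}">A plain text</media:description>'
)
html = ET.fromstring(
    f'<media:description xmlns:media="{MRSS}" type="html">'
    "&lt;b&gt;Bold&lt;/b&gt;</media:description>"
)

print(read_description(plain))
print(read_description(html))
```

Keeping the type next to the value lets the caller decide whether the text needs HTML sanitizing or can be displayed as is.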

So...

I would like to tackle these issues, but fixing them could break backward compatibility. How should I handle this? I believe Media-RSS is not widely used, and the simplest option for me is to break compatibility so that feedparser correctly follows the specification. What do you think?

buhtz commented 4 years ago

Could you please give us a short description of what Media-RSS is for? A real use case might make it easier to understand.

azmeuk commented 4 years ago

Of course. Media-RSS is used to describe media, such as audio or video files, and their metadata (thumbnails, description, number of views or listens, rating, links to the media in different formats, etc.).

It is used in every YouTube feed (example) and in PeerTube feeds (example, though support should improve in an upcoming version).

chaimae26 commented 4 years ago

I have the same issue. Did you solve it?

azmeuk commented 4 years ago

Actually, this would take some time to fix. I am willing to write a patch, but I would like to be sure that it will be merged in the end before I start.

@kurtmckee What do you think?

o-felixz commented 4 years ago

This is something we are very interested in as well, especially when it comes to children of media:content such as media:title (for example, to associate image titles with the images themselves).

I have started work on a patch but the changes are breaking at this time (see example below).

Main changes:

  1. media:group (not part of the example below) and media:content are now containers, as expected. media:group may contain media:content elements.
  2. media:{x} now generates media_{x} keys instead of {x} keys. The keys previously known as media_{x} are now known as media_{x}_details (this is mainly to make tags distinguishable from attributes of the parent media:{x}).
  3. media:title is no longer used as a fallback for a missing title (a consequence of 2. above; fixable, but probably violating expectations?).
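The renaming in point 2 can be sketched as a small dictionary transformation. The `to_proposed_keys` helper below is hypothetical and only illustrates the rule on a trimmed-down dictionary; it is not code from the patch, and it assumes the rename applies only when both the {x} key and the media_{x} key are present, which is what the credit and rating keys in the example data suggest.

```python
def to_proposed_keys(media):
    """Rename legacy keys following point 2 (hypothetical helper)."""
    renamed = dict(media)
    for key in list(media):
        if not key.startswith("media_"):
            continue
        short = key[len("media_"):]  # "credit" for "media_credit"
        if short in media:  # both {x} and media_{x} exist
            renamed[f"{key}_details"] = media[key]  # media_{x} -> media_{x}_details
            renamed[key] = media[short]  # {x} -> media_{x}
            del renamed[short]
    return renamed


legacy = {
    "media_credit": [{"role": "artist", "content": "artist's name"}],
    "credit": "artist's name",
    "media_rating": {"content": "nonadult"},
    "rating": "nonadult",
    "media_player": {"url": "http://www.foo.com/player?id=1111"},
}
proposed = to_proposed_keys(legacy)
print(sorted(proposed))
```

Keys without a plain {x} counterpart, such as media_player here, are left untouched.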

Any thoughts on these changes and how they affect the parsed data?

@azmeuk Is this in line with what you had in mind or were you planning on something different?

@kurtmckee Is this in line with the project as a whole?


Input file

```xml
Music Videos 101 http://www.foo.com Discussions of great videos The latest video from an artist http://www.foo.com/item1.htm dfdec888b72151965a34b4b59031290a producer's name artist's name music/artistname/album/song Oh, say, can you see, by the dawn's early light nonadult start=2002-10-13T09:00+01:00; end=2002-10-17T17:00+01:00; scheme=W3C-DTF
```
Parsed data WITHOUT changes

```json
[
  {
    "title": "The latest video from an artist",
    "title_detail": {
      "type": "text/plain",
      "language": null,
      "base": "",
      "value": "The latest video from an artist"
    },
    "links": [
      {
        "rel": "alternate",
        "type": "text/html",
        "href": "http://www.foo.com/item1.htm"
      }
    ],
    "link": "http://www.foo.com/item1.htm",
    "media_content": [
      {
        "url": "http://www.foo.com/movie.mov",
        "filesize": "12216320",
        "type": "video/quicktime",
        "expression": "full"
      }
    ],
    "media_player": {
      "url": "http://www.foo.com/player?id=1111",
      "height": "200",
      "width": "400",
      "content": ""
    },
    "media_hash": {
      "algo": "md5"
    },
    "media_credit": [
      {
        "role": "producer",
        "content": "producer's name"
      },
      {
        "role": "artist",
        "content": "artist's name"
      }
    ],
    "credit": "artist's name",
    "tags": [
      {
        "term": "music/artistname/album/song",
        "scheme": "http://blah.com/scheme",
        "label": null
      }
    ],
    "media_text": {
      "type": "plain"
    },
    "media_rating": {
      "content": "nonadult"
    },
    "rating": "nonadult",
    "validity": "start=2002-10-13T09:00+01:00;\n end=2002-10-17T17:00+01:00;\n scheme=W3C-DTF",
    "validity_start": "2002-10-13T09:00+01:00",
    "validity_start_parsed": [2002, 10, 13, 8, 0, 0, 6, 286, 0]
  }
]
```
Parsed data WITH changes

```json
[
  {
    "title": "The latest video from an artist",
    "title_detail": {
      "type": "text/plain",
      "language": null,
      "base": "",
      "value": "The latest video from an artist"
    },
    "links": [
      {
        "rel": "alternate",
        "type": "text/html",
        "href": "http://www.foo.com/item1.htm"
      }
    ],
    "link": "http://www.foo.com/item1.htm",
    "media_content": [
      {
        "url": "http://www.foo.com/movie.mov",
        "filesize": "12216320",
        "type": "video/quicktime",
        "expression": "full",
        "media_player": {
          "url": "http://www.foo.com/player?id=1111",
          "height": "200",
          "width": "400",
          "content": ""
        },
        "media_hash": {
          "algo": "md5"
        },
        "media_credit_details": [
          {
            "role": "producer",
            "content": "producer's name"
          },
          {
            "role": "artist",
            "content": "artist's name"
          }
        ],
        "media_credit": "artist's name",
        "tags": [
          {
            "term": "music/artistname/album/song",
            "scheme": "http://blah.com/scheme",
            "label": null
          }
        ],
        "media_text": {
          "type": "plain"
        },
        "media_rating_details": {
          "content": "nonadult"
        },
        "media_rating": "nonadult",
        "validity": "start=2002-10-13T09:00+01:00;\n end=2002-10-17T17:00+01:00;\n scheme=W3C-DTF",
        "validity_start": "2002-10-13T09:00+01:00",
        "validity_start_parsed": [2002, 10, 13, 8, 0, 0, 6, 286, 0]
      }
    ]
  }
]
```
Output diff

```diff
 ...
   "media_content": [
     {
       "url": "http://www.foo.com/movie.mov",
       "filesize": "12216320",
       "type": "video/quicktime",
-      "expression": "full"
-    }
-  ],
+      "expression": "full",
   "media_player": {
     "url": "http://www.foo.com/player?id=1111",
     "height": "200",
 ...
-  "media_credit": [
+  "media_credit_details": [
     {
       "role": "producer",
       "content": "producer's name"
     },
     {
       "role": "artist",
       "content": "artist's name"
     }
   ],
-  "credit": "artist's name",
+  "media_credit": "artist's name",
 ...
   "media_text": {
     "type": "plain"
   },
-  "media_rating": {
+  "media_rating_details": {
     "content": "nonadult"
   },
-  "rating": "nonadult",
+  "media_rating": "nonadult",
   "validity": "start=2002-10-13T09:00+01:00;\n end=2002-10-17T17:00+01:00;\n scheme=W3C-DTF",
 ...
+  }
+]
```