mpgirro / stalla

A Kotlin and Java library for RSS podcast feeds
https://stalla.dev
BSD 3-Clause "New" or "Revised" License

Statistical feed analysis #81

Open mpgirro opened 3 years ago

mpgirro commented 3 years ago

As discussed in #28, statistical information about namespace (element/attribute) usage in feeds will help us determine what we should support in the future.

This issue is for result posting and discussion.

mpgirro commented 3 years ago

Here are some first results.

The feeds were collected by querying the Fyyd/Gpodder/Panoptikum directories (just because I already had some old Java code I could adapt) and from a local txt file I used for testing something else a few years ago.

Feeds processed successfully: 1362
Feeds loading failed: 235
Feeds parsing failed: 73

Analysing the successfully loaded and parsed XMLs leads to the following distribution of namespaces and their elements/attributes (every element/attribute is counted once per feed):

* com-wordpress:feed-additions:1 (in 63 feeds)
  - post-id (Element, in 62 feeds)
  - site (Element, in 62 feeds)
* http://a9.com/-/spec/opensearchrss/1.0/ (in 16 feeds)
  - itemsPerPage (Element, in 16 feeds)
  - startIndex (Element, in 16 feeds)
  - totalResults (Element, in 16 feeds)
* http://backend.userland.com/blogChannelModule (in 1 feeds)
* http://backend.userland.com/creativeCommonsRssModule (in 38 feeds)
  - license (Element, in 35 feeds)
* http://bbc.co.uk/2009/01/ppgRss (in 38 feeds)
  - canonical (Element, in 37 feeds)
  - enclosureLegacy (Element, in 37 feeds)
  - enclosureSecure (Element, in 37 feeds)
  - network (Element, in 38 feeds)
  - seriesDetails (Element, in 38 feeds)
  - systemRef (Element, in 38 feeds)
* http://bitlove.org (in 18 feeds)
  - guid (Attribute, in 16 feeds)
* http://developer.longtailvideo.com/ (in 7 feeds)
  - talkId (Element, in 6 feeds)
* http://fireside.fm/modules/rss/fireside (in 5 feeds)
  - genDate (Element, in 5 feeds)
  - hostname (Element, in 5 feeds)
  - playerEmbedCode (Element, in 5 feeds)
  - playerURL (Element, in 5 feeds)
* http://madskills.com/public/xml/rss/module/trackback/ (in 1 feeds)
* http://ogp.me/ns# (in 1 feeds)
* http://pipes.yahoo.com (in 1 feeds)
  - meta (Element, in 1 feeds)
* http://podcastaddict.com (in 2 feeds)
* http://podlove.org/simple-chapters (in 250 feeds)
  - chapter (Element, in 176 feeds)
  - chapters (Element, in 176 feeds)
* http://purl.org/dc/elements/1.1/ (in 401 feeds)
  - creator (Element, in 265 feeds)
  - date (Element, in 15 feeds)
  - identifier (Element, in 2 feeds)
  - language (Element, in 13 feeds)
  - rights (Element, in 11 feeds)
* http://purl.org/dc/terms/ (in 6 feeds)
  - created (Element, in 6 feeds)
  - modified (Element, in 6 feeds)
* http://purl.org/rss/1.0/modules/content (in 1 feeds)
* http://purl.org/rss/1.0/modules/content/ (in 1000 feeds)
  - encoded (Element, in 781 feeds)
* http://purl.org/rss/1.0/modules/slash/ (in 250 feeds)
  - comments (Element, in 212 feeds)
* http://purl.org/rss/1.0/modules/syndication/ (in 274 feeds)
  - updateBase (Element, in 1 feeds)
  - updateFrequency (Element, in 219 feeds)
  - updatePeriod (Element, in 219 feeds)
* http://purl.org/rss/1.0/modules/taxonomy/ (in 10 feeds)
* http://purl.org/syndication/history/1.0 (in 231 feeds)
* http://purl.org/syndication/thread/1.0 (in 12 feeds)
  - total (Element, in 12 feeds)
* http://radiofrance.fr/Lancelot/Podcast# (in 2 feeds)
  - businessReference (Element, in 1 feeds)
  - magnetothequeID (Element, in 1 feeds)
  - originStation (Element, in 2 feeds)
* http://rdfs.org/sioc/ns# (in 1 feeds)
* http://rdfs.org/sioc/types# (in 1 feeds)
* http://rssnamespace.org/feedburner/ext/1.0 (in 195 feeds)
  - browserFriendly (Element, in 13 feeds)
  - emailServiceId (Element, in 13 feeds)
  - feedFlare (Element, in 48 feeds)
  - feedburnerHostname (Element, in 13 feeds)
  - info (Element, in 191 feeds)
  - origEnclosureLink (Element, in 102 feeds)
  - origLink (Element, in 126 feeds)
* http://schema.org/ (in 1 feeds)
* http://schemas.google.com/blogger/2008 (in 4 feeds)
* http://schemas.google.com/g/2005 (in 4 feeds)
* http://search.yahoo.com/mrss (in 1 feeds)
  - restriction (Element, in 1 feeds)
* http://search.yahoo.com/mrss/ (in 597 feeds)
  - category (Element, in 200 feeds)
  - content (Element, in 266 feeds)
  - copyright (Element, in 137 feeds)
  - credit (Element, in 199 feeds)
  - description (Element, in 194 feeds)
  - group (Element, in 1 feeds)
  - keywords (Element, in 180 feeds)
  - player (Element, in 28 feeds)
  - rating (Element, in 226 feeds)
  - restriction (Element, in 6 feeds)
  - rights (Element, in 12 feeds)
  - thumbnail (Element, in 187 feeds)
  - title (Element, in 35 feeds)
* http://vemedio.com/dtds/atom/related-1.0.dtd (in 1 feeds)
  - apple-itunes-app (Attribute, in 1 feeds)
* http://web.resource.org/cc/ (in 83 feeds)
* http://webns.net/mvcb/ (in 28 feeds)
  - errorReportsTo (Element, in 1 feeds)
  - generatorAgent (Element, in 1 feeds)
* http://wellformedweb.org/CommentAPI/ (in 255 feeds)
  - comment (Element, in 2 feeds)
  - commentRss (Element, in 170 feeds)
* http://www.adobe.com/amp/1.0 (in 2 feeds)
  - background (Element, in 1 feeds)
  - banner (Element, in 2 feeds)
  - logo (Element, in 2 feeds)
  - networkBackground (Element, in 1 feeds)
  - networkHalfBanner (Element, in 2 feeds)
  - networkLogo (Element, in 2 feeds)
  - networkSmallLogo (Element, in 2 feeds)
  - networkWebsite (Element, in 2 feeds)
* http://www.apple.com/iweb (in 2 feeds)
* http://www.ard.de/ardNamespace (in 12 feeds)
  - sendereihe (Element, in 12 feeds)
  - visibility (Element, in 12 feeds)
  - visibleFrom (Element, in 12 feeds)
  - visibleUntil (Element, in 12 feeds)
* http://www.freie-radios.net/namespaces/frn (in 2 feeds)
  - art (Element, in 2 feeds)
  - id (Element, in 2 feeds)
  - laenge (Element, in 2 feeds)
  - language (Element, in 2 feeds)
  - last_update (Element, in 2 feeds)
  - licence (Element, in 2 feeds)
  - radio (Element, in 2 feeds)
  - serie (Element, in 2 feeds)
  - title (Element, in 2 feeds)
* http://www.georss.org/georss (in 98 feeds)
  - box (Element, in 3 feeds)
  - featurename (Element, in 3 feeds)
  - point (Element, in 7 feeds)
* http://www.google.com/schemas/play-podcasts/1.0 (in 427 feeds)
  - author (Element, in 41 feeds)
  - block (Element, in 14 feeds)
  - category (Element, in 104 feeds)
  - description (Element, in 53 feeds)
  - email (Element, in 41 feeds)
  - explicit (Element, in 52 feeds)
  - image (Element, in 26 feeds)
  - summary (Element, in 1 feeds)
* http://www.google.com/schemas/play-podcasts/1.0/ (in 15 feeds)
* http://www.google.com/schemas/play-podcasts/1.0/play-podcasts.xsd (in 1 feeds)
* http://www.itunes.com/DTDs/Podcast-1.0.dtd (in 12 feeds)
  - author (Element, in 11 feeds)
  - category (Element, in 11 feeds)
  - duration (Element, in 11 feeds)
  - email (Element, in 11 feeds)
  - explicit (Element, in 9 feeds)
  - image (Element, in 12 feeds)
  - keywords (Element, in 11 feeds)
  - link (Element, in 10 feeds)
  - name (Element, in 10 feeds)
  - new-feed-url (Element, in 4 feeds)
  - owner (Element, in 9 feeds)
  - subtitle (Element, in 11 feeds)
  - summary (Element, in 11 feeds)
* http://www.itunes.com/dtds/podcast-1.0.dtd (in 1322 feeds)
  - author (Element, in 1317 feeds)
  - block (Element, in 419 feeds)
  - category (Element, in 1287 feeds)
  - complete (Element, in 10 feeds)
  - copyright (Element, in 2 feeds)
  - duration (Element, in 1224 feeds)
  - email (Element, in 1296 feeds)
  - episode (Element, in 496 feeds)
  - episodeType (Element, in 690 feeds)
  - explicit (Element, in 1270 feeds)
  - image (Element, in 1294 feeds)
  - isClosedCaptioned (Element, in 2 feeds)
  - keywords (Element, in 665 feeds)
  - link (Element, in 6 feeds)
  - name (Element, in 1269 feeds)
  - new-feed-url (Element, in 245 feeds)
  - order (Element, in 6 feeds)
  - owner (Element, in 1302 feeds)
  - season (Element, in 129 feeds)
  - subitle (Element, in 1 feeds)
  - subtitle (Element, in 1218 feeds)
  - summary (Element, in 1264 feeds)
  - title (Element, in 536 feeds)
  - type (Element, in 702 feeds)
* http://www.itunesu.com/feed (in 1 feeds)
  - category (Element, in 1 feeds)
* http://www.rawvoice.com/rawvoiceRssModule/ (in 127 feeds)
  - donate (Element, in 17 feeds)
  - embed (Element, in 2 feeds)
  - frequency (Element, in 54 feeds)
  - isHD (Element, in 1 feeds)
  - isHd (Element, in 1 feeds)
  - location (Element, in 51 feeds)
  - poster (Element, in 3 feeds)
  - rating (Element, in 38 feeds)
  - subscribe (Element, in 85 feeds)
* http://www.rssboard.org/media-rss (in 15 feeds)
  - category (Element, in 1 feeds)
  - content (Element, in 12 feeds)
  - copyright (Element, in 1 feeds)
  - credit (Element, in 2 feeds)
  - description (Element, in 2 feeds)
  - keywords (Element, in 1 feeds)
  - rating (Element, in 2 feeds)
  - thumbnail (Element, in 1 feeds)
  - title (Element, in 6 feeds)
* http://www.spotify.com/ns/rss (in 17 feeds)
  - countryOfOrigin (Element, in 3 feeds)
* http://www.w3.org/1999/02/22-rdf-syntax-ns# (in 121 feeds)
  - resource (Attribute, in 1 feeds)
* http://www.w3.org/1999/xhtml (in 4 feeds)
  - body (Element, in 1 feeds)
  - meta (Element, in 3 feeds)
* http://www.w3.org/2000/01/rdf-schema# (in 1 feeds)
* http://www.w3.org/2000/xmlns/ (in 1358 feeds)
  - Atom (Attribute, in 1 feeds)
  - acast (Attribute, in 58 feeds)
  - admin (Attribute, in 28 feeds)
  - amp (Attribute, in 2 feeds)
  - anchor (Attribute, in 6 feeds)
  - ard (Attribute, in 12 feeds)
  - art19 (Attribute, in 15 feeds)
  - atom (Attribute, in 1203 feeds)
  - atom10 (Attribute, in 193 feeds)
  - audioboom (Attribute, in 12 feeds)
  - bitlove (Attribute, in 18 feeds)
  - blogChannel (Attribute, in 1 feeds)
  - blogger (Attribute, in 4 feeds)
  - cba (Attribute, in 2 feeds)
  - cc (Attribute, in 83 feeds)
  - content (Attribute, in 1006 feeds)
  - creativeCommons (Attribute, in 38 feeds)
  - dc (Attribute, in 401 feeds)
  - dcterms (Attribute, in 6 feeds)
  - feedburner (Attribute, in 195 feeds)
  - feedpress (Attribute, in 49 feeds)
  - fh (Attribute, in 231 feeds)
  - fireside (Attribute, in 5 feeds)
  - foaf (Attribute, in 1 feeds)
  - frn (Attribute, in 2 feeds)
  - fyyd (Attribute, in 64 feeds)
  - gd (Attribute, in 4 feeds)
  - geo (Attribute, in 95 feeds)
  - georss (Attribute, in 98 feeds)
  - googleplay (Attribute, in 444 feeds)
  - itunes (Attribute, in 1337 feeds)
  - itunesu (Attribute, in 1 feeds)
  - iweb (Attribute, in 2 feeds)
  - jwplayer (Attribute, in 7 feeds)
  - media (Attribute, in 610 feeds)
  - npr (Attribute, in 6 feeds)
  - nprml (Attribute, in 6 feeds)
  - og (Attribute, in 1 feeds)
  - omny (Attribute, in 26 feeds)
  - openSearch (Attribute, in 16 feeds)
  - pa (Attribute, in 2 feeds)
  - pingback (Attribute, in 20 feeds)
  - podaccess (Attribute, in 24 feeds)
  - podcast (Attribute, in 301 feeds)
  - podcastRF (Attribute, in 2 feeds)
  - ppg (Attribute, in 38 feeds)
  - psc (Attribute, in 276 feeds)
  - rawvoice (Attribute, in 127 feeds)
  - rdf (Attribute, in 121 feeds)
  - rdfs (Attribute, in 1 feeds)
  - related (Attribute, in 1 feeds)
  - sc (Attribute, in 4 feeds)
  - schema (Attribute, in 1 feeds)
  - sioc (Attribute, in 1 feeds)
  - sioct (Attribute, in 1 feeds)
  - skos (Attribute, in 1 feeds)
  - slash (Attribute, in 250 feeds)
  - spotify (Attribute, in 18 feeds)
  - sy (Attribute, in 274 feeds)
  - taxo (Attribute, in 10 feeds)
  - thr (Attribute, in 12 feeds)
  - trackback (Attribute, in 1 feeds)
  - wfw (Attribute, in 255 feeds)
  - xhtml (Attribute, in 4 feeds)
  - xmlns (Attribute, in 87 feeds)
  - xsd (Attribute, in 1 feeds)
  - xsi (Attribute, in 4 feeds)
* http://www.w3.org/2001/XMLSchema# (in 1 feeds)
* http://www.w3.org/2001/XMLSchema-instance (in 4 feeds)
* http://www.w3.org/2003/01/geo/wgs84_pos# (in 95 feeds)
  - lat (Element, in 13 feeds)
  - long (Element, in 13 feeds)
* http://www.w3.org/2004/02/skos/core# (in 1 feeds)
* http://www.w3.org/2005/Atom (in 1205 feeds)
  - contributor (Element, in 145 feeds)
  - email (Element, in 2 feeds)
  - facebook (Element, in 1 feeds)
  - id (Element, in 4 feeds)
  - link (Element, in 1166 feeds)
  - name (Element, in 145 feeds)
  - updated (Element, in 4 feeds)
  - uri (Element, in 70 feeds)
* http://www.w3.org/2005/Atom/ (in 25 feeds)
  - link (Element, in 8 feeds)
* http://www.w3.org/XML/1998/namespace (in 37 feeds)
  - base (Attribute, in 31 feeds)
  - lang (Attribute, in 6 feeds)
* http://xmlns.com/foaf/0.1/ (in 1 feeds)
* https://access.acast.com/schema/1.0/ (in 4 feeds)
* https://anchor.fm/xmlns (in 6 feeds)
  - station (Element, in 1 feeds)
  - support (Element, in 1 feeds)
* https://api.npr.org/nprml (in 6 feeds)
* https://art19.com/xmlns/rss-extensions/1.0 (in 15 feeds)
* https://audioboom.com/rss/1.0 (in 12 feeds)
  - banner-image (Element, in 7 feeds)
* https://cba.fro.at/help#feeds (in 2 feeds)
  - attachmentID (Element, in 2 feeds)
  - broadcastDate (Element, in 2 feeds)
  - containsCopyright (Element, in 2 feeds)
  - duration (Element, in 2 feeds)
  - productionDate (Element, in 2 feeds)
  - teaser (Element, in 2 feeds)
* https://feed.press/xmlns (in 49 feeds)
  - locale (Element, in 49 feeds)
  - newsletterId (Element, in 3 feeds)
  - podcastId (Element, in 18 feeds)
* https://fyyd.de/fyyd-ns/ (in 64 feeds)
  - verify (Element, in 64 feeds)
* https://github.com/Podcastindex-org/podcast-namespace/blob/main/docs/1.0.md (in 72 feeds)
  - chapters (Element, in 4 feeds)
  - funding (Element, in 14 feeds)
  - location (Element, in 19 feeds)
  - locked (Element, in 7 feeds)
  - person (Element, in 5 feeds)
  - transcript (Element, in 1 feeds)
* https://omny.fm/rss-extensions (in 26 feeds)
  - clipId (Element, in 26 feeds)
  - stitcherId (Element, in 12 feeds)
* https://podcastindex.org/namespace/1.0 (in 229 feeds)
  - funding (Element, in 13 feeds)
  - license (Element, in 3 feeds)
  - location (Element, in 3 feeds)
  - locked (Element, in 6 feeds)
  - person (Element, in 95 feeds)
  - transcript (Element, in 9 feeds)
  - value (Element, in 3 feeds)
  - valueRecipient (Element, in 3 feeds)
* https://podlove.de/simple-chapters (in 2 feeds)
  - chapter (Element, in 2 feeds)
  - chapters (Element, in 2 feeds)
* https://podlove.org/simple-chapters (in 4 feeds)
  - chapter (Element, in 4 feeds)
  - chapters (Element, in 4 feeds)
* https://podlove.org/simple-chapters/ (in 26 feeds)
  - chapter (Element, in 5 feeds)
  - chapters (Element, in 5 feeds)
* https://podping.info/specification/1 (in 20 feeds)
  - receiver (Element, in 20 feeds)
* https://purl.org/rss/1.0/modules/content/ (in 3 feeds)
  - encoded (Element, in 3 feeds)
* https://schema-access.acast.com/1.0/ (in 20 feeds)
* https://schema.acast.com/1.0/ (in 58 feeds)
  - episodeId (Element, in 24 feeds)
  - episodeUrl (Element, in 24 feeds)
  - importedFeed (Element, in 4 feeds)
  - network (Element, in 20 feeds)
  - settings (Element, in 32 feeds)
  - showId (Element, in 24 feeds)
  - showUrl (Element, in 22 feeds)
  - signature (Element, in 24 feeds)
* https://www.google.com/schemas/play-podcasts/1.0 (in 1 feeds)
* https://www.itunes.com/dtds/podcast-1.0.dtd (in 5 feeds)
  - author (Element, in 5 feeds)
  - category (Element, in 3 feeds)
  - duration (Element, in 2 feeds)
  - email (Element, in 3 feeds)
  - explicit (Element, in 3 feeds)
  - image (Element, in 5 feeds)
  - keywords (Element, in 2 feeds)
  - name (Element, in 3 feeds)
  - owner (Element, in 3 feeds)
  - subtitle (Element, in 5 feeds)
  - summary (Element, in 5 feeds)
* https://www.npr.org/rss/ (in 6 feeds)
* https://www.rssboard.org/rss-specification (in 1362 feeds)
  - a (Element, in 5 feeds)
  - active (Attribute, in 1 feeds)
  - address (Attribute, in 3 feeds)
  - algorithm (Attribute, in 24 feeds)
  - amazon (Attribute, in 2 feeds)
  - android (Attribute, in 2 feeds)
  - audioId (Element, in 5 feeds)
  - author (Element, in 186 feeds)
  - bitrate (Attribute, in 1 feeds)
  - blockquote (Element, in 1 feeds)
  - blubrry (Attribute, in 10 feeds)
  - body (Element, in 2 feeds)
  - br (Element, in 3 feeds)
  - broadcastlimit (Element, in 6 feeds)
  - category (Element, in 460 feeds)
  - cbcListenUrl (Element, in 1 feeds)
  - channel (Element, in 1360 feeds)
  - channelExportDir (Element, in 5 feeds)
  - cloud (Element, in 7 feeds)
  - code (Attribute, in 1 feeds)
  - comments (Element, in 221 feeds)
  - content (Attribute, in 5 feeds)
  - contentLink (Element, in 1 feeds)
  - copyright (Element, in 921 feeds)
  - day (Element, in 1 feeds)
  - daysLive (Attribute, in 38 feeds)
  - deezer (Attribute, in 4 feeds)
  - description (Element, in 1350 feeds)
  - docs (Element, in 188 feeds)
  - domain (Attribute, in 22 feeds)
  - domain (Element, in 9 feeds)
  - duration (Attribute, in 51 feeds)
  - em (Element, in 2 feeds)
  - email (Attribute, in 7 feeds)
  - enclosure (Element, in 1340 feeds)
  - encoding (Attribute, in 5 feeds)
  - episode_mp3 (Element, in 1 feeds)
  - expression (Attribute, in 39 feeds)
  - fee (Attribute, in 3 feeds)
  - feed (Attribute, in 85 feeds)
  - ffmpeg (Element, in 6 feeds)
  - fileSize (Attribute, in 198 feeds)
  - frequency (Attribute, in 38 feeds)
  - generator (Element, in 898 feeds)
  - geo (Attribute, in 3 feeds)
  - googleplay (Attribute, in 1 feeds)
  - guid (Element, in 1349 feeds)
  - guid (Attribute, in 1 feeds)
  - head (Element, in 2 feeds)
  - height (Element, in 173 feeds)
  - height (Attribute, in 48 feeds)
  - hour (Element, in 2 feeds)
  - href (Attribute, in 1336 feeds)
  - html (Attribute, in 25 feeds)
  - html (Element, in 2 feeds)
  - http-equiv (Attribute, in 2 feeds)
  - id (Attribute, in 58 feeds)
  - iheart (Attribute, in 1 feeds)
  - ilink (Element, in 5 feeds)
  - image (Element, in 1119 feeds)
  - image (Attribute, in 2 feeds)
  - img (Attribute, in 95 feeds)
  - isDefault (Attribute, in 12 feeds)
  - isPermaLink (Attribute, in 1193 feeds)
  - isPermalink (Attribute, in 17 feeds)
  - item (Element, in 1351 feeds)
  - itunes (Attribute, in 80 feeds)
  - ituneslink (Element, in 1 feeds)
  - itunesowner (Element, in 2 feeds)
  - key (Attribute, in 62 feeds)
  - label (Attribute, in 21 feeds)
  - lame (Element, in 6 feeds)
  - lang (Attribute, in 12 feeds)
  - language (Element, in 1353 feeds)
  - lastBuildDate (Element, in 989 feeds)
  - latitude (Element, in 1 feeds)
  - launchDate (Attribute, in 1 feeds)
  - length (Attribute, in 1329 feeds)
  - li (Element, in 1 feeds)
  - link (Element, in 1359 feeds)
  - liveItems (Attribute, in 1 feeds)
  - logo (Element, in 1 feeds)
  - longitude (Element, in 1 feeds)
  - managingEditor (Element, in 364 feeds)
  - managingeditor (Element, in 5 feeds)
  - medium (Attribute, in 96 feeds)
  - meta (Element, in 2 feeds)
  - method (Attribute, in 3 feeds)
  - name (Attribute, in 44 feeds)
  - owner (Attribute, in 8 feeds)
  - p (Element, in 3 feeds)
  - pandora (Attribute, in 1 feeds)
  - path (Attribute, in 7 feeds)
  - port (Attribute, in 7 feeds)
  - position (Element, in 1 feeds)
  - pre (Element, in 1 feeds)
  - protocol (Attribute, in 7 feeds)
  - pubDate (Element, in 1355 feeds)
  - public (Attribute, in 1 feeds)
  - region (Attribute, in 1 feeds)
  - registerProcedure (Attribute, in 7 feeds)
  - rel (Attribute, in 1189 feeds)
  - relationship (Attribute, in 7 feeds)
  - role (Attribute, in 247 feeds)
  - rss (Element, in 1360 feeds)
  - scheme (Attribute, in 242 feeds)
  - size (Attribute, in 1 feeds)
  - skipDays (Element, in 1 feeds)
  - skipHours (Element, in 2 feeds)
  - slug (Attribute, in 1 feeds)
  - source (Element, in 12 feeds)
  - split (Attribute, in 3 feeds)
  - spotify (Attribute, in 8 feeds)
  - src (Attribute, in 48 feeds)
  - start (Attribute, in 187 feeds)
  - status (Attribute, in 12 feeds)
  - stitcher (Attribute, in 9 feeds)
  - strike (Element, in 1 feeds)
  - strong (Element, in 1 feeds)
  - suggested (Attribute, in 3 feeds)
  - systemId (Attribute, in 38 feeds)
  - text (Attribute, in 1301 feeds)
  - title (Element, in 1360 feeds)
  - title (Attribute, in 332 feeds)
  - toPubDate (Element, in 5 feeds)
  - ttl (Element, in 249 feeds)
  - tunein (Attribute, in 9 feeds)
  - tv (Attribute, in 21 feeds)
  - type (Attribute, in 1357 feeds)
  - typicalDuration (Attribute, in 1 feeds)
  - ul (Element, in 1 feeds)
  - uri (Attribute, in 191 feeds)
  - url (Element, in 1115 feeds)
  - url (Attribute, in 1344 feeds)
  - version (Attribute, in 1360 feeds)
  - webMaster (Element, in 201 feeds)
  - webmaster (Element, in 14 feeds)
  - width (Element, in 172 feeds)
  - width (Attribute, in 48 feeds)
* https://www.spotify.com/ns/rss (in 1 feeds)
  - countryOfOrigin (Element, in 1 feeds)
* https://www.w3.org/2005/Atom (in 3 feeds)
  - link (Element, in 3 feeds)
* https://www.w3.org/TR/REC-xml/#syntax (in 2 feeds)

(Usage numbers for the https://www.rssboard.org/rss-specification namespace have to be taken with a grain of salt here, because failures are also recorded as "using" this namespace for now. Every element/attribute without a namespace is also assigned to this NS.)
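The "counted once per feed" rule can be sketched roughly as follows. This is a minimal Java sketch and not the scraper's actual code: the class and method names (`FeedUsageStats`, `namesInFeed`, `countFeeds`) are made up for illustration. It collects the names seen in one feed into a set before merging them into the totals, and files names without a namespace under the RSS specification URI, as described in the note above.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class FeedUsageStats {

    // Names without a namespace are filed under this URI (see the note above).
    static final String NO_NAMESPACE = "https://www.rssboard.org/rss-specification";

    // Returns every distinct "namespace localName (kind)" key used in one feed.
    static Set<String> namesInFeed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so getNamespaceURI() is populated
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Set<String> names = new HashSet<>();
            collect(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Feed could not be parsed", e);
        }
    }

    // Walks the element tree and records each element and attribute name.
    private static void collect(Element element, Set<String> names) {
        names.add(key(element, "Element"));
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            names.add(key(attributes.item(i), "Attribute"));
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                collect((Element) child, names);
            }
        }
    }

    private static String key(Node node, String kind) {
        String ns = node.getNamespaceURI() == null ? NO_NAMESPACE : node.getNamespaceURI();
        return ns + " " + node.getLocalName() + " (" + kind + ")";
    }

    // Counts, for each key, the number of feeds that use it at least once.
    static Map<String, Integer> countFeeds(List<String> feeds) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String xml : feeds) {
            for (String name : namesInFeed(xml)) {
                counts.merge(name, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Because the per-feed names go through a set first, an element that appears in every `<item>` of a feed still adds only 1 to its total.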

Also interesting are the prefixes that are declared for the additional namespaces:

Atom            http://www.w3.org/2005/Atom
acast           https://schema.acast.com/1.0/
admin           http://webns.net/mvcb/
amp             http://www.adobe.com/amp/1.0
anchor          https://anchor.fm/xmlns
ard             http://www.ard.de/ardNamespace
art19           https://art19.com/xmlns/rss-extensions/1.0
atom            http://www.w3.org/2005/Atom
atom            http://www.w3.org/2005/Atom/
atom            https://www.w3.org/2005/Atom
atom10          http://www.w3.org/2005/Atom
audioboom       https://audioboom.com/rss/1.0
bitlove         http://bitlove.org
blogChannel     http://backend.userland.com/blogChannelModule
blogger         http://schemas.google.com/blogger/2008
cba             https://cba.fro.at/help#feeds
cc              http://web.resource.org/cc/
content         http://purl.org/rss/1.0/modules/content/
content         https://www.w3.org/TR/REC-xml/#syntax
content         https://purl.org/rss/1.0/modules/content/
content         http://purl.org/rss/1.0/modules/content
creativeCommons http://backend.userland.com/creativeCommonsRssModule
dc              http://purl.org/dc/elements/1.1/
dcterms         http://purl.org/dc/terms/
feedburner      http://rssnamespace.org/feedburner/ext/1.0
feedpress       https://feed.press/xmlns
fh              http://purl.org/syndication/history/1.0
fireside        http://fireside.fm/modules/rss/fireside
foaf            http://xmlns.com/foaf/0.1/
frn             http://www.freie-radios.net/namespaces/frn
fyyd            https://fyyd.de/fyyd-ns/
gd              http://schemas.google.com/g/2005
geo             http://www.w3.org/2003/01/geo/wgs84_pos#
georss          http://www.georss.org/georss
googleplay      http://www.google.com/schemas/play-podcasts/1.0
googleplay      http://www.google.com/schemas/play-podcasts/1.0/
googleplay      https://www.google.com/schemas/play-podcasts/1.0
googleplay      http://www.google.com/schemas/play-podcasts/1.0/play-podcasts.xsd
itunes          http://www.itunes.com/dtds/podcast-1.0.dtd
itunes          http://www.itunes.com/DTDs/Podcast-1.0.dtd
itunes          https://www.itunes.com/dtds/podcast-1.0.dtd
itunesu         http://www.itunesu.com/feed
iweb            http://www.apple.com/iweb
jwplayer        http://developer.longtailvideo.com/
media           http://search.yahoo.com/mrss/
media           http://www.rssboard.org/media-rss
media           http://search.yahoo.com/mrss
npr             https://www.npr.org/rss/
nprml           https://api.npr.org/nprml
og              http://ogp.me/ns#
omny            https://omny.fm/rss-extensions
openSearch      http://a9.com/-/spec/opensearchrss/1.0/
pa              http://podcastaddict.com
pingback        https://podping.info/specification/1
podaccess       https://schema-access.acast.com/1.0/
podaccess       https://access.acast.com/schema/1.0/
podcast         https://podcastindex.org/namespace/1.0
podcast         https://github.com/Podcastindex-org/podcast-namespace/blob/main/docs/1.0.md
podcastRF       http://radiofrance.fr/Lancelot/Podcast#
ppg             http://bbc.co.uk/2009/01/ppgRss
psc             http://podlove.org/simple-chapters
psc             https://podlove.org/simple-chapters/
psc             https://podlove.org/simple-chapters
rawvoice        http://www.rawvoice.com/rawvoiceRssModule/
rdf             http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs            http://www.w3.org/2000/01/rdf-schema#
related         http://vemedio.com/dtds/atom/related-1.0.dtd
sc              http://podlove.org/simple-chapters
schema          http://schema.org/
sioc            http://rdfs.org/sioc/ns#
sioct           http://rdfs.org/sioc/types#
skos            http://www.w3.org/2004/02/skos/core#
slash           http://purl.org/rss/1.0/modules/slash/
spotify         http://www.spotify.com/ns/rss
spotify         https://www.spotify.com/ns/rss
sy              http://purl.org/rss/1.0/modules/syndication/
taxo            http://purl.org/rss/1.0/modules/taxonomy/
thr             http://purl.org/syndication/thread/1.0
trackback       http://madskills.com/public/xml/rss/module/trackback/
wfw             http://wellformedweb.org/CommentAPI/
xhtml           http://www.w3.org/1999/xhtml
xmlns           http://www.w3.org/2005/Atom
xmlns           com-wordpress:feed-additions:1
xmlns           http://pipes.yahoo.com
xmlns           https://podlove.de/simple-chapters
xsd             http://www.w3.org/2001/XMLSchema#
xsi             http://www.w3.org/2001/XMLSchema-instance

mpgirro commented 3 years ago

It's sad to see that major namespaces are used with a wrong URI (Atom, iTunes, Google Play, RSS 1.0 Content, Podlove Simple Chapters, Media RSS).

I'm wondering whether we should do something about this. We already have the capability to recognise several URIs for a namespace, so technically we could add the wrong ones and Stalla could parse these elements as well. On write, we would then use the correct namespace. This would "fix" broken feeds, but of course mess with our transparent parse/write policy, making it a bad idea I guess.

Alternatively, this could be a new feature like #46 ModelValidator, but for feeds (FeedValidator)?
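To make the "several URIs for one namespace" idea concrete, here is a minimal sketch in Java. It is not Stalla's actual implementation: `NamespaceAliases`, `canonical` and `isKnownVariant` are hypothetical names, and the table only lists a few of the variants from the results above, mapped to the URI from the respective specification.

```java
import java.util.Map;

// Hypothetical lookup table, for illustration only.
final class NamespaceAliases {

    private static final String ITUNES = "http://www.itunes.com/dtds/podcast-1.0.dtd";
    private static final String ATOM = "http://www.w3.org/2005/Atom";
    private static final String PSC = "http://podlove.org/simple-chapters";

    // Variant URI observed in the results above -> URI from the specification.
    private static final Map<String, String> ALIASES = Map.of(
            "http://www.itunes.com/DTDs/Podcast-1.0.dtd", ITUNES,
            "https://www.itunes.com/dtds/podcast-1.0.dtd", ITUNES,
            "http://www.w3.org/2005/Atom/", ATOM,
            "https://www.w3.org/2005/Atom", ATOM,
            "https://podlove.org/simple-chapters", PSC,
            "https://podlove.org/simple-chapters/", PSC);

    // Returns the specification URI for a known variant, or the input unchanged.
    static String canonical(String uri) {
        return ALIASES.getOrDefault(uri, uri);
    }

    // A validator could use this to report a wrong URI instead of silently fixing it.
    static boolean isKnownVariant(String uri) {
        return ALIASES.containsKey(uri);
    }
}
```

The same table could serve both options discussed here: a lenient parser would call `canonical`, while a FeedValidator would call `isKnownVariant` and only report the finding.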

rock3r commented 3 years ago

Adding to this, I have also found someone else who did a similar analysis a few years back:

https://github.com/mdewilde/podcast-parser/blob/master/corpus-stats

That Java lib also supports a bunch of namespaces that we don't support yet. It may be worth taking a look at what they deemed worth supporting in terms of DC.

Full list of their supported NS/attributes here

rock3r commented 3 years ago

> It's sad to see that major namespaces are used with a wrong URI (Atom, iTunes, Google Play, RSS 1.0 Content, Podlove Simple Chapters, Media RSS).
>
> I'm wondering whether we should do something about this. We already have the capability to recognise several URIs for a namespace, so technically we could add the wrong ones and Stalla could parse these elements as well. On write, we would then use the correct namespace. This would "fix" broken feeds, but of course mess with our transparent parse/write policy, making it a bad idea I guess.
>
> Alternatively, this could be a new feature like #46 ModelValidator, but for feeds (FeedValidator)?

I would like to maintain the transparency by default. Maybe we could have a special version of parse which takes in some options, including things like "attempt to repair namespaces"? Maybe even taking in a parsing pre-processor, which can manipulate the feed DOM before it's parsed. This way it could be relatively easy to inspect and fix namespaces.
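A pre-processor of this kind could look roughly like the sketch below. It is only an illustration in Java under stated assumptions: `NamespaceRepair`, `parse` and `repairNamespaces` are hypothetical names and not part of Stalla's API. It uses the standard DOM Level 3 `Document.renameNode` call to move elements from a known wrong URI to the correct one and also rewrites the matching `xmlns` declarations. For brevity it handles elements only, not namespaced attributes.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical pre-processor, for illustration only.
final class NamespaceRepair {

    private static final String XMLNS = "http://www.w3.org/2000/xmlns/";

    // Parses a feed into a namespace-aware DOM.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Feed could not be parsed", e);
        }
    }

    // Rewrites every element whose namespace URI is a key of 'aliases'
    // to the mapped URI, modifying the document in place.
    static void repairNamespaces(Document doc, Map<String, String> aliases) {
        repair(doc, doc.getDocumentElement(), aliases);
    }

    private static void repair(Document doc, Element element, Map<String, String> aliases) {
        // Fix the xmlns declarations so the document stays consistent.
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            String value = attribute.getNodeValue();
            if (XMLNS.equals(attribute.getNamespaceURI()) && aliases.containsKey(value)) {
                attribute.setNodeValue(aliases.get(value));
            }
        }

        // Move the element itself to the mapped namespace, keeping its qualified name.
        String ns = element.getNamespaceURI();
        Element current = element;
        if (ns != null && aliases.containsKey(ns)) {
            current = (Element) doc.renameNode(element, aliases.get(ns), element.getNodeName());
        }

        // Recurse; the next sibling is read first in case a child gets replaced.
        Node child = current.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child instanceof Element) {
                repair(doc, (Element) child, aliases);
            }
            child = next;
        }
    }
}
```

Keeping this as an opt-in step in front of the parser would leave the default parse/write behaviour transparent, which is the point made above.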

mpgirro commented 3 years ago

> I would like to maintain the transparency by default. Maybe we could have a special version of parse which takes in some options, including things like "attempt to repair namespaces"? Maybe even taking in a parsing pre-processor, which can manipulate the feed DOM before it's parsed. This way it could be relatively easy to inspect and fix namespaces.

I like this idea

mpgirro commented 3 years ago

> Adding to this, I have also found someone else who did a similar analysis a few years back:
>
> https://github.com/mdewilde/podcast-parser/blob/master/corpus-stats
>
> That Java lib also supports a bunch of namespaces that we don't support yet. It may be worth taking a look at what they deemed worth supporting in terms of DC.

Interesting. They've found some namespaces I haven't encountered yet, but their data set is also much larger. I will try to get more out of the queried directories in the future.

rock3r commented 3 years ago

> > Adding to this, I have also found someone else who did a similar analysis a few years back: https://github.com/mdewilde/podcast-parser/blob/master/corpus-stats That Java lib also supports a bunch of namespaces that we don't support yet. It may be worth taking a look at what they deemed worth supporting in terms of DC.
>
> Interesting. They've found some namespaces I haven't encountered yet, but their data set is also much larger. I will try to get more out of the queried directories in the future.

It was just a lucky coincidence I was looking at this as you opened this issue ahahah

rock3r commented 3 years ago

By the way, the author of that lib seems to have a few interesting repos we may look at. Mostly, for this issue, a Java application for finding podcast feed URLs: https://github.com/mdewilde/podcastfinder

mpgirro commented 3 years ago

Oh boy, there is a problem in the recording I think. All attributes are either assigned to the http://www.w3.org/2000/xmlns/ or the https://www.rssboard.org/rss-specification namespace. Damn it...

Will add channel/item distinction as well.

mpgirro commented 3 years ago

Reworked the scraper a bit, and here are some new results. This time I used this gist with ~64k unique feeds scraped from iTunes. I pre-filtered them yesterday and used only the ones that were actually reachable (~28k). The results are too large to post here as text, so I'm attaching the various output formats as an archive: 20210415_014430.zip

XHTML is now ignored if declared correctly to improve readability, but tons of feeds are just kaput, making the result rather hard to read for a large input set.

Still need to give the podcastfinder tool a try.

mpgirro commented 3 years ago

Used the podcastfinder and added the produced feeds to the previous list I had. New results are based on 45550 successfully processed feeds. Full results are here: 20210420_024820.zip (including more and improved output formats).

I'll post more detailed observations in the respective issues of the namespaces in the next few days.

For now, here is a list of namespaces that are declared in at least 0,5% of all processed feeds:

96,6%    http://www.itunes.com/dtds/podcast-1.0.dtd
85,6%    http://www.w3.org/2005/Atom
54,5%    http://purl.org/rss/1.0/modules/content/
39,8%    http://search.yahoo.com/mrss/
37,8%    http://purl.org/dc/elements/1.1/
27,8%    http://www.google.com/schemas/play-podcasts/1.0
24,6%    http://wellformedweb.org/CommentAPI/
20,4%    http://rssnamespace.org/feedburner/ext/1.0
12,9%    http://purl.org/rss/1.0/modules/syndication/
12,0%    http://purl.org/rss/1.0/modules/slash/
 9,0%    http://purl.org/dc/terms/
 8,8%    https://podcastindex.org/namespace/1.0
 7,7%    http://www.w3.org/1999/02/22-rdf-syntax-ns#
 7,1%    http://web.resource.org/cc/
 7,0%    https://anchor.fm/xmlns
 6,1%    http://www.rawvoice.com/rawvoiceRssModule/
 5,5%    http://www.georss.org/georss
 5,1%    http://www.w3.org/2003/01/geo/wgs84_pos#
 4,3%    http://backend.userland.com/creativeCommonsRssModule
 4,2%    http://a9.com/-/spec/opensearchrss/1.0/
 3,9%    http://purl.org/syndication/thread/1.0
 3,4%    http://www.spotify.com/ns/rss
 3,3%    https://github.com/Podcastindex-org/podcast-namespace/blob/main/docs/1.0.md
 3,0%    https://schema.acast.com/1.0/
 2,3%    http://www.w3.org/XML/1998/namespace
 1,9%    com-wordpress:feed-additions:1
 1,7%    http://www.itunes.com/DTDs/Podcast-1.0.dtd
 1,6%    https://podlove.org/simple-chapters/
 1,5%    http://schemas.google.com/blogger/2008
 1,5%    http://schemas.google.com/g/2005
 1,5%    https://omny.fm/rss-extensions
 1,4%    http://podlove.org/simple-chapters
 1,2%    https://podping.info/specification/1
 1,2%    https://schema-access.acast.com/1.0/
 1,2%    http://purl.org/syndication/history/1.0
 0,9%    http://bbc.co.uk/2009/01/ppgRss
 0,7%    https://art19.com/xmlns/rss-extensions/1.0
 0,7%    http://www.google.com/schemas/play-podcasts/1.0/
 0,5%    http://www.rssboard.org/media-rss
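For reference, a list like the one above can be derived from the per-feed counts with a simple share calculation. The following Java sketch is an assumption about how this might be done, not the scraper's code; `NamespaceShares` and `sharesAtLeast` are hypothetical names. It computes `100 * feedsDeclaringNamespace / totalFeeds` and keeps only the namespaces at or above a threshold, ordered by frequency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class NamespaceShares {

    // Returns namespace -> percentage of feeds, most frequent first,
    // dropping every namespace below minPercent.
    static Map<String, Double> sharesAtLeast(
            Map<String, Integer> feedCounts, int totalFeeds, double minPercent) {
        Map<String, Double> result = new LinkedHashMap<>();
        feedCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(entry -> {
                    double percent = 100.0 * entry.getValue() / totalFeeds;
                    if (percent >= minPercent) {
                        result.put(entry.getKey(), percent);
                    }
                });
        return result;
    }
}
```

As a sanity check against the first, smaller run: 1322 of 1362 feeds declaring the iTunes namespace gives 100 * 1322 / 1362, or roughly 97,1%.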

Namespaces we do not support yet and that have at least one element/attribute declared in the <channel> or an <item> of at least one feed (ordered by namespace frequency):

http://search.yahoo.com/mrss/
http://purl.org/dc/elements/1.1/
http://wellformedweb.org/CommentAPI/
http://rssnamespace.org/feedburner/ext/1.0
http://purl.org/rss/1.0/modules/syndication/
http://purl.org/rss/1.0/modules/slash/
http://purl.org/dc/terms/
http://www.w3.org/1999/02/22-rdf-syntax-ns#
https://anchor.fm/xmlns
http://www.rawvoice.com/rawvoiceRssModule/
http://www.georss.org/georss
http://www.w3.org/2003/01/geo/wgs84_pos#
http://backend.userland.com/creativeCommonsRssModule
http://a9.com/-/spec/opensearchrss/1.0/
http://purl.org/syndication/thread/1.0
http://www.spotify.com/ns/rss
https://schema.acast.com/1.0/
http://www.w3.org/XML/1998/namespace
com-wordpress:feed-additions:1
http://schemas.google.com/blogger/2008
http://schemas.google.com/g/2005
https://omny.fm/rss-extensions
https://podping.info/specification/1
https://schema-access.acast.com/1.0/
http://bbc.co.uk/2009/01/ppgRss

But the actual element/attribute usage of these namespaces is <0,1%:

http://www.w3.org/1999/02/22-rdf-syntax-ns#
http://schemas.google.com/blogger/2008
http://schemas.google.com/g/2005
https://schema-access.acast.com/1.0/

Some general observations:

rock3r commented 3 years ago

Thanks for the analysis! It's pretty terrible to see how many errors there are in feeds. It's a good thing we built in tolerance to malformed namespaces...

It's also pretty interesting to see that there is essentially no usage of the Spotify tags. Not even Spotify's own Gimlet podcasts use them.

mpgirro commented 3 years ago

Yes, the Spotify tags also surprised me.

Note, however, that these results are now heavily centered on US/English feeds, because podcastfinder has these settings hardcoded...

I've noticed that for smaller datasets (based on "large" Fyyd/Panoptikum results that have way more German content), the namespace frequency looks quite different for namespaces that are in the 1,X% range in these results (e.g. Podlove Simple Chapters and Feed History are very high, because the Podlove Publisher CMS is extremely popular in the German-speaking area).

At some point I'll integrate the internal API of podcastfinder into the scraper to have better backing data. For now I think it's also worth paying some attention to the namespaces that are further down in our results, and checking whether there are some useful specifications hidden in there (e.g. #84)

rock3r commented 3 years ago

What's still left to do on this one? Just the podcastfinder API integration?

mpgirro commented 3 years ago

Right now there is on the table:

I'm also not fully done with studying the result data yet, and unfortunately I won't be able to make much time for another 2-3 weeks.

mpgirro commented 3 years ago

Do you wanna have access to the repo @rock3r? In case you want to pick this up, or have something additional to add.

rock3r commented 3 years ago

Sure, although I'll be focussing on getting 1.1.0 out the door first :)

mpgirro commented 3 years ago

I've added you to the repo :)