mpgirro / stalla

A Kotlin and Java library for RSS podcast feeds

https://stalla.dev

BSD 3-Clause "New" or "Revised" License

Dublin Core namespace support #28

Open mpgirro opened 3 years ago

mpgirro commented 3 years ago

From what I can remember, `dc:creator` seems to be the most-used element. More research needed.

mpgirro commented 3 years ago

Specification: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#

rock3r commented 3 years ago

How do we collect data about which tags are actually used, and which ones we need to support? There are 15 in the spec, that's a lot :)

mpgirro commented 3 years ago

For starters, we can stick to the gPodder recommendation: https://github.com/gpodder/podcast-feed-best-practice/blob/master/podcast-feed-best-practice.md

The question of which tags are actually used is a good one, though. Some time ago I had the idea of writing a little tool that just reads a lot of feeds and compiles statistics on the namespaces used and the tags used per namespace. This could be pretty useful, also for other namespaces we might not have considered yet.
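A rough sketch of what such a tool could look like, in Python for brevity. Everything here is an assumption for illustration: the function names `split_tag` and `tag_statistics`, the choice to count each tag once per feed, and the inline sample feed. Fetching the feeds is left out; the function just takes the XML as strings.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def split_tag(tag):
    """Split ElementTree's '{namespace-uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag  # element without a namespace (plain RSS)


def tag_statistics(feeds):
    """Count, per namespace URI, in how many feeds each tag occurs."""
    stats = defaultdict(Counter)
    for xml_text in feeds:
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            continue  # skip feeds that are not well-formed XML
        # A set, so a tag repeated within one feed is counted only once.
        seen = {split_tag(element.tag) for element in root.iter()}
        for uri, local in seen:
            stats[uri][local] += 1
    return stats


# Hypothetical sample input: one tiny feed and one broken document.
DC = "http://purl.org/dc/elements/1.1/"
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example</title>
    <item>
      <title>Episode 1</title>
      <dc:creator>Jane Doe</dc:creator>
    </item>
    <item>
      <title>Episode 2</title>
      <dc:creator>Jane Doe</dc:creator>
    </item>
  </channel>
</rss>"""

stats = tag_statistics([SAMPLE_FEED, "<not-xml"])
for uri, counter in stats.items():
    print(uri or "(no namespace)", dict(counter))
```

Keying the result by namespace URI rather than by prefix means feeds that bind Dublin Core to a prefix other than `dc` still end up in the same bucket.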

rock3r commented 3 years ago

I was thinking about a little scraper as well; we could maybe point it at some podcast charts and collect the feeds from there. If you're up for that, having an idea of which tags are actually used may give us a priority list for the implementation.

rock3r commented 3 years ago

According to the gPodder document, the DC 1.1 elements we should support are:

| `<channel>` | `<item>` |
| --- | --- |
| `<dc:title>` | `<dc:title>` |
| `<dc:description>` | `<dc:description>` |
| `<dc:creator>` | `<dc:creator>` |
| `<dc:contributor>` | `<dc:contributor>` |
| `<dc:publisher>` | `<dc:publisher>` |
| `<dc:subject>` | `<dc:subject>` |
| `<dc:language>` | |
| `<dc:date>` | |

The tags for `<item>`s are a subset of the ones that apply to `<channel>` (items don't have `dc:language` and `dc:date`).
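To make the subset relationship concrete, here is a small sketch in Python. It is not stalla's API: the names `CHANNEL_ELEMENTS`, `ITEM_ELEMENTS` and `dc_values` are made up for this example, and the sample feed is invented. The namespace URI is the one for the DC 1.1 element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Element sets as listed above; the item set is reused to build the channel set.
ITEM_ELEMENTS = frozenset(
    {"title", "description", "creator", "contributor", "publisher", "subject"}
)
CHANNEL_ELEMENTS = ITEM_ELEMENTS | {"language", "date"}


def dc_values(parent, allowed):
    """Collect the text of the allowed DC elements that are direct children of parent."""
    prefix = "{" + DC_NS + "}"
    values = {}
    for child in parent:
        if not child.tag.startswith(prefix):
            continue  # not a Dublin Core element
        name = child.tag[len(prefix):]
        if name in allowed and child.text:
            values.setdefault(name, []).append(child.text.strip())
    return values


# Hypothetical sample feed, for illustration only.
feed = ET.fromstring(
    """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <dc:language>en</dc:language>
    <dc:creator>Example Network</dc:creator>
    <item>
      <dc:creator>Jane Doe</dc:creator>
      <dc:date>2021-01-01</dc:date>
    </item>
  </channel>
</rss>"""
)
channel = feed.find("channel")
item = channel.find("item")

channel_dc = dc_values(channel, CHANNEL_ELEMENTS)
item_dc = dc_values(item, ITEM_ELEMENTS)  # dc:date is skipped on items
print(channel_dc, item_dc)
```

Values are collected into lists because nothing stops a feed from repeating an element such as `dc:creator` under the same parent.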

rock3r commented 3 years ago

On top of that, there may be a few DC Terms elements that we want to support, but I don't really know which ones are even used. The scraper can probably help us there.

mpgirro commented 3 years ago

Thanks for doing the analysis of the gPodder document. I agree that the scraper would be really useful regarding the terms. I hope I can make some time in the next few days to give it a try.

mpgirro commented 3 years ago

Update on the scraper: Have rigged up a prototype. Should be easy to extend from this point on to get all the info we want. Expect some first results in the next few days.

rock3r commented 3 years ago

Great to know! Thanks for tackling this :)

mpgirro commented 3 years ago

The analysis results show the following occurrences of DC elements:

| `<channel>` | `<item>` | `<image>` |
| --- | --- | --- |
| creator (0.3%) | creator (27.7%) | creator (1 feed) |
| date (0.2%) | date (0.3%) | date (1 feed) |
| rights (0.1%) | rights (< 0.1%) | |
| language (0.1%) | language (< 0.1%) | language (1 feed) |
| title (< 0.1%) | title (< 0.1%) | title (1 feed) |
| subject (< 0.1%) | subject (0.1%) | subject (1 feed) |
| description (< 0.1%) | description (< 0.1%) | description (1 feed) |
| contributor (< 0.1%) | contributor (< 0.1%) | |
| publisher (< 0.1%) | publisher (< 0.1%) | publisher (1 feed) |
| coverage (< 0.1%) | | |
| type (< 0.1%) | type (< 0.1%) | |
| | format (< 0.1%) | format (1 feed) |
| | date.Taken (< 0.1%) | |
| | modified (< 0.1%) | |
| | identifier (< 0.1%) | |
| | contributor (< 0.1%) | |
| | source (< 0.1%) | |

I guess we can ignore the one `<image>` occurrence 😄
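For anyone who wants to reproduce a breakdown like this on their own set of feeds, here is a minimal sketch in Python. It is not the scraper used for the numbers above. It assumes the percentages mean "share of parseable feeds in which the element appears at least once under that parent"; the function name `dc_usage_by_parent` and the sample feeds are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DC_PREFIX = "{http://purl.org/dc/elements/1.1/}"


def dc_usage_by_parent(feeds):
    """Return {(parent tag, DC element): share of parseable feeds using it}."""
    counts = Counter()
    total = 0
    for xml_text in feeds:
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            continue  # malformed feeds do not count towards the total
        total += 1
        seen = set()  # each (parent, element) pair counts once per feed
        for parent in root.iter():
            for child in parent:
                if child.tag.startswith(DC_PREFIX):
                    seen.add((parent.tag, child.tag[len(DC_PREFIX):]))
        counts.update(seen)
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


# Two hypothetical feeds: both carry a channel-level dc:date,
# only the first has dc:creator on its item.
HEADER = '<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel>'
FOOTER = "</channel></rss>"
FEED_A = (
    HEADER
    + "<dc:date>2021-01-01</dc:date>"
    + "<item><dc:creator>Jane Doe</dc:creator></item>"
    + FOOTER
)
FEED_B = HEADER + "<dc:date>2021-02-01</dc:date><item><title>Ep</title></item>" + FOOTER

shares = dc_usage_by_parent([FEED_A, FEED_B])
for (parent, element), share in sorted(shares.items()):
    print(f"<{parent}> {element}: {share:.1%}")
```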