rubycocos / feedparser

feedparser gem - (universal) web feed parser and normalizer (XML w/ Atom or RSS, JSON Feed, HTML w/ Microformats e.g. h-entry/h-feed or Feed.HTML, Feed.TXT w/ YAML, JSON or INI & Markdown, etc.)
Creative Commons Zero v1.0 Universal
164 stars 11 forks source link

Add oga parsing to generate media attachement fields #10

Closed onli closed 4 years ago

onli commented 4 years ago

This is an unfinished proposal. Missing testcases, awaiting discussion/approval.

As mentioned in https://github.com/feedparser/feedparser/issues/4, feedparser is currently ignoring the yahoo media: items, which are relevant since Youtube is using them to transport meta information. See https://github.com/pipes-digital/pipes/issues/64 for an example use case.

Sadly you were correct in your last comment back then: The issue is that the RSS module is not transporting the namespace elements, at least as far as I could see.

Now, this code does three things:

  1. It extends the attachment object for the elements youtube uses
  2. It adds Oga as dependency, which I'd propose as an easier to install alternative to nokogiri.
  3. It then uses Oga to parse the raw XML to set the additional attachment fields.

Note that I did not see a straightforward way how to use the community field and subfields, and the thumbnail has additional information like height and width that should be preserved (in a hash? an object?).

What do you think, should we go forward with that? I did test only the ATOM codepath so far (because of Youtube being ATOM) and it worked fine.

geraldb commented 4 years ago

Sorry for the delay - the plan is to eventually replace the std rss parser - with nokogiri or oga. It's not happening because I have many other projects. Anyways, instead of a "full rewrite" that might never materialize - your idea of passing in a the raw feed is a good start and start small and step by step. Feel free to merge. Please continue, Keep it up.

About:

thumbnail has additional information like height and width that should be preserved (in a hash? an object?).

Usually the idea is to use structs (objects) thus, add new "inline" fields thumbnail_height, etc or new structs (object) and only as fallback (or as a quick start) the access to the raw hash - but the idea is to have always both (raw and "normalized/unified" structs).

onli commented 4 years ago

Okay. I'm gonna add testcases and complete the PR :)

onli commented 4 years ago

Okay. I added testcases, which was good because it highlighted I had misunderstood the spec, which btw Youtube ignores - luckily in a way that we can catch without colliding with the spec. I did not find a live RSS media feed, so I used an offline example.

Since I will need you in any case for making a release out of this if possible, I will let you press the merge button :)

geraldb commented 4 years ago

Thanks for the initiative. Looks great. Happy to merge.

For pushing a new gem unfortunately I have to figure out how deal with the new oga native c-extension dependency - the fastest way might be to keep it optional for now (and only if you require and install oga you will get the new media support - works like the html microformats support).

I know it is kind of late for a discussion - do you have a strong preference for oga? The repo says the author has not much time for updates, thus, if we already require a c-extension why not go it all the way and use nokogiri (it is already a optional dependeny via microformats), thus, this questions.

Let me know your thoughts - anyways - no worries (as plan b we can use oga as is and than move to nokogiri - not sure what the most popular feedreader, that is, feedjira uses? or depends on? might be a good idea to just use the same to make it easier to copy stuff over (itunes, etc. ) that is now unsupported. Anyways, as said we can split the thing in two tracks (fast track org now and than clean-up / change later or maybe nokogiri from the beginning).

onli commented 4 years ago

I prefer Oga here exactly because it is so much easier to install. It doesn't have the system requirements Nokogiri has. https://gitlab.com/yorickpeterse/oga#why-another-htmlxml-parser goes into that a bit.

It's really easy to swap out if you prefer (and no offense taken if you require that, I can change this PR to Nokogiri), but I think Oga is the best way to minimize issues here. Is the C-extension an issue when publishing the gem? I honestly thought it won't have an impact.

I'd also volunteer to port the microformat code over from nokogiri to oga, if you want to use only one xml parser in the gem and that makes this change easier?

Feedjira is using sax-machine which uses nokogiri. Very clever, but very different approach.

geraldb commented 4 years ago

No worries. Thanks for the update and your thoughts. Let's start with oga. I will only change the gemspec and make it optional, that is, you have to install it "by hand" or add it to the Gemfile and in the code I will add if defined?( Orga ) or such. Please, allow a couple of days to get around later this week to push out a new gem version.

The new / updated parser machinery might be a project for this summer (at the earliest).

geraldb commented 4 years ago

@onli FYI: I made the oga gem in a soft dependency (that is, you have to require 'oga' to turn on your new media attachment machinery) and pushed a new gem version, that is, v2.2.0. See https://github.com/feedparser/feedparser/commit/e9aca1886007f8f67d12c87cf924a361306089dc for the changes and for the gem https://rubygems.org/gems/feedparser. Thanks again for adding the new media attachement fields.

onli commented 4 years ago

Okay, thank you as well