google-code-export / feedparser

Automatically exported from code.google.com/p/feedparser

Support for WordPress export format namespace #425

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Take any exported RSS feed from a WordPress blog (Admin > Tools > Export)
2. Run it through feedparser
3. Try to extract the custom WordPress tags (prefixed 'wp:'). Some of them 
appear to 'clash' with feedparser's own handling, even though they are 
namespaced with 'wp:'.

Examples that are ignored (tags that also have child elements):

<wp:author>
<wp:category>

Examples that are parsed correctly:

<wp:wxr_version>
<wp:base_site_url>
<wp:base_blog_url>

What is the expected output? What do you see instead?

I would expect to see:

feed.wp_author = []
feed.wp_author[i].author_id = 1
feed.wp_author[i].author_login = 'admin'
feed.wp_author[i].author_email = 'foo@bar.com'
feed.wp_author[i].author_display_name = 'Foo Bar'
feed.wp_author[i].author_first_name = 'Foo'
feed.wp_author[i].author_last_name = 'Bar'

What version of the product are you using? On what operating system?

feedparser 5.1.3
Python 2.7.5
Mac OS X 10.9.3

Please provide any additional information below.

If this is beyond the scope of feedparser (i.e. the WordPress export is not 
considered a valid RSS feed format), then please add some generic documentation 
to the feedparser site explaining how to monkey patch the feedparser code. In 
particular, how to handle elements that need to become lists, such as the 
wp_author element described in this issue.
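
For comparison, here is a minimal sketch of building such a list without feedparser, using the standard library's xml.etree.ElementTree. This is a different technique from the monkey patching asked about above, not a feedparser feature. The function name read_wp_authors, the SAMPLE document, and the choice of the 1.2 namespace URI (quoted later in this thread) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the export declares the 1.2 namespace quoted later in this thread.
WP_NS = 'http://wordpress.org/export/1.2/'

# Hypothetical, cut-down stand-in for a real WordPress export file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <wp:author>
      <wp:author_id>1</wp:author_id>
      <wp:author_login>admin</wp:author_login>
      <wp:author_email>foo@bar.com</wp:author_email>
    </wp:author>
  </channel>
</rss>"""


def read_wp_authors(xml_bytes):
    """Collect each <wp:author> block into a dict keyed by child tag name."""
    root = ET.fromstring(xml_bytes)
    authors = []
    for author in root.iter('{%s}author' % WP_NS):
        record = {}
        for child in author:
            # ElementTree reports tags as '{namespace}name'; keep only 'name'.
            name = child.tag.split('}', 1)[-1]
            record[name] = (child.text or '').strip()
        authors.append(record)
    return authors


for author in read_wp_authors(SAMPLE):
    print(author['author_login'], author['author_email'])
```

This only covers the author blocks; anything else in the export would still go through feedparser as usual.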

Thanks in advance.

Original issue reported on code.google.com by robertln...@gmail.com on 13 Jun 2014 at 5:01

GoogleCodeExporter commented 9 years ago
I have monkey patched feedparser in the following way for the <wp:author> list:

import feedparser


def _start_wp_author(self, attrsD):
    # Each <wp:author> opens a new record; the child handlers below fill it in.
    context = self._getContext()
    context.setdefault('wp_authors', [])
    context['wp_authors'].append(feedparser.FeedParserDict())

def _start_wp_author_id(self, attrsD):
    context = self._getContext()
    context.setdefault('wp_author_id', [])
    self.push('wp_author_id', 1)  # start collecting the element's text
    context['wp_author_id'].append(attrsD)

def _end_wp_author_id(self):
    wp_author_id = self.pop('wp_author_id')
    context = self._getContext()
    # Attach the collected text to the most recent author record.
    context['wp_authors'][-1]['wp_author_id'] = wp_author_id

def _start_wp_author_login(self, attrsD):
    context = self._getContext()
    context.setdefault('wp_author_login', [])
    self.push('wp_author_login', 1)  # start collecting the element's text
    context['wp_author_login'].append(attrsD)

def _end_wp_author_login(self):
    wp_author_login = self.pop('wp_author_login')
    context = self._getContext()
    context['wp_authors'][-1]['wp_author_login'] = wp_author_login

def _start_wp_author_email(self, attrsD):
    context = self._getContext()
    context.setdefault('wp_author_email', [])
    self.push('wp_author_email', 1)  # start collecting the element's text
    context['wp_author_email'].append(attrsD)

def _end_wp_author_email(self):
    wp_author_email = self.pop('wp_author_email')
    context = self._getContext()
    context['wp_authors'][-1]['wp_author_email'] = wp_author_email

def _start_wp_author_display_name(self, attrsD):
    context = self._getContext()
    context.setdefault('wp_author_display_name', [])
    self.push('wp_author_display_name', 1)  # start collecting the element's text
    context['wp_author_display_name'].append(attrsD)

def _end_wp_author_display_name(self):
    wp_author_display_name = self.pop('wp_author_display_name')
    context = self._getContext()
    context['wp_authors'][-1]['wp_author_display_name'] = wp_author_display_name

def _start_wp_author_first_name(self, attrsD):
    context = self._getContext()
    context.setdefault('wp_author_first_name', [])
    self.push('wp_author_first_name', 1)  # start collecting the element's text
    context['wp_author_first_name'].append(attrsD)

def _end_wp_author_first_name(self):
    wp_author_first_name = self.pop('wp_author_first_name')
    context = self._getContext()
    context['wp_authors'][-1]['wp_author_first_name'] = wp_author_first_name

def _start_wp_author_last_name(self, attrsD):
    context = self._getContext()
    context.setdefault('wp_author_last_name', [])
    self.push('wp_author_last_name', 1)  # start collecting the element's text
    context['wp_author_last_name'].append(attrsD)

def _end_wp_author_last_name(self):
    wp_author_last_name = self.pop('wp_author_last_name')
    context = self._getContext()
    context['wp_authors'][-1]['wp_author_last_name'] = wp_author_last_name

# Register the handlers on the parser mixin so feedparser calls them.
feedparser._FeedParserMixin._start_wp_author = _start_wp_author
feedparser._FeedParserMixin._start_wp_author_id = _start_wp_author_id
feedparser._FeedParserMixin._end_wp_author_id = _end_wp_author_id
feedparser._FeedParserMixin._start_wp_author_login = _start_wp_author_login
feedparser._FeedParserMixin._end_wp_author_login = _end_wp_author_login
feedparser._FeedParserMixin._start_wp_author_email = _start_wp_author_email
feedparser._FeedParserMixin._end_wp_author_email = _end_wp_author_email
feedparser._FeedParserMixin._start_wp_author_display_name = _start_wp_author_display_name
feedparser._FeedParserMixin._end_wp_author_display_name = _end_wp_author_display_name
feedparser._FeedParserMixin._start_wp_author_first_name = _start_wp_author_first_name
feedparser._FeedParserMixin._end_wp_author_first_name = _end_wp_author_first_name
feedparser._FeedParserMixin._start_wp_author_last_name = _start_wp_author_last_name
feedparser._FeedParserMixin._end_wp_author_last_name = _end_wp_author_last_name

This seems highly verbose, but I am making an educated guess based on other 
tickets, as I could not find any docs for the feedparser mixin. Is this correct? 
Should I do this for every WordPress <wp:element_name> tag that I do not want 
overwritten by feedparser's default behavior?
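
One possible way to cut the repetition is to generate the per-field handlers in a loop. The sketch below is only an illustration: WP_AUTHOR_FIELDS, _make_child_handlers and install_wp_author_handlers are hypothetical names, and it relies on the same _getContext, push and pop methods used in the patch above. It stores each author in a plain dict rather than a FeedParserDict, and it leaves out the per-field list of attribute dicts, which the end handlers never read.

```python
# Sketch only: generate the repetitive <wp:author> child handlers in a loop.
# All names defined here are hypothetical helpers, not part of feedparser.

WP_AUTHOR_FIELDS = (
    'id', 'login', 'email', 'display_name', 'first_name', 'last_name',
)


def _start_wp_author(self, attrsD):
    # Open a new author record; the child handlers fill it in.
    context = self._getContext()
    context.setdefault('wp_authors', []).append({})


def _make_child_handlers(field):
    key = 'wp_author_' + field

    def start(self, attrsD):
        # Begin collecting the element's text content.
        self.push(key, 1)

    def end(self):
        # Store the collected text on the most recent author record.
        value = self.pop(key)
        self._getContext()['wp_authors'][-1][key] = value

    return start, end


def install_wp_author_handlers(mixin_cls):
    """Attach the start/end handlers for <wp:author> and its children."""
    mixin_cls._start_wp_author = _start_wp_author
    for field in WP_AUTHOR_FIELDS:
        start, end = _make_child_handlers(field)
        setattr(mixin_cls, '_start_wp_author_' + field, start)
        setattr(mixin_cls, '_end_wp_author_' + field, end)
```

With feedparser 5.1.3, as used in this thread, the call would presumably be install_wp_author_handlers(feedparser._FeedParserMixin), made once before parsing. Other releases may name or place the mixin differently.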

Thanks in advance.

Original comment by robertln...@gmail.com on 13 Jun 2014 at 6:21

GoogleCodeExporter commented 9 years ago
Based on the code above, I now get a Python list returned, which is what I want:

wp_authors = [
    {
        'wp_author_first_name': u'Foo',
        'wp_author_email': u'foo@bar.com',
        'wp_author_display_name': u'Foo Bar',
        'wp_author_login': u'admin',
        'wp_author_last_name': u'Bar',
        'wp_author_id': u'1'
    }
]

(Note this is different to my original request which was to have them in 
feed.wp_author. Now they are stored in feed.wp_authors, which actually makes 
more semantic sense.)

I can print the list of dictionaries out like this:

feed.wp_authors

So I think I am happy, although it would be nice to know whether my monkey 
patching is along the right lines.

Original comment by robertln...@gmail.com on 13 Jun 2014 at 6:27

GoogleCodeExporter commented 9 years ago
You're absolutely correct: that is currently the best way to handle 
monkey-patching!

I'll look at the WordPress export format namespace, though a cursory search 
for documentation suggests that I'll have to read through the WordPress code 
itself.

Quick confirmation: the namespace URI is http://wordpress.org/export/1.0/, 
correct?

Original comment by kurtmckee on 10 Jul 2014 at 2:03

GoogleCodeExporter commented 9 years ago
Hi Kurt,

Thanks for the feedback. The most recent namespace URI is actually:

http://wordpress.org/export/1.2/

(as of WordPress v3.9.1)

The namespace URI depends on which version of WordPress you have installed 
(i.e. an old version of WP won't point to the latest namespace URI, because the 
export functionality, including the URI, is hard-coded in each release).
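
Since the URI varies by WordPress version, one option is to read it from the export file itself rather than assume a value. The sketch below uses the standard library's ElementTree namespace events. The function name find_wp_namespace, the prefix constant, and the SAMPLE document are assumptions for illustration; the prefix is simply the part shared by the two URIs mentioned in this thread.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: every WordPress export namespace starts with this prefix,
# as both URIs quoted in this thread do.
WP_NS_PREFIX = 'http://wordpress.org/export/'

# Hypothetical, cut-down stand-in for a real export file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel><wp:wxr_version>1.2</wp:wxr_version></channel>
</rss>"""


def find_wp_namespace(source):
    """Return the first declared namespace URI under WP_NS_PREFIX, or None."""
    # 'start-ns' events yield (prefix, uri) pairs as declarations are seen.
    for _event, (_prefix, uri) in ET.iterparse(source, events=('start-ns',)):
        if uri.startswith(WP_NS_PREFIX):
            return uri
    return None


print(find_wp_namespace(io.BytesIO(SAMPLE)))
```

The returned URI could then be used to decide which handlers to register, or simply logged when an unexpected version shows up.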

Thanks for the help. The feedparser is an awesome library.

Original comment by robertln...@gmail.com on 10 Jul 2014 at 4:01