Feed content parsing truncation (podcast.hernancattaneo.com)

Thanks for flagging this. Looking at the XML file, I can immediately see what the issue is: the parser is choking on the ampersands (&). In HTML/XML, an ampersand marks the start of a character entity (e.g., &ldquo; and &rdquo; give you curly double quotes). The problem is that, depending on which program was used to create a podcast feed, it could be inserting the special characters directly, or using these HTML entity codes, or perhaps even some mix of the two. Browsers have gotten much more flexible with Unicode characters over the years, but that flexibility means that HTML parsing ends up being a pain.

Anyway, shortly before releasing v1.0 I added a small library to parse these HTML entity codes and convert them to their Unicode equivalents, but I'll admit I didn't test that library as much as I should have. It looks like it is choking on plain ampersands that don't designate the start of an entity. I can likely get a quick fix out within the next week that should resolve most of this behaviour, so keep an eye out for a v1.0.1. There are a couple of other little bugs that people have pointed out, so I will probably round all of those up into the same patch release. I'll still have to think about whether to add a more complete solution that catches more of the edge cases where a feed mixes direct Unicode characters with entity codes.
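To illustrate the kind of quick fix I have in mind, here is a rough sketch in Rust. It is not the actual shellcaster code, nor the API of the library mentioned above: `decode_entities` and `lookup` are hypothetical names, and the entity table is deliberately tiny. The idea is just that an ampersand is only treated as the start of an entity if a recognised name or numeric code followed by a semicolon comes right after it; otherwise it is copied through as a literal character instead of cutting the text short.

```rust
/// Decode a small set of HTML entities, leaving bare ampersands untouched.
/// Hypothetical helper for illustration only.
fn decode_entities(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(amp) = rest.find('&') {
        // Copy everything before the ampersand unchanged.
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];

        // Candidate entity: the text up to a nearby semicolon
        // (the 10-byte limit is an arbitrary assumption).
        let decoded = after
            .find(';')
            .filter(|&end| end > 0 && end <= 10)
            .and_then(|end| lookup(&after[..end]).map(|c| (c, end)));

        match decoded {
            Some((c, end)) => {
                // Recognised entity: emit the character, skip past the `;`.
                out.push(c);
                rest = &after[end + 1..];
            }
            None => {
                // Not an entity: keep the ampersand as a literal character.
                out.push('&');
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Map an entity body (the part between `&` and `;`) to a character.
/// Only a handful of named entities are listed here.
fn lookup(name: &str) -> Option<char> {
    if let Some(num) = name.strip_prefix('#') {
        // Numeric reference: `#38` (decimal) or `#x26` (hexadecimal).
        let code = match num.strip_prefix('x').or_else(|| num.strip_prefix('X')) {
            Some(hex) => u32::from_str_radix(hex, 16).ok()?,
            None => num.parse::<u32>().ok()?,
        };
        return char::from_u32(code);
    }
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        "ldquo" => Some('\u{201C}'),
        "rdquo" => Some('\u{201D}'),
        _ => None,
    }
}
```

With something like this, a title such as "Drum & Bass" passes through unchanged, while "Tom &amp; Jerry" becomes "Tom & Jerry". A complete solution would need the full named-entity table and more careful handling of malformed input.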

jeff-hughes / shellcaster

Feed content parsing truncation (podcast.hernancattaneo.com) #12