kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io/en/latest/

Can't parse multiple elements of an <entry> with the same name #435

Open wsanders opened 3 months ago

wsanders commented 3 months ago

NOAA publishes an ATOM feed of their weather alerts (example): noaa-sample-event.txt

"Encapsulated" in this feed is XML-reformatted data in a spec called CAP, based on JSON. The CAP entries in the Atom XML show up in a particular way, with colons in the tag:

    <cap:event>Wind Advisory</cap:event>

At the entry level, feedparser handles these in a predictable way:

    {'id': .... .... 'cap_event': 'Wind Advisory', .... etc

However, the NOAA entries include multiple instances of a "cap:parameter" item:

    <cap:parameter>
      <valueName>somename</valueName>
      <value>somevalue</value>
    </cap:parameter>
    <cap:parameter>
      <valueName>someothername</valueName>
      <value>someothervalue</value>
    </cap:parameter>

feedparser's output only includes the last cap:parameter's valueName and value, followed by an empty cap_parameter:

    {'id':........ 'valuename': 'eventEndingTime', 'value': '2024-03-30T12:00:00+00:00', 'cap_geocode': '', 'cap_parameter': ''}

I don't know much about ATOM, so I don't know if this is a real issue or if the NOAA ATOM is nonstandard in some way.

I would expect output in something like the CAP format JSON:

    "parameters": {
        "somename": [ "somevalue" ],
        "someothername": [ "someothervalue", "possibly a list etc", ],
        etc

The workaround is to extract the URL of the CAP data that is part of the ATOM feed. Fetching that URL returns useful JSON, but it's not in ATOM format.
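
Another possible workaround, sketched here for illustration, is to read the repeated elements straight from the feed XML with the standard library instead of going through feedparser. This is a minimal sketch under assumptions: the sample document is a hypothetical one modeled on the snippet above (the namespace URIs in it are assumed, not taken from the NOAA feed), the helper names are made up, and elements are matched by local name so the exact namespace does not matter.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample modeled on the snippet in this issue; the namespace
# URIs are assumptions and may differ from the real NOAA feed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:cap="urn:oasis:names:tc:emergency:cap:1.1">
  <entry>
    <id>example-1</id>
    <cap:event>Wind Advisory</cap:event>
    <cap:parameter>
      <valueName>somename</valueName>
      <value>somevalue</value>
    </cap:parameter>
    <cap:parameter>
      <valueName>someothername</valueName>
      <value>someothervalue</value>
    </cap:parameter>
  </entry>
</feed>"""


def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def collect_parameters(entry):
    """Return {valueName: [value, ...]} for every 'parameter' child of entry."""
    params = defaultdict(list)
    for elem in entry:
        if local(elem.tag) != "parameter":
            continue
        # Map each child's local name to its text, e.g. valueName / value.
        fields = {local(child.tag): (child.text or "").strip() for child in elem}
        if "valueName" in fields:
            params[fields["valueName"]].append(fields.get("value", ""))
    return dict(params)


def parameters_by_entry(xml_text):
    """Return one parameters dict per 'entry' element, in document order."""
    root = ET.fromstring(xml_text)
    return [collect_parameters(e) for e in root.iter() if local(e.tag) == "entry"]


if __name__ == "__main__":
    print(parameters_by_entry(SAMPLE))
```

Because each valueName maps to a list, a name that appears more than once keeps all of its values rather than only the last one.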

pishposhmcgee commented 1 month ago

I am also seeing this issue with transcripts in podcast feeds. The Podcastindex specification allows multiple transcript entries, one for each format of transcript. Feedparser seems to load each entry and overwrite any existing one, with the effect that only the last entry is presented.
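
For the podcast case, a rough sketch of collecting every transcript per item with the standard library (not feedparser) might look like the following. The sample feed, its URLs, and the function name are hypothetical; the namespace URI is the one I understand the Podcastindex namespace to use, so treat it as an assumption and check it against the feed you are parsing.

```python
import xml.etree.ElementTree as ET

# Assumed Podcastindex namespace URI; verify against the actual feed.
PODCAST_NS = {"podcast": "https://podcastindex.org/namespace/1.0"}

# Hypothetical sample feed with two transcripts on one item.
SAMPLE_RSS = """<rss version="2.0"
     xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <title>Example show</title>
    <item>
      <title>Episode 1</title>
      <podcast:transcript url="https://example.com/ep1.vtt" type="text/vtt" />
      <podcast:transcript url="https://example.com/ep1.srt" type="application/x-subrip" />
    </item>
  </channel>
</rss>"""


def transcripts_by_item(xml_text):
    """Return a list of {'title': ..., 'transcripts': [attrs, ...]} per item."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        # findall returns every matching child, so nothing is overwritten.
        transcripts = [
            dict(t.attrib) for t in item.findall("podcast:transcript", PODCAST_NS)
        ]
        items.append(
            {"title": item.findtext("title", default=""), "transcripts": transcripts}
        )
    return items


if __name__ == "__main__":
    for entry in transcripts_by_item(SAMPLE_RSS):
        print(entry["title"], entry["transcripts"])
```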

pishposhmcgee commented 1 month ago

With more searching this also seems to be related to #297 and #301