mischov / meeseeks

An Elixir library for parsing and extracting data from HTML and XML with CSS or XPath selectors.
MIT License

Meeseeks.tree doesn't recognize <link> tag #102

Closed: pratos closed this issue 4 years ago

pratos commented 4 years ago

Thanks for this wonderful library! I'm very new to Elixir and OTP and it has been a great experience so far.

I'm parsing RSS feeds from a few URLs (one example is this url). All the other tags are recognized properly, but the <link> tag is parsed incorrectly.

Example XML:

<item>
    <title>Every Premier League squad profiled</title>
    <link>https://theathletic.com/podcast/145-zonal-marking/?episode=43</link>
    <description>The Athletic's Michael Cox and Tom Worville join Ali Maxwell to profile every Premier League squad by average age, depth of quality and numerous other quirks along the way. Who is well-stocked across the pitch? Who are the unsung 
   ...
</item>

The equivalent Meeseeks.tree:

iex|48|▶▶ Meeseeks.tree(first)
{"item", [],
 [
   "\n      ",
   {"guid", [{"ispermalink", "false"}],
    ["tag:soundcloud,2010:tracks/867064399"]},
   "\n      ",
   {"title", [], ["StatsBomb Podcast: PL Wrap Up July 30th 2020"]},
   "\n      ",
   {"pubdate", [], ["Thu, 30 Jul 2020 11:02:31 +0000"]},
   "\n      ",
   {"link", [], []},  # <== the link tag, with no children
   "https://soundcloud.com/statsbomb-pod/statsbomb-podcast-pl-wrap-up-july-30th-2020\n      ",  # <== the URL, outside the link tag
  ....
 ]}

Trying to find the link tag via Meeseeks gives an empty value. The parser seems to treat the URL as a separate text node in the XML tree rather than as the content of the link tag. I'm not sure if I'm doing it right.

Also, I tried to extract the same feed with Floki and could get the link properly.

Here's the link to a gist that reproduces the issue.

pratos commented 4 years ago

Gee, I think I got the issue. I looked around Elixir forums and found this thread.

I need to pass :xml as the second argument to Meeseeks.parse() 😬
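
For anyone landing here later, a minimal sketch of the difference. It assumes that Meeseeks parses input as HTML unless told otherwise, and in HTML <link> is a void element that cannot hold children, which would explain the empty `{"link", [], []}` in the tree above. The feed string, the example.com URL, and the `~> 0.17` version requirement are placeholders for illustration, not taken from the original feed.

```elixir
# Sketch only: the feed below is a made-up stand-in for the real RSS document,
# and the version requirement is an assumption.
Mix.install([{:meeseeks, "~> 0.17"}])

import Meeseeks.CSS

feed = """
<rss version="2.0">
  <channel>
    <item>
      <title>Example episode</title>
      <link>https://example.com/episodes/1</link>
    </item>
  </channel>
</rss>
"""

# Default parse: the source is treated as HTML, so <link> is handled as a
# void element and the URL ends up outside of it.
html_doc = Meeseeks.parse(feed)
html_link = html_doc |> Meeseeks.one(css("link")) |> Meeseeks.text()

# XML parse: <link> is an ordinary element and keeps its text child.
xml_doc = Meeseeks.parse(feed, :xml)
xml_link = xml_doc |> Meeseeks.one(css("link")) |> Meeseeks.text()

IO.inspect(html_link, label: "parsed as HTML")
IO.inspect(xml_link, label: "parsed as XML")
```

Only the second value should carry the URL. The lowercased `ispermalink` and `pubdate` names in the tree above are another hint that the document went through the HTML parser.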