lightpohl / podcast-dl

A humble CLI for downloading and archiving podcasts.
MIT License

Handling problematic feeds #53

Closed calebj closed 1 year ago

calebj commented 1 year ago

I've got one podcast with episode titles that exceed the filename length when .tmp is added by getTempPath() in util.js. Unfortunately I don't control the podcast, so I can't set them straight on the point of a title. I wanted to see how you might want to tackle this first issue, because it would also impact the .meta.json generation and possibly existing downloads.

From what I can tell, filenamify only looks at the last extension, so .meta.json would be cut to .json with an offset of 5, which would cause different prefixes to be generated for templates that render to a length of more than 245. Simply changing the length might cause differences in generated filenames for lengths on the edge, and cause some users to redownload episodes if they're not using an archive. I don't know what the best solution is here.
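To make the offset mismatch concrete, here's a sketch of the truncation math (plain string slicing standing in for filenamify, which reserves room only for the final extension):

```javascript
// Sketch: last-extension-only truncation. Plain string slicing stands
// in for filenamify, which reserves room only for the final extension.
const MAX = 255;
const title = "x".repeat(300); // an overlong rendered template

// For "<title>.meta.json", filenamify sees only ".json" as the
// extension, so ".meta" eats into the title's share of the budget:
const jsonTitlePart = title.slice(0, MAX - ".json".length - ".meta".length);
// For "<title>.mp3" the whole remainder goes to the title:
const mp3TitlePart = title.slice(0, MAX - ".mp3".length);

console.log(jsonTitlePart.length); // 245
console.log(mp3TitlePart.length);  // 251 -> prefixes diverge past 245
```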


I have another podcast--whose producers I have emailed a few times about this issue, to no avail--with invalid URIs in the <link> field of some episodes. Since the exception this triggers is unhandled, it crashes the whole run:

TypeError [ERR_INVALID_URL]: Invalid URL
    at new NodeError (node:internal/errors:399:5)
    at new URL (node:internal/url:560:13)
    at getUrlExt (file:///home/caleb/.local/lib/node_modules/podcast-dl/bin/util.js:327:24)
    at getIsAudioUrl (file:///home/caleb/.local/lib/node_modules/podcast-dl/bin/util.js:352:15)
    at getEpisodeAudioUrlAndExt (file:///home/caleb/.local/lib/node_modules/podcast-dl/bin/util.js:362:15)
    at getItemsToDownload (file:///home/caleb/.local/lib/node_modules/podcast-dl/bin/util.js:148:7)
    at main (file:///home/caleb/.local/lib/node_modules/podcast-dl/bin/bin.js:172:23)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  input: 'https://%20',
  code: 'ERR_INVALID_URL'
}

I was thinking of wrapping the initial check in a try/catch and returning false if it fails:

const getIsAudioUrl = (url) => {
  let ext;
  try {
    ext = getUrlExt(url);
  } catch (err) {
    // could log a warning here?
    return false;
  }

  if (!ext) {
    return false;
  }

  return VALID_AUDIO_EXTS.includes(ext);
};

The logic being that if it fails to parse, it can't reliably be checked for a valid audio extension, so it certainly is not an audio URL. I can PR this if you agree.
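As a self-contained sanity check of that approach (using a simplified stand-in for getUrlExt(), not the real helper):

```javascript
// Minimal repro: WHATWG URL parsing rejects "https://%20" (a percent-
// encoded space is not a valid host), throwing ERR_INVALID_URL.
// The extension check below is a simplified stand-in for getUrlExt().
const getIsAudioUrl = (url) => {
  let ext;
  try {
    ext = new URL(url).pathname.split(".").pop();
  } catch (err) {
    return false; // unparseable -> certainly not an audio URL
  }
  return ["mp3", "m4a", "ogg"].includes(ext);
};

console.log(getIsAudioUrl("https://%20"));                 // false (constructor throws)
console.log(getIsAudioUrl("https://example.com/ep1.mp3")); // true
```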


Finally, I have a third podcast with a CDN that rejects HEAD requests, but only for the tracking URLs. Stripping off the tracking redirect works fine. Rather than an option to disable HEAD, I was thinking of adding an option for a URL filter script that can be called to pre-process the audio links before they are downloaded, sort of like how ripgrep has a preprocessor command option. I have not decided if it would be better to pipe URLs through as lines, or call it once per URL. The experimental tracking bypass option didn't really work, but a custom script would be able to tackle it on a per-feed basis. This probably deserves its own issue.

lightpohl commented 1 year ago

Hey! Thanks for the writeup!

It's not ideal, but for the name issue would it make sense to update the episode template for that particular podcast? The full title would still exist in the .meta.json if you need to search for it, but you'd get a shorter title that will work with your file system. Something like:

--episode-template "{{episode_num}}" --episode-digits 3

I would more than happily accept PRs for 2 and 3 if you have the time! I'm also happy to tackle 2 if you'd prefer to focus on 3. Just let me know!

calebj commented 1 year ago

OK, I downloaded those episodes with just an episode number and they're listed in the archive, so hopefully this won't be a problem again. It isn't a great workaround, though, because the titles do have episode numbers in them; they just have an unreasonable stack of headlines crammed in there too.

I did find an actual bug, though. If the media file is 0 bytes, the "Unable to write to file. Suggestion: verify permissions" message appears. The length attribute in the enclosure is also 0, so why such an entry is even in the feed is beyond me.

I'll PR 2 and play with some ideas for 3 👍🏼

lightpohl commented 1 year ago

Updated parsing is available in v8.0.8!

calebj commented 1 year ago

Found some additional issues:

lightpohl commented 1 year ago

Thanks for the report!

Removing the length cap or at least bumping it to a much larger number makes sense for the archive.

I have an idea for how to handle the second issue. Let me grab some time this week and see how it goes. 👍

lightpohl commented 1 year ago

Both items should be resolved/fixable in the latest release! Check out the changelog for details.

calebj commented 1 year ago

Unfortunately, the enclosure/link issue is not fixed. At first it still didn't use the enclosure link, which I fixed by passing episodeSourceOrder to getEpisodeAudioUrlAndExt() in downloadItemsAsync(). But then the link check appeared to be broken: it claimed the media link could not be found for all my other feeds.

I traced this secondary issue to the default argument for the --episode-source-order option, which remains a string because the default isn't run through the parser. When getEpisodeAudioUrlAndExt() iterates over the default argument, it iterates over the individual characters of the string, so no condition ever matches. I don't know how you want to resolve this, because changing it to [AUDIO_ORDER_TYPES.enclosure, AUDIO_ORDER_TYPES.link] renders it as

--episode-source-order <string>  attempted order to extract episode audio URL from rss feed (default: ["enclosure","link"])

in the help output. Plugging in this array as the default or providing one every time fixes the issue, though.
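The character-iteration pitfall is easy to demonstrate in isolation (a sketch, not podcast-dl's actual matching loop):

```javascript
// Sketch: why an unparsed string default breaks the source-order loop.
const AUDIO_ORDER_TYPES = { enclosure: "enclosure", link: "link" };

const matches = (order) => {
  const hits = [];
  for (const source of order) {
    // With the string default, `source` is "e", "n", "c", ... and
    // never equals "enclosure" or "link".
    if (Object.values(AUDIO_ORDER_TYPES).includes(source)) hits.push(source);
  }
  return hits;
};

console.log(matches("enclosure,link"));      // [] -- characters, no matches
console.log(matches(["enclosure", "link"])); // ["enclosure", "link"]
```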

This is my changeset:

diff --git a/bin/async.js b/bin/async.js
index 2758b87..bd8830d 100644
--- a/bin/async.js
+++ b/bin/async.js
@@ -162,6 +162,7 @@ let downloadItemsAsync = async ({
   mono,
   override,
   targetItems,
+  episodeSourceOrder,
   threads = 1,
 }) => {
   let numEpisodesDownloaded = 0;
@@ -173,7 +174,7 @@ let downloadItemsAsync = async ({
     const marker = threads > 1 ? `[${threadIndex}] ${item.title}` : item.title;
     const logMessage = getLogMessageWithMarker(marker);
     const { url: episodeAudioUrl, ext: audioFileExt } =
-      getEpisodeAudioUrlAndExt(item);
+      getEpisodeAudioUrlAndExt(item, episodeSourceOrder);

     if (!episodeAudioUrl) {
       hasErrors = true;
diff --git a/bin/bin.js b/bin/bin.js
old mode 100644
new mode 100755
index c05ecea..108300a
--- a/bin/bin.js
+++ b/bin/bin.js
@@ -212,6 +212,7 @@ const main = async () => {
     mono,
     override,
     targetItems,
+    episodeSourceOrder,
     threads,
   });

diff --git a/bin/commander.js b/bin/commander.js
index 20c1c3c..b43050a 100644
--- a/bin/commander.js
+++ b/bin/commander.js
@@ -41,7 +41,7 @@ export const setupCommander = (commander, argv) => {

         return parsed;
       },
-      "enclosure,link"
+      [AUDIO_ORDER_TYPES.enclosure, AUDIO_ORDER_TYPES.link]
     )
     .option("--include-meta", "write out podcast metadata to json")
     .option(

As for the archive key collision issue, it appears resolved. I made a temporary tweak to download() to call writeToArchive() on existing files so they'd carry over with the new key format.

Filenames with .tmp added are still ending up too long, which causes an ENAMETOOLONG error on download, so that issue remains. I would suggest using getSafeName() to get a filename base that is at most 245 characters long, so it can accommodate both .mp3.tmp and .meta.json and give both a matching prefix.

However, in my specific case this may not be enough. I seem to have discovered yet another edge case: multi-byte characters are counted as one character each by filenamify, but count as multiple bytes against the 255-byte filename limit on ZFS. Significantly lowering the default filename length may be a non-starter, but adding a CLI option for it may be a reasonable compromise.

To be clear, these are two separate issues, as getTempPath() does no length checking with the extra 4 characters, and filesystems do vary in their filename limits and how they count characters.
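The character-versus-byte gap is easy to see in a couple of lines:

```javascript
// Sketch: character count vs. byte count. filenamify caps *characters*,
// but ZFS (like most Unix filesystems) caps filenames at 255 *bytes*.
const name = "ß".repeat(200); // 200 characters, 2 bytes each in UTF-8

console.log(name.length);                     // 200 -> passes a 255-char cap
console.log(Buffer.byteLength(name, "utf8")); // 400 -> ENAMETOOLONG territory
```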

lightpohl commented 1 year ago

Thanks for the report! I updated filenamify to the latest version and adjusted the default limit to account for .tmp files. Would you mind giving v9.0.1 a spin and seeing how it goes?

I can add a way to override the OS filename length limit in another update if you're still stuck on those feeds. 👍

calebj commented 1 year ago

Thanks! The update to filenamify caused the truncated filenames to have the correct extension, except for .meta.json, which is reduced to just .json, and the title before it is one character shorter than its .mp3 counterpart. The multi-byte characters also still cause an ENAMETOOLONG, since they go over 255 bytes once .tmp is added.

I'd greatly appreciate a custom length limit option to impose sanity onto the filenames in a general sense, but I will say that since filenamify only looks at the last extension (which is what causes .meta to be elided), the best solution would be to go with the filename base idea and then tack extensions onto it without going through filenamify.
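The filename-base idea could look something like this (illustrative only: a plain character-class replace and slice stand in for the real getSafeName()/filenamify behavior):

```javascript
// Sketch of the suggested fix: sanitize and truncate the *base* once,
// then append extensions verbatim so every artifact shares one prefix.
// The replace/slice here stands in for getSafeName()/filenamify.
const safeBase = (title, maxBaseLen = 245) =>
  title.replace(/[\/\\?%*:|"<>]/g, "-").slice(0, maxBaseLen);

const base = safeBase("A".repeat(300));
const files = [base + ".mp3", base + ".mp3.tmp", base + ".meta.json"];

// All three share the same 245-char prefix and stay within 255 chars.
console.log(files.every((f) => f.startsWith(base) && f.length <= 255)); // true
```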

lightpohl commented 1 year ago

Okay! I think we're in a better spot with v9.0.2.

I didn't tackle making sure that all the base names are the same; that's a bit more tricky and will require more testing.

Mind giving it a whirl when you have a chance?

calebj commented 1 year ago

Yep, it's all working correctly now 👌🏼