ostrolucky / Simple-Subreddit-Image-Downloader

Bash script for downloading images posted in specific subreddit with minimal footprint
GNU General Public License v2.0
54 stars 6 forks source link

Enforced title sanitization more effectively: Using `tr -d '/\r\n'`and `sed "s/\// /g"` to remove any potential newlines and slashes from titles. #18

Closed Montana closed 6 months ago

Montana commented 6 months ago
ostrolucky commented 6 months ago

Overall I like the patch but we need to clarify bunch of issues:

Could you explain why is tr -d '/\r\n | sed "s/\// /g more effective than whatever we have now?

Other points:

  1. echo -n was important to fix #4. Why was it removed?
  2. wait needs to be part of every while loop as a form of throttling, otherwise script can easily spawn thousands simultaneous processes and connections
  3. please remove print_usage function and instead inline it like it was before, I prefer less indirection
  4. Similarly, please revert unnecessary placeholder substitution. printf "$i/$limit : $newname\n" is more readable than printf "%d/%d : %s\n" "$i" "$limit" "$newname". Or replace with echo
Montana commented 6 months ago

Hi @ostrolucky,

The tr -d '/\r\n' command ensures that any potentially problematic characters (slashes, carriage returns, newlines) are removed from the title, making the filenames safer and less likely to cause errors when saving, secondly combining tr and sed streamlines the sanitization process, reducing the risk of missed problematic characters that might be overlooked by sed alone.

In other words filenames without slashes, carriage returns, and newlines are more likely to be accepted by the file system, avoiding issues with invalid characters.

So the current approach in this script is:

newname=`echo $name | sed "s/^\///;s/\// /g"`_"$subreddit"_$id.$ext

The approach I took was the following:

newname=`echo $name | tr -d '/\r\n' | sed 's/\// /g'`_"$subreddit"_$id.$ext

As you requested I've removed print_usage:

#!/bin/bash

#cfg
subreddit=$1
sort=$2
top_time=$3
limit=$4

if [ "$1" == "-h" ] || [ -z $subreddit ]; then
echo "Usage: $0 SUBREDDIT [hot|new|rising|top|controversial] [all|year|month|week|day] [limit]
Examples:   $0 starterpacks new week 10
            $0 funny top all 50"
exit 0;
fi

if [ -z $sort ]; then
    sort="hot"
fi

if [ -z $top_time ];then
    top_time=""
fi

if [ -z $limit ]; then
    limit=0
fi

url="https://www.reddit.com/r/$subreddit/$sort/.json?raw_json=1&t=$top_time"
content=`curl $url`
mkdir -p $subreddit
i=1
while : ; do
    urls=$(echo -n "$content"| jq -r '.data.children[]|select(.data.post_hint|test("image")?) | .data.preview.images[0].source.url')
    names=$(echo -n "$content"| jq -r '.data.children[]|select(.data.post_hint|test("image")?) | .data.title')
    ids=$(echo -n "$content"| jq -r '.data.children[]|select(.data.post_hint|test("image")?) | .data.id')
    a=1
    for url in $urls; do
        name=`echo -n "$names"|sed -n "$a"p`
        id=`echo -n "$ids"|sed -n "$a"p`
        ext=`echo -n "${url##*.}"|cut -d '?' -f 1 | sed 's/gif/png/' `
        newname=`echo $name | tr -d '/\r\n' | sed 's/\// /g'`_"$subreddit"_$id.$ext
        printf "$i/$limit : $newname\n"
        curl --retry 3 --no-clobber --output "$subreddit/$newname" $url &>/dev/null &
        ((a=a+1))
        ((i=i+1))
        if [ $i -gt $limit ] ; then
          wait 
          exit 0
        fi
    done
    wait 
    after=$(echo -n "$content"| jq -r '.data.after//empty')
    if [ -z $after ]; then
        break
    fi
    url="https://www.reddit.com/r/$subreddit/$sort/.json?count=200&after=$after&raw_json=1&t=$top_time"
    content=`curl --retry 3 $url`
done

So inside the for loop I've placed each image download command (curl) that runs in the background (&). The wait command ensures that all background processes finish before the script exits if the limit is reached.

Cheers, Michael Mendy

ostrolucky commented 6 months ago

The tr -d '/\r\n' command ensures that any potentially problematic characters (slashes, carriage returns, newlines) are removed from the title

As far as I know, so does | sed "s/^\///;s/\// /g". Or am I missing something?

Montana commented 6 months ago

The tr -d '/\r\n' command ensures that any potentially problematic characters (slashes, carriage returns, newlines) are removed from the title

As far as I know, so does | sed "s/^\///;s/\// /g". Or am I missing something?

This command is useful when you want to remove specific characters from the text, so the following sed command would be just that:

So let's start with tr -d '/\r\n':

tr -d '/\r\n':

To move to sed "s/^\///;s/\// /g":

I think the confusion is, these commands serve different purposes and are not interchangeable. Use tr when you need to delete specific characters, and use sed when you need to perform pattern-based text replacements. So you'll notice when I added to the pull request:

newname=`echo $name | tr -d '/\r\n' | sed 's/\// /g'`_"$subreddit"_$id.$ext

This was a dual use case, to make sure all these functions were being ran.

Cheers, Michael Mendy

ostrolucky commented 6 months ago

Ok thanks! I've incorporated these improvements with ab7df66e8b4aa6ac238ba28227ac0ea0fa154b20

xawpaw commented 6 months ago

yo thanks, script runs a lot smoother @Montana