tfabris / CrowCam

A set of Bash scripts to control and maintain a YouTube live cam from a Synology NAS.
GNU General Public License v3.0

Improve the speed of CrowCamCleanup.sh #72

Closed tfabris closed 5 months ago

tfabris commented 5 months ago

The script CrowCamCleanup.sh runs slower than I would like. With my current video playlist it's taking about 11 minutes to run on my Synology. The biggest inefficiency is the code loop here:

jsonLineCount=0
LogMessage "dbg" "Processing JSON results from the API queries, this may take a moment"
while IFS= read -r line
do
    ((jsonLineCount++))
    if [ $(($jsonLineCount % 1000)) = 0 ]
    then
      LogMessage "dbg" "Processing JSON line $jsonLineCount. Found so far: ${#playlistItemIds[@]} playlistItemIds, ${#videoIds[@]} videoIds, ${#titles[@]} titles"
    fi

    lineResult=$( echo $line | grep '"id"' | cut -d '"' -f4 )
    if ! [ -z "$lineResult" ]
    then
      playlistItemIds+=( "$lineResult" )
    fi

    lineResult=$( echo $line | grep '"title"' | cut -d '"' -f4 )
    if ! [ -z "$lineResult" ]
    then
      titles+=( "$lineResult" )
    fi

    lineResult=$( echo $line | grep '"videoId"' | cut -d '"' -f4)
    if ! [ -z "$lineResult" ]
    then
      videoIds+=( "$lineResult" )
    fi
done <<< "$uploadsOutput"

I think it can be made much speedier by re-doing the code more like this (the following code is from something I recently added to TestFile.sh):

# Parse out all of the titles, video IDs, and start times from the video data
# Regex string looks like this:
#    "(title|videoId|actualStartTime)": "([^"])*"
# Which means:
#    "          Find a quote
#    (          Find one of these things in this group
#    title      Find the word title
#    |videoId   or the word videoId
#    |actual... or the word actualStartTime
#    )          Close up that group of things
#    ": "       Find a quote, a colon, a space, and a quote
#    (          Find the things in this group
#    [^"]       Find anything that's NOT a quote
#    )          Close up that group of things
#    *          Find any number of instances of that group in a row (characters that aren't quotes)
#    "          Find a quote
#
# This returns three strings from each entry in the JSON that look like this, in
# this order:
#
#    "title": "7:28 am - Juvenile crow begs from parent quite intensely"
#    "videoId": "1Y9BFUyzpys"
#    "actualStartTime": "2020-09-01T13:50:13Z"
#
#
# Special notes about this code which greps the data:
# - Grep commands: -o only matching text returned, -h hide filenames, -E extended regex
# - arrayName=( ):  Make sure to have the outer parentheses to make it a true array.
# - IFS_backup=$IFS; IFS=$'\n': IFS is the way it splits the resulting array. Normally it
#   splits on space/tab/linefeed, I'm changing it to just split on linefeed so that each
#   return value from the regex grep is its own array element.
IFS_backup=$IFS
IFS=$'\n'
videoDataArray=( $(grep -o -h -E '"(title|videoId|actualStartTime)": "([^"])*"' $videoData) )
IFS=$IFS_backup

# Syntax note: the pound sign retrieves the count/size of the array.
videoDataArrayCount=${#videoDataArray[@]}  

# Process all items
loopIndex=0
videosProcessed=0
while [ $loopIndex -lt $videoDataArrayCount ]   
do
    # Freshen variables at the start of each loop
    currentTitle=""
    currentvideoId=""
    currentActualStartTime=""

    # Record the first of the three items, the title.
    # (Quoting the variable keeps echo from collapsing whitespace or
    # expanding wildcard characters that appear in a title.)
    oneVideoDataItem=${videoDataArray[loopIndex]}
    currentTitle=$( echo "$oneVideoDataItem" | cut -d '"' -f4 )

    ((loopIndex++))

    # Record the second item, the video ID
    oneVideoDataItem=${videoDataArray[loopIndex]}
    currentvideoId=$( echo "$oneVideoDataItem" | cut -d '"' -f4 )

    ((loopIndex++))

    # Record the third item, the start time
    oneVideoDataItem=${videoDataArray[loopIndex]}
    currentActualStartTime=$( echo "$oneVideoDataItem" | cut -d '"' -f4 )

    ((loopIndex++))  # Update this value after logging, so the number is still correct.
done
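
A possible further refinement, offered as a sketch rather than as the code that was checked in: the loop above still starts a subshell and a cut process for every field. Bash parameter expansion can strip the key and the quotes without forking anything, and mapfile (bash 4 or later, which I am assuming is available on the NAS) fills the array without changing IFS. The sample JSON and the names sampleJson, matches, parsedTitles, parsedVideoIds, and parsedStartTimes are hypothetical and are not taken from CrowCam.

```shell
#!/bin/bash
# Hypothetical sample standing in for the real API output.
sampleJson='{
  "title": "7:28 am - Juvenile crow begs from parent quite intensely",
  "videoId": "1Y9BFUyzpys",
  "actualStartTime": "2020-09-01T13:50:13Z"
}'

# Read each regex match into its own array element (no IFS juggling needed).
mapfile -t matches < <(grep -o -E '"(title|videoId|actualStartTime)": "[^"]*"' <<< "$sampleJson")

parsedTitles=()
parsedVideoIds=()
parsedStartTimes=()
for item in "${matches[@]}"
do
    key=${item#\"}          # Drop the leading quote
    key=${key%%\"*}         # Keep the text before the next quote: the key name
    value=${item#*\": \"}   # Drop everything through the quote-colon-space-quote
    value=${value%\"}       # Drop the trailing quote
    case $key in
        title)           parsedTitles+=( "$value" ) ;;
        videoId)         parsedVideoIds+=( "$value" ) ;;
        actualStartTime) parsedStartTimes+=( "$value" ) ;;
    esac
done

echo "Parsed ${#parsedTitles[@]} titles, ${#parsedVideoIds[@]} videoIds, ${#parsedStartTimes[@]} start times"
```

Sorting each value by its key name, instead of assuming every entry arrives as exactly three lines in a fixed order, also means an entry with a missing field does not shift everything after it. Like the original regex, this sketch does not handle escaped quotes inside a title.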

tfabris commented 5 months ago

I've checked in one big fix for this. It is currently sitting in the issue69 code branch (not merged to master yet): 01a112e

This improves the speed so that it takes only 3 minutes to run instead of 11 minutes. I think I can get that 3 minutes down even lower if I work on the next section. The next slow part is the loop that checks each item's timestamp, sees whether it is more than x days old, and decides whether to remove that video. That loop is pretty slow and accounts for most of the remaining run time. I am leaving this bug open until I can optimize that.
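
For the age check itself, one option is to avoid calling date once per video. This is only a sketch under assumptions, not what CrowCamCleanup.sh does: it assumes every timestamp is in the same fixed-width UTC form shown earlier (for example 2020-09-01T13:50:13Z), so that plain string comparison orders them correctly, and it assumes GNU date's -d option is available. The names IsOlderThan, retentionDays, cutoffTime, and sampleTimes are hypothetical.

```shell
#!/bin/bash
# IsOlderThan: succeeds if the first timestamp sorts before the second.
# Both must use the same fixed-width UTC form, e.g. 2020-09-01T13:50:13Z.
IsOlderThan()
{
    [[ "$1" < "$2" ]]
}

# Compute the cutoff once, outside the loop (assumes GNU date's -d option).
retentionDays=5
cutoffTime=$( date -u -d "$retentionDays days ago" +%Y-%m-%dT%H:%M:%SZ )

# Hypothetical sample data standing in for the parsed start times.
sampleTimes=( "2020-09-01T13:50:13Z" "2020-09-02T14:00:00Z" )
for oneTime in "${sampleTimes[@]}"
do
    if IsOlderThan "$oneTime" "$cutoffTime"
    then
        echo "$oneTime is older than $retentionDays days"
    fi
done
```

The comparison is a bash built-in, so the loop forks no processes. It stops being valid if the timestamps mix formats (for example, some with fractional seconds), so that would need checking against real API output.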

tfabris commented 5 months ago

The slowness is in this sed command:

uploadsOutput=$( sed "s/\"$oneVideoId\"/&,\"actualStartTime\": \"$actualStartTime\", \"actualEndTime\": \"$actualEndTime\"/g" <<< $uploadsOutput )

The string "uploadsOutput" is large, and I'm re-sedding it dozens upon dozens of times. Each one takes about one second to process.

I tried doing "echo $uploadsOutput | sed" instead, and it is no faster than doing "sed <<< $uploadsOutput".

Investigating other methods.
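
One method that might be worth trying, sketched here under assumptions and not tested against the real playlist: instead of running sed once per video over the whole string, collect all of the substitutions into a single sed script and make one pass. The sample data and the names videoIdList, startTimes, endTimes, and sedScript are hypothetical.

```shell
#!/bin/bash
# Hypothetical sample data; the real script gets these from the YouTube API.
uploadsOutput='"videoId": "aaa111"
"videoId": "bbb222"'
videoIdList=( "aaa111" "bbb222" )
startTimes=( "2020-09-01T13:50:13Z" "2020-09-02T14:00:00Z" )
endTimes=( "2020-09-01T15:50:13Z" "2020-09-02T16:00:00Z" )

# Build one sed script containing one substitution command per video,
# each command on its own line.
sedScript=""
for i in "${!videoIdList[@]}"
do
    sedScript+="s/\"${videoIdList[i]}\"/&,\"actualStartTime\": \"${startTimes[i]}\", \"actualEndTime\": \"${endTimes[i]}\"/g"$'\n'
done

# Make a single pass over the large string with all of the substitutions.
uploadsOutput=$( sed -e "$sedScript" <<< "$uploadsOutput" )

echo "$uploadsOutput"
```

This starts sed once and copies the large string once, rather than once per video. Whether that removes most of the one-second cost per video would have to be measured on the Synology. With a very large number of videos, the script text could be written to a file and passed with sed -f to stay clear of command-line length limits.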

tfabris commented 5 months ago

I am going to close this as "finished". My initial optimization improved the runtime from 11 minutes to 3 minutes, and that's more than enough.

The remaining optimization was proving to be a problem.

Here's what I was planning to do:

The problems:

So I'm halting this work for now and calling it good as-is. I'll save off the temporary work-in-progress stuff because there are some good notes in that code, some good things I could put to use later, but for now, this is dead in the water.