Improve the speed of CrowCamCleanup.sh

tfabris commented 8 months ago

The script CrowCamCleanup.sh runs slower than I would like. With my current video playlist it's taking about 11 minutes to run on my Synology. The biggest inefficiency is the code loop here:

jsonLineCount=0
LogMessage "dbg" "Processing JSON results from the API queries, this may take a moment"
while IFS= read -r line
do
    ((jsonLineCount++))
    if [ $(($jsonLineCount % 1000)) = 0 ]
    then
      LogMessage "dbg" "Processing JSON line $jsonLineCount. Found so far: ${#playlistItemIds[@]} playlistItemIds, ${#videoIds[@]} videoIds, ${#titles[@]} titles"
    fi

    lineResult=$( echo $line | grep '"id"' | cut -d '"' -f4 )
    if ! [ -z "$lineResult" ]
    then
      playlistItemIds+=( "$lineResult" )
    fi

    lineResult=$( echo $line | grep '"title"' | cut -d '"' -f4 )
    if ! [ -z "$lineResult" ]
    then
      titles+=( "$lineResult" )
    fi

    lineResult=$( echo $line | grep '"videoId"' | cut -d '"' -f4)
    if ! [ -z "$lineResult" ]
    then
      videoIds+=( "$lineResult" )
    fi
done <<< "$uploadsOutput"

I think it can be made much speedier by re-doing the code more like this (the following code is from something I recently added to TestFile.sh):

# Parse out all of the titles, video IDs, and start times from the video data
# Regex string looks like this:
#    "(title|videoId|actualStartTime)": "([^"])*"
# Which means:
#    "          Find a quote
#    (          Find one of these things in this group
#    title      Find the word title
#    |videoId   or the word videoId
#    |actual... or the word actualStartTime
#    )          Close up that group of things
#    ": "       Find a quote, a colon, a space, and a quote
#    (          Find the things in this group
#    [^"]       Find anything that's NOT a quote
#    )          Close up that group of things
#    *          Find any number of instances of that group in a row (characters that aren't quotes)
#    "          Find a quote
#
# This returns three strings from each entry in the JSON that look like this,
# in this order:
#
#    "title": "7:28 am - Juvenile crow begs from parent quite intensely"
#    "videoId": "1Y9BFUyzpys"
#    "actualStartTime": "2020-09-01T13:50:13Z"
#
#
# Special notes about this code which greps the data:
# - Grep commands: -o only matching text returned, -h hide filenames, -E extended regex
# - arrayName=( ):  Make sure to have the outer parentheses to make it a true array.
# - IFS_backup=$IFS; IFS=$'\n': IFS is the way it splits the resulting array. Normally it
#   splits on space/tab/linefeed, I'm changing it to just split on linefeed so that each
#   return value from the regex grep is its own array element.
IFS_backup=$IFS
IFS=$'\n'
videoDataArray=( $(grep -o -h -E '"(title|videoId|actualStartTime)": "([^"])*"' $videoData) )
IFS=$IFS_backup

# Syntax note: the pound sign retrieves the count/size of the array.
videoDataArrayCount=${#videoDataArray[@]}  

# Process all items
loopIndex=0
videosProcessed=0
while [ $loopIndex -lt $videoDataArrayCount ]   
do
    # Freshen variables at the start of each loop
    currentTitle=""
    currentvideoId=""
    currentActualStartTime=""

    # Record the first of the three items, the title
    oneVideoDataItem=${videoDataArray[loopIndex]}
    currentTitle=$( echo "$oneVideoDataItem" | cut -d '"' -f4 )
    ((loopIndex++))

    # Record the second item, the video ID
    oneVideoDataItem=${videoDataArray[loopIndex]}
    currentvideoId=$( echo "$oneVideoDataItem" | cut -d '"' -f4 )
    ((loopIndex++))

    # Record the third item, the start time
    oneVideoDataItem=${videoDataArray[loopIndex]}
    currentActualStartTime=$( echo "$oneVideoDataItem" | cut -d '"' -f4 )
    ((loopIndex++))  # Update this value after logging, so the number is still correct.
done
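
The rewritten loop still forks an `echo | cut` pipeline three times per video. A minimal sketch of doing the same extraction with shell parameter expansion only, assuming bash 4 or later (for `mapfile`) and values that contain no embedded quotes (the same assumption the regex already makes). The helper name `extract_value` and the sample data are hypothetical, and this has not been timed on a Synology:

```shell
#!/usr/bin/env bash

# Hypothetical helper: pulls the value out of a line shaped like
#   "key": "value"
# using only parameter expansion, so no subshell or pipeline is forked.
# The result is handed back in the REPLY variable.
extract_value() {
    local item=$1
    local separator='": "'
    item=${item#*"$separator"}   # drop everything through the first `": "`
    item=${item%\"}              # drop the trailing quote
    REPLY=$item
}

# Sample data standing in for the grep output (values from the comment above).
sampleData='"title": "7:28 am - Juvenile crow begs from parent quite intensely"
"videoId": "1Y9BFUyzpys"
"actualStartTime": "2020-09-01T13:50:13Z"'

# mapfile splits on newlines without having to change and restore IFS.
mapfile -t videoDataArray <<< "$sampleData"

extract_value "${videoDataArray[0]}"; currentTitle=$REPLY
extract_value "${videoDataArray[1]}"; currentvideoId=$REPLY
extract_value "${videoDataArray[2]}"; currentActualStartTime=$REPLY

echo "$currentTitle / $currentvideoId / $currentActualStartTime"
```

Returning the value through `REPLY` rather than through `$( ... )` is what avoids the fork; a command substitution around the function call would bring the subshell back.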

tfabris commented 8 months ago

I've checked in one big fix for this. The fix is currently sitting in the issue69 code branch (not merged to master yet): 01a112e

This improves the speed so that it takes only 3 minutes to run instead of 11 minutes. I think I can get that 3 minutes down even lower if I work on the next section. The next slow part is the loop that checks each item's timestamp, sees whether it is more than x days old, and decides whether to remove that video or not. That loop is pretty slow and accounts for the majority of the remaining slowness. I am leaving this bug open until I can optimize that.
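
One possible way to keep an age check like this cheap is sketched below. It assumes the timestamps are all in the fixed-width UTC form shown earlier (`2020-09-01T13:50:13Z`), in which case a plain string comparison orders them correctly and no `date` process is needed per video. The helper name `is_older_than` and the sample values are hypothetical, and this is not taken from the actual script:

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds (returns 0) when an ISO 8601 UTC timestamp is
# older than the cutoff. Both strings must use the same fixed-width
# YYYY-MM-DDTHH:MM:SSZ layout for the string comparison to be valid.
is_older_than() {
    local timestamp=$1 cutoff=$2
    [[ $timestamp < $cutoff ]]
}

# In real use the cutoff would be computed once, before the loop, for example
# with GNU date (an assumption; BSD and BusyBox date take different options):
#   cutoff=$(date -u -d "5 days ago" +%Y-%m-%dT%H:%M:%SZ)
cutoff="2020-09-05T00:00:00Z"   # fixed value so the example is reproducible

oldVideos=()
for startTime in "2020-09-01T13:50:13Z" "2020-09-06T08:00:00Z"; do
    if is_older_than "$startTime" "$cutoff"; then
        oldVideos+=( "$startTime" )
    fi
done

echo "Videos older than the cutoff: ${oldVideos[*]}"
```

If some timestamps carry fractional seconds or a non-UTC offset, the string comparison no longer holds and the values would have to be normalized first.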

tfabris commented 8 months ago

The slowness is in this SED command:

uploadsOutput=$( sed "s/\"$oneVideoId\"/&,\"actualStartTime\": \"$actualStartTime\", \"actualEndTime\": \"$actualEndTime\"/g" <<< $uploadsOutput )

The string "uploadsOutput" is large, and I'm re-sedding it dozens upon dozens of times. Each one takes about one second to process.

Doing "echo $uploadsOutput | sed" is no faster than doing "sed <<< $uploadsOutput", I tried that.

Investigating other methods.
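
One method that might be worth trying is to collect every substitution first and then run sed a single time, so the large string is piped through one process instead of one process per video. A minimal sketch, assuming bash 4 or later for associative arrays; the `startTimes` and `endTimes` lookup tables and the sample data are hypothetical stand-ins, and whether this is actually faster on the Synology is untested:

```shell
#!/usr/bin/env bash

# Stand-in for the large JSON string (hypothetical sample data).
uploadsOutput='"videoId": "aaa111"
"videoId": "bbb222"'

# Hypothetical lookup tables keyed by video ID.
declare -A startTimes=( [aaa111]="2020-09-01T13:50:13Z" [bbb222]="2020-09-02T14:00:00Z" )
declare -A endTimes=(   [aaa111]="2020-09-01T14:50:13Z" [bbb222]="2020-09-02T15:00:00Z" )

# Build one -e expression per video instead of running sed once per video.
sedArgs=()
for oneVideoId in "${!startTimes[@]}"; do
    actualStartTime=${startTimes[$oneVideoId]}
    actualEndTime=${endTimes[$oneVideoId]}
    sedArgs+=( -e "s/\"$oneVideoId\"/&,\"actualStartTime\": \"$actualStartTime\", \"actualEndTime\": \"$actualEndTime\"/g" )
done

# Make a single pass over the big string with all substitutions applied.
uploadsOutput=$( sed "${sedArgs[@]}" <<< "$uploadsOutput" )

echo "$uploadsOutput"
```

Each expression is the same substitution as the one-at-a-time command above; only the number of sed invocations changes. With a very long playlist the argument list could grow large, in which case writing the expressions to a file and using `sed -f` would be the equivalent form.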

tfabris commented 8 months ago

I am going to close this as "finished". My initial optimization improved the runtime from 11 minutes to 3 minutes, and that's more than enough.

The remaining optimization was proving to be a problem.

Here's what I was planning to do:

I had the idea that I would optimize it by using SED to split the playlist's JSON into an array. Each entry in the array would be a hunk of JSON representing one playlist entry.
I could then process those entries one at a time, each one of them being very small, avoiding the slowdown caused by re-sedding the entire JSON string for each processing step (the problem described in my prior comment).
That would also happen to make it easy to delete one of the videos in the middle of processing, thus also fixing bug #73 at the same time.
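
The splitting idea can be sketched without SED at all, using a plain bash read loop that forks no processes. This is only an illustration under stated assumptions: it assumes pretty-printed JSON with one field per line, and it assumes each playlist entry contains a `"kind": "youtube#playlistItem"` line that can serve as a boundary marker. The sample JSON is made up, the chunks are not valid standalone JSON (the braces end up offset by a line), and it has not been tried on the Synology:

```shell
#!/usr/bin/env bash

# Hypothetical, heavily trimmed sample of pretty-printed playlist JSON.
sampleJson='{
  "kind": "youtube#playlistItemListResponse",
  "items": [
    {
      "kind": "youtube#playlistItem",
      "id": "item1",
      "videoId": "aaa111"
    },
    {
      "kind": "youtube#playlistItem",
      "id": "item2",
      "videoId": "bbb222"
    }
  ]
}'

# Assumed boundary marker: the line that starts each playlist entry.
marker='"kind": "youtube#playlistItem"'

playlistEntries=()
currentEntry=""
while IFS= read -r line; do
    if [[ $line == *"$marker"* ]]; then
        # A new entry begins: store the previous one, if there was one.
        if [[ -n $currentEntry ]]; then
            playlistEntries+=( "$currentEntry" )
        fi
        currentEntry=$line
    elif [[ -n $currentEntry ]]; then
        # Lines before the first marker are skipped; the rest are appended.
        currentEntry+=$'\n'$line
    fi
done <<< "$sampleJson"

# Store the final entry, which has no following marker to trigger it.
if [[ -n $currentEntry ]]; then
    playlistEntries+=( "$currentEntry" )
fi

echo "Found ${#playlistEntries[@]} playlist entries"
```

Deleting a video mid-processing then amounts to `unset 'playlistEntries[i]'` for the matching index, which is the property that would have helped with bug #73.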

The problems:

SED gets very tricky once you start getting into edge cases. The online documentation is sketchy, the behaviors differ greatly between SED versions (BSD vs GNU, for example), and I couldn't get SED to behave itself. For example, just asking it to insert a linefeed before a curly brace took days of fiddling to get working. Then I needed to SED for two different regex searches at the same time, and that took another entire day of fiddling.
Once I had it working on my local computer, I copied it to the Synology to performance test it (just the array splitting, not the whole program, mind you) and it just KILLED the Synology. The livestream died from the CPU overload, and the splitting command, which took only 40 seconds on my local PC, took TEN MINUTES on the Synology. And that was in debug mode, where I limited it to only two pages of playlist output from the YouTube API. If I had upped that to the full playlist, it would have taken three times that long.
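
For reference, the BSD/GNU difference with linefeeds comes from `\n` in the replacement text, which GNU sed accepts but BSD sed does not. The form POSIX specifies, a backslash followed by a literal newline, works in both, and several searches can be combined in one invocation with repeated `-e` options. A small sketch on made-up sample input (not the exact command used in the work-in-progress code, which is not shown in this thread):

```shell
#!/usr/bin/env bash

# Hypothetical one-line sample input.
sample='"items": [ { "id": "item1" }, { "id": "item2" } ]'

# First -e: insert a linefeed before every opening curly brace.
# Second -e: insert a linefeed before every closing square bracket.
# The backslash at the end of a line, followed by a real newline inside the
# single quotes, is the portable way to write a newline in the replacement.
splitOutput=$( sed -e 's/{/\
{/g' -e 's/]/\
]/g' <<< "$sample" )

echo "$splitOutput"
```

The expressions are applied in order to each input line, so the second search sees the result of the first.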

So I'm halting this work for now and calling it good as-is. I'll save off the temporary work-in-progress stuff because there are some good notes in that code, some good things I could put to use later, but for now, this is dead in the water.

tfabris / CrowCam

Improve the speed of CrowCamCleanup.sh #72