CrowCamCleanup doesn't output proper JSON to crowcam-videodata

tfabris commented 5 months ago

CrowCamCleanup.sh writes out a file called crowcam-videodata which contains the playlist information of my CrowCam Archives playlist. In a separate project (not part of this GitHub repo), I copy that data file over to my web site and use some PHP to parse the file so that it can display the playlist on my web site.

The problem is that the file is not valid JSON because it simply takes each page of the YouTube API response (which maxes out at 50 items per page) and it just concatenates all of those pages together. CrowCamCleanup.sh makes no attempt to actually parse the JSON or to make parseable JSON. It creates a file that looks like this structurally:

     {
      "kind": "youtube#playlistItemListResponse", (...),
      "items": [  (...),(...),(...) ],
      "pageInfo": { (...) }
     }
     {
      "kind": "youtube#playlistItemListResponse", (...),
      "items": [  (...),(...),(...) ],
      "pageInfo": { (...) }
     }

In order to turn the list back into parseable JSON, the web site has to fix that part. The web site PHP code fixes it by looking for the square brackets (the ones that define the end of one "items" list and the beginning of the next one) and rips out anything in between, thus creating a parseable piece of JSON with a single large unpaginated list.
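For illustration, here is a rough way to spot the problem from a shell, assuming a trimmed-down two-page sample rather than real API output. The variable names `videoData` and `pageCount` are hypothetical, and counting `"kind"` markers is only a heuristic (a video description could in principle contain the same text):

```shell
#!/bin/sh
# Hypothetical two-page sample, trimmed to the fields that matter here.
videoData='{ "kind": "youtube#playlistItemListResponse", "items": [ {"id": "a"}, {"id": "b"} ], "pageInfo": { "totalResults": 4 } }
{ "kind": "youtube#playlistItemListResponse", "items": [ {"id": "c"}, {"id": "d"} ], "pageInfo": { "totalResults": 4 } }'

# Count the response envelopes. A single valid response contains exactly one.
pageCount=$(printf '%s\n' "$videoData" \
    | grep -o '"kind": "youtube#playlistItemListResponse"' \
    | wc -l | tr -d ' ')

if [ "$pageCount" -gt 1 ]; then
    echo "Found $pageCount concatenated pages; this is not a single JSON document."
fi
```

Anything above one means the file holds several top-level objects back to back, which a strict JSON parser will reject.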

This works, but it makes me feel dirty. The file should be parseable JSON when it's written, and the web site's PHP shouldn't have to strip out that section; CrowCamCleanup.sh should do the stripping instead. Here is one line of code (and lots of explanation) that will successfully do it inside of CrowCamCleanup. This needs to be tested on the Synology itself so that we can be sure it works as expected without a performance problem:

# At this point, the string $uploadsOutput is a collection of multiple pages'
# worth of JSON queries concatenated together, at 50 results per page. This
# produces invalid JSON output when treated as a single string. It looks like
# this right now:
#     {
#      "kind": "youtube#playlistItemListResponse", (...),
#      "items": [  (...),(...),(...) ],
#      "pageInfo": { (...) }
#     }
#     {
#      "kind": "youtube#playlistItemListResponse", (...),
#      "items": [  (...),(...),(...) ],
#      "pageInfo": { (...) }
#     }
# What we want is to run all the items together into a single list of items. Do this by
# finding the section between each items-bracket-end and each items-bracket-beginning
# and replacing the whole thing with a single comma. It replaces all the intermediate
# redundant instances of "pageInfo" and "kind:playlistItemListResponse" and leaves only
# a nice clean-smelling list of "items":[(...),(...),(...)] in the middle which is now
# proper JSON again. It still contains one "kind:playlistItemListResponse" at the top and
# one "pageInfo" at the end, and those are OK and we want to keep them.
# 
# Explanation of the SED statement below:
#
#  sed '   '      Spec the search with singlequotes so that you can search doublequotes without escaping.
#  s/             Substitute the following pattern with another pattern.
#  \]             Search for a single close square bracket (escaped).
#  , "pageInfo":  Search for a comma, a space, "pageInfo", a colon, and a space.
#  .*             Search for any number of any characters after "pageInfo": .
#   "items":      Search for space, "items", a colon, and a space.
#  \[             Search for a single open square bracket (escaped).
#  /              End of the sed "search for" phrase and beginning of the sed "replace with" phrase
#  ,              Replace with a single comma character.
#  /g             Global, replace all possible occurrences instead of just the first.
#
uploadsOutput=$( echo $uploadsOutput | sed 's/\], "pageInfo": .* "items": \[/,/g' )

tfabris commented 5 months ago

The code above is faulty. It works more or less as designed, but it doesn't correctly handle more than two pages' worth of data. Because ".*" is greedy, a single match runs from the end of the first items list all the way to the beginning of the last one: with seven pages, it strips everything from the end of page1's items up to the beginning of page7's items. The final output is still perfectly good parseable JSON, but it includes only the items[] from page1 and page7, not from any of the pages in between. Investigating a better parsing statement to fix it.
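The failure is easy to reproduce with a made-up three-page sample. This is only a sketch: the page contents and the names `page1`..`page3`, `sample`, and `greedy` are hypothetical stand-ins, already flattened onto one line:

```shell
#!/bin/sh
# Hypothetical three-page sample, one line, same overall shape as the real data.
page1='{ "kind": "k", "items": [ "a", "b" ], "pageInfo": { "n": 1 } }'
page2='{ "kind": "k", "items": [ "c", "d" ], "pageInfo": { "n": 2 } }'
page3='{ "kind": "k", "items": [ "e", "f" ], "pageInfo": { "n": 3 } }'
sample="$page1 $page2 $page3"

# The first attempt: ".*" swallows everything up to the LAST ' "items": ['.
greedy=$(printf '%s\n' "$sample" | sed 's/\], "pageInfo": .* "items": \[/,/g')

echo "$greedy"
# Prints:
# { "kind": "k", "items": [ "a", "b" , "e", "f" ], "pageInfo": { "n": 3 } }
```

The result parses fine, but "c" and "d" from the middle page are gone, which is exactly the page1-plus-last-page behavior described above.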

tfabris commented 5 months ago

Fixed with checkin 51fcce9 - The final code needed a little bit more finesse before it properly stripped the redundant sections. It ended up being this instead:

# Explanation of the SED statement below:
#
#  sed '   '      Surround SED command with singles so that you can use unescaped doubles.
#  s_             Substitute the following pattern with another pattern.
#   _             Use underscores as the SED delimiters instead of the traditional slashes
#                 that SED normally uses, because all the forward and back slashes together
#                 was hurting my tiny monkey brain.
#  \]             Search for a single close square bracket (escaped), the end of "items:[]"
#  , "pageInfo":  Search for a comma, a space, "pageInfo", a colon, and a space, the section
#                 I want to remove after all but the final items:[] section.
#  [   ]*         SPECIAL TRICK: We want to get anything up until the opening of the next 
#                 "items:[]" section. But if we just searched for ".*" here, it fails
#                 because it would grab all of the middle "items:[]" sections because it
#                 would greedily grab everything up until the FINAL occurrence of the
#                 opening bracket (i.e., the LAST "items:[]" section). And there's a side
#                 issue which is that the lazy match search "(.*?)" would work here if only
#                 SED would support it (it doesn't). So we have to do this trick...
#  [   ]*         Search for any number of characters within this character set...
#  [^\[]*         But you're searching for any characters which are NOT (^) an open square
#                 bracket (escaped) so that you non-greedily grab up until the FIRST
#                 occurrence of said bracket instead of the last one. We want to grab up to
#                 the opening bracket of the first occurrence of the next upcoming "items:[]"
#  \[             Search for a single open square bracket (escaped). This is to grab just that
#                 one bracket character that follows the prior search.
#  _              End of the sed "search for" phrase, start of the sed "replace with" phrase
#  ,              Replace with a single comma character.
#  _g             Global, replace all possible occurrences instead of just the first.
#
# This should work to remove all the doubled-up sections between the end of every "items:[]"
# section (which always precedes a "pageInfo:" section) and then up until the opening bracket
# of the next "items:[]" section. This should also be safe even if the user types square
# brackets into the video description, because it's only searching for a lonely opening
# square bracket in the cruft area between the items, not inside each item.
uploadsOutput=$( echo $uploadsOutput | sed 's_\], "pageInfo":[^\[]*\[_,_g' )
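As a sanity check, here is the same statement run against a made-up three-page sample. This is a sketch under assumptions: the page contents and the names `page1`..`page3`, `sample`, and `merged` are hypothetical, and the sample is already on a single line:

```shell
#!/bin/sh
# Hypothetical three-page sample, one line, same overall shape as the real data.
page1='{ "kind": "k", "items": [ "a", "b" ], "pageInfo": { "n": 1 } }'
page2='{ "kind": "k", "items": [ "c", "d" ], "pageInfo": { "n": 2 } }'
page3='{ "kind": "k", "items": [ "e", "f" ], "pageInfo": { "n": 3 } }'
sample="$page1 $page2 $page3"

# The fixed statement: stop at the FIRST open bracket after each "pageInfo".
merged=$(printf '%s\n' "$sample" | sed 's_\], "pageInfo":[^\[]*\[_,_g')

echo "$merged"
# Prints:
# { "kind": "k", "items": [ "a", "b" , "c", "d" , "e", "f" ], "pageInfo": { "n": 3 } }
```

All six items survive, with one "kind" at the top and one "pageInfo" at the end. One thing worth keeping in mind: sed works line by line, so the real statement appears to rely on the unquoted `echo $uploadsOutput` collapsing the multi-line API output into a single space-separated line. Quoting that variable would probably stop the pattern from matching.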

CrowCamCleanup doesn't output proper JSON to crowcam-videodata #83