Breaking change: help test the new Spotify folders format

mikez commented 6 months ago

Spotify changed the cache storage tech as of 2023-11-30. The code was substantially rewritten. You're invited to test the new setup and report any issues here.

So far, I've tested it on one macOS machine. More testing is especially needed for Linux and Windows.

If you were able to run this and verify your folder structure, please give a thumbs up here 👍; also, feel free to report the operating system you tested it on.

Installation instructions (unchanged)

curl -L https://git.io/folders > /usr/local/bin/spotifyfolders
chmod +x /usr/local/bin/spotifyfolders

Note: If you haven't changed your folder hierarchy recently, it may be stored in a compressed format. In that case, you may receive additional instructions on how to install the "snappy" decompression library. You can circumvent this by making a tiny change to your folder structure (which you can immediately change back again); this will store the folder hierarchy in an uncompressed cache.

(/cc @technomorph, @Nitemice, @inferrinizzard, @chocolateboy, @AlexSWall, @sydoracle, @ShiromMakkad)

mikez commented 6 months ago

macOS with Python 3.9, 3.10, 3.11, and 3.12 works here and verified folder structure. 👍

Nitemice commented 6 months ago

Seems to be totally broken for me on Windows. With the new update, I'm only getting an error message: "No data found in the Spotify cache. ...". My client, at least, doesn't seem to be using the root level log or lbd file. I'll investigate a bit more hopefully next week.

mikez commented 6 months ago

My client, at least, doesn't seem to be using the root level log or lbd file. I'll investigate a bit more hopefully next week.

👍

@Nitemice If it's using the old format, we could fall back to that. If it helps, what's the folder structure of your windows_appdata_path or windows_store_path?

On macOS, the structure is:

~/Library/Application Support/Spotify/PersistentCache
- Storage/
- Users/
  - XXX-user/
  - primary.ldb/
- public.ldb/

Nitemice commented 6 months ago

I've looked into it a bit more, and it seems like Spotify no longer uses the custom cache directory for storing the lbd files. It's in the windows_store_path for me now, and the structure matches what you described.

However, even when I run the script without the custom cache path, it fails. I've tried following what exactly it's looking for, but I don't quite understand the pattern it's trying to match in the file. Either the files on Windows are a bit different, or there's something going wrong with the pattern string. From just looking through the files naively, I've found a few that contains strings that look close to what we seem to be looking for, but they're incomplete compared to what the script is looking for.

Could you provide a bit of clarification on how the pattern string is constructed, if you know? Otherwise, I'm happy to provide some sample files if that would be helpful.

mikez commented 6 months ago

Thanks, @Nitemice!

if I understand you correctly, it sounds like the windows_store_path structure matches the macOS path structure. In that case, you should be able to run:

spotifyfolders --cache CACHE_DIR

where you replace CACHE_DIR with the windows_store_path.

If that doesn't work, then can you provide any sample files or directory structures you see? That would be most helpful. I'll look into it. — if you want to search for some pattern, look for "rootlist".

Nitemice commented 6 months ago

Hey @mikez, Yes, the structure under windows_store_path matches what you indicated on macOS, and the script already looks in that location, so I don't need to specify a CACHE_DIR. That part is working fine now.

The problem now is that the SpotifyLevelDB.get() function is failing to find a log or ldb file that contains a match for the LEVELDB_ROOTLIST_KEY. Looking though the files, and stepping through manually, it looks like it should find a match, but it fails on the bytestring_less_or_equal function (L417). I'm not sure why yet.

mikez commented 6 months ago

@Nitemice Ah, I see. The LevelDB files are there, but something seems to fail—it can't seem to find the key.

I can think of 2 ways to debug this:

You sent me a zip of your file structure under Spotify's windows_store_path and I'll take a look.
You install https://github.com/google/leveldb/ (can be challenging if you're not used to building C++) which gives you access to the leveldbutil command. With that one you can dump the contents of levelDB into plaintext and seek the contents of the key that way.

As an aside: Spotify uses a greenbase.KeyComparator to decide the order of keys in the table files. This contents of this function are not public AFAIK, so I made some educated guesses. The assumptions may be wrong here.

Nitemice commented 6 months ago

I've stepped through the code and I've managed to get it working. However, I had to make two major (probably breaking) changes.

At first, I thought the issue was that the LEVELDB_ROOTLIST_KEY was being evaluated wrong, and the group separator symbol wasn't being encoded correctly, but it turned out that it's actually the opposite. It seems that there is no group separator in the rootlist key in my levelDB. Instead it's just a space char. So I've changed L24 to:

LEVELDB_ROOTLIST_KEY = b"!pl#slc# spotify:user:{}:rootlist#"

That's odd and all, but after I made that change, I found that still didn't fix it. Instead now the code wasn't even reaching the files that contained the rootlist key. bytestring_less_or_equal returns True in a bunch of cases where the keys don't match, which caused it to break the loop and just give up searching. My initial fix was to edit bytestring_less_or_equal as below, because while I understand this behaviour when the function is used as a comparator, we're not really using it as one here:

        if byte1 == group_separator and byte2 != group_separator:
            return False
        if byte1 != group_separator and byte2 == group_separator:
            return False
        if byte1 < byte2:
            return False
        elif byte1 > byte2:
            return False

But on further review of the code, a simpler fix is to simply remove the break on L422.

I have pushed a branch with those changes for you to have a proper look: https://github.com/Nitemice/spotify-folders/commit/63a40c789de273dd7a95997816c8c2d81ccce194

mikez commented 6 months ago

@Nitemice Nice debugging there! :)

Change 1 is very interesting, and I'd like to learn more about this—is this Windows specific, is there something else going on? I wonder if we could get others to give us data here.

Change 2 makes it so that every key in every table is traversed. This takes longer. However, it could be a fall-back option, in case no key is found.

Here's what I propose in terms of changes (for now):

have two search keys. On macOS seek the first one first, on Windows the other one first. Then flip if nothing is found.
make a regular (fast) search; if that fails, then make the slower search.

I'm assuming the greenbase.KeyComparator may need fixing. What would be really helpful, is if you can check all the keys in your tables, and see how they're ordered; I'm curious if you can find any patterns. For me, it seems to be an alphanumeric regular ordering, except for the group_separators. If you don't know how to output the keys in a ldb file (in their stored order), I can write you a script.

mikez commented 6 months ago

@Nitemice Digging some more, I've noticed the pattern of the keys here seem to be:

PREFIX {SPACER} SUFFIX

where PREFIX can be something like:

!pl#add#
!pl#changes#
!cit#cit# etc.

and the spacers are

$
%
&
\x1d etc.

Does the spacer encode some number, is there some logic behind it? I don't understand. Also, the ordering of the keys becomes more confusing the closer I look at it.

mikez commented 6 months ago

@Nitemice FYI, pushed a new version based on our current understanding.

Nitemice commented 6 months ago

Change 1 is very interesting, and I'd like to learn more about this

I agree, it'd be good to have more data from others to see if it's the same. If I had to guess, I'd say it's a Windows vs Unix thing, but who knows!

Change 2 makes it so that every key in every table is traversed.

I'll admit I didn't fully understand what a lot of this code was doing, and I probably still don't fully get it. But now that I've more closely examined this loop, and the bytestring_less_or_equal function, I think I get why the break is there.

I think I've found a simpler solution that shouldn't slow down the runtime much if at all. What if we modify the bytestring_less_or_equal function so that space and group-separator character are both treated as "group separators"? I've made a commit as a demo, and it seems to work well for me: https://github.com/Nitemice/spotify-folders/commit/974692e32779c3c8e40a18be10b6b0ddeb4d406f

have two search keys. On macOS seek the first one first, on Windows the other one first. Then flip if nothing is found.

I think this is a reasonable solution, but if we start finding even more alternatives, we may need to rethink it (maybe in combination with the solution above?). Only issue with this is I tried running your updated code, and it wouldn't work for me. Apparently there isn't a reverse() function for tuples, so I had to turn the key_templates from a tuple to a list.

---    key_templates = (LEVELDB_ROOTLIST_KEY_1, LEVELDB_ROOTLIST_KEY_2)
+++    key_templates = [LEVELDB_ROOTLIST_KEY_1, LEVELDB_ROOTLIST_KEY_2]

make a regular (fast) search; if that fails, then make the slower search.

Like I mentioned, I think the above solution is cleaner and simpler. But it's really up to you.

If you don't know how to output the keys in a ldb file (in their stored order), I can write you a script.

I spent a large chunk of time over the last few days trying to dump the ldb various ways, but nothing has worked for me on Windows. If you've got a script that you think'll work (and doesn't depend on a dozen extra things), I'd be happy to give it a go.

I've noticed the pattern of the keys here seem to be:

Yeah, I think PREFIX {SPACER} SUFFIX is accurate, although each section can vary wildly. The details that seem most fixed to me are:

each prefix begins with a '!' and ends with a '#'
each suffix ends with a '#'

I've also seen the following as spacers:

\x16
'

mikez commented 6 months ago

I think I've found a simpler solution that shouldn't slow down the runtime much if at all. What if we modify the bytestring_less_or_equal function so that space and group-separator character are both treated as "group separators"? I've made a commit as a demo, and it seems to work well for me: Nitemice@974692e

I like simple solutions! :) In this specific case, this seems a bit speculative for me (and prone to false negatives), since we don't quite know yet what that "SPACER" symbol means. (Including its usage in greenbase.KeyComparator.) There may be times where it may have another function. I'd opt to be more conservative here and do the "slow search" until we have more clarity on greenbase.KeyComparator.

have two search keys. On macOS seek the first one first, on Windows the other one first. Then flip if nothing is found.

I think this is a reasonable solution, but if we start finding even more alternatives, we may need to rethink it (maybe in combination with the solution above?). Only issue with this is I tried running your updated code, and it wouldn't work for me. Apparently there isn't a reverse() function for tuples, so I had to turn the key_templates from a tuple to a list.

👍 This is an error. Thank you. I pushed a fix.

make a regular (fast) search; if that fails, then make the slower search.

Like I mentioned, I think the above solution is cleaner and simpler. But it's really up to you.

I much prefer cleaner and simpler. I see this as a transitory solution until we have clarity on how greenbase.KeyComparator works. Then we can massively clean up the code and only need one attempt.

If you don't know how to output the keys in a ldb file (in their stored order), I can write you a script.

I spent a large chunk of time over the last few days trying to dump the ldb various ways, but nothing has worked for me on Windows. If you've got a script that you think'll work (and doesn't depend on a dozen extra things), I'd be happy to give it a go.

https://gist.github.com/mikez/dcd8cd65319049e434820c3c9459cc4a

No dependencies; except for the folders.py script. Put this into the same directory as that file. Read the description in the beginning for a basic primer on which keys are where in what order.

As for greenbase.KeyComparator: I haven't understood the ordering. Sometimes it seems alphanumeric; at other times, it seems the SPACER symbol has higher weight than the rest, even the characters before it.

Nitemice commented 6 months ago

In this specific case, this seems a bit speculative for me (and prone to false negatives)

I don't disagree. You're probably right; this is the best we can do for now, until we have a much better handle on what's going on with the key's structure and sorting.

I've run the script you provided (thanks for that!) on the file that contain the 'rootlist' keys, because really, that's the only one we care about. I've attached a anonymised/summarised version of the output. 042129_summ.txt

Basically, it looks to be in alphanumeric order, with group separators (in this case space chars) treated as greater than other chars. We need some people on other platforms to confirm that it's the same for them, and we should be good from there.

mikez commented 5 months ago

Thank you, @Nitemice, this is very helpful. I now figured out the spacer symbol. It encodes the bytelength of the upcoming data. The same pattern seems to be used for all !-keys:

!{prefix}#{len}{data}#{len}{data}#…#"

Sometimes the data can contain #, ! and such symbols too, but the "len" prefix makes that clear. I pushed an update making the code simpler.

Greenbase comparator is still unclear.

mikez commented 5 months ago

Ok, I think I figured out the comparator approximately too now. It's essentially alphanumeric, but ignores the varint {length} sections. Pushed new version. Leaving in slow_search as a backup in case DB gets corrupted.

mikez commented 5 months ago

@Nitemice Can you confirm that this works on your end?

Nitemice commented 5 months ago

@Nitemice Can you confirm that this works on your end?

Yes, working for me now. Thank you!

mikez / spotify-folders

Breaking change: help test the new Spotify folders format #10

Installation instructions (unchanged)