iwonbigbro / gsync

RSync for Google Drive - GSync

Drive allows forward slash in file titles and lack of extension. #53

Open quadrater opened 10 years ago

quadrater commented 10 years ago

Google Drive allows forward slash in file names unlike Unix and Windows file systems which causes all kinds of interesting behaviors. In gsync a file on Drive named "C/D/E" located in the "/A/B/" folder will be treated as the file "E" residing in the folder "/A/B/C/D/" and since the folder "/A/B/C/" does not exist the file is ignored without error on retrieval.
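A minimal sketch of why the title gets misread, assuming a tool that simply joins the Drive folder path and the title into one path string. The function name is hypothetical and this is not gsync's actual code:

```python
import posixpath


def naive_local_path(drive_folder: str, title: str) -> str:
    """Join a Drive folder path and a file title as if the title were one path component."""
    return posixpath.join(drive_folder, title)


# The title "C/D/E" is a single Drive file, but after joining it reads as nested folders.
path = naive_local_path("/A/B", "C/D/E")
parent, name = posixpath.split(path)
print(parent)  # /A/B/C/D
print(name)    # E
```

Once the title is embedded in a path string, nothing distinguishes the slashes that came from the title from real folder separators.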

In the Google Drive Mac client forward slashes in file names are replaced with underscores. This is fine and all, but it requires the client to keep a persistent list of mapped file names ("C/D/E" => "C_D_E"); otherwise, on update, the file would be synced back to Drive under the new name "C_D_E", and updates to the original file "C/D/E" would overwrite changes in "C_D_E" on the client side. Curiously, if both "C/D/E" and "C_D_E" exist in the same folder on Drive, one of them is named "C_D_E (1)" in the Mac Drive client.
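To illustrate why the mapping has to be persisted, here is a rough sketch of an underscore substitution with numbered collisions. The Mac client's real algorithm is not known to me; the function name and the numbering rule are assumptions based only on the behaviour described above:

```python
def local_names(titles):
    """Map Drive titles to local names: '/' becomes '_', collisions get ' (n)' appended."""
    mapping = {}
    used = set()
    for title in titles:
        base = title.replace("/", "_")
        candidate = base
        n = 1
        while candidate in used:
            candidate = f"{base} ({n})"
            n += 1
        used.add(candidate)
        mapping[title] = candidate
    return mapping


mapping = local_names(["C_D_E", "C/D/E"])
# The local name alone cannot be turned back into the Drive title,
# so the reverse lookup has to be stored somewhere.
reverse = {local: title for title, local in mapping.items()}
print(mapping)
```

Which file receives the "(1)" suffix depends on processing order in this sketch, which is one more reason the mapping cannot be recomputed reliably on a later run.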

Part two of this puzzle is that Drive allows file titles without file extensions but mandates that a downloaded file have the file extension indicated in the API query response. This causes the same problem as the forward slashes: downloaded files can exist under file names that differ from the file's title on Drive. And, yes, the Mac Drive client implements this.

Let me know your thoughts on how to implement this in the long run. Printing warnings may be appropriate in the short term.

iwonbigbro commented 10 years ago

Gsync is targeted at POSIX-compliant platforms. I can certainly create a one-way translation from filenames in Drive to local file names, but these will not be tracked, since tracking is not portable. Tools like gsync and rsync are supposed to be decoupled from any particular environment. Requiring tracking information to be stored creates coupling, introduces unnecessary complexity and reduces robustness.

It seems to be a really bad design decision to adopt slashes as part of a filename, because this breaks platform portability. Tragically, those responsible for such design decisions are on their own.

quadrater commented 10 years ago

I sort of agree with the idea that allowing slashes is a bad design decision, then again not.

Once upon a time I designed a global distributed object store, and naming things any way you'd like is part of the base design, or rather the opposite. Unlike a file system, an object store isn't really concerned with the name of a file but with an abstract id of an object. It can be implemented as a flat id-to-object mapping without hierarchies, so the basic foundation is very simple, very scalable and very distributable. The id can even be implemented as a hash of the data in the object, so that multiple copies of the same object occupy less space (but this introduces the need for reference counting, which is often worse). The metadata layer is usually a fully separate entity, often implemented on top of a more classical ACID database or similar structure that provides some guarantees about the consistency of the data. The file name or title of the object is just a Unicode string; the path to the object is one or several Unicode strings and may or may not be implemented as a hierarchy.

With that in mind, the design goals of POSIX, the file system interface specification, and of the Google Drive object store are quite different, and gsync has to cope with that difference to some extent.
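As a rough illustration of the object store design described above (not of Google Drive's actual internals, which are not documented in this thread), a flat content-addressed store with reference counting and a separate metadata layer might look like this. All class, method and key names are hypothetical:

```python
import hashlib


class ObjectStore:
    """Flat, content-addressed store: object id -> (data, reference count)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The id is a hash of the content, so identical data is stored once.
        object_id = hashlib.sha256(data).hexdigest()
        stored, refs = self._objects.get(object_id, (data, 0))
        self._objects[object_id] = (stored, refs + 1)
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id][0]

    def release(self, object_id: str) -> None:
        # Drop one reference; delete the data when nobody refers to it any more.
        data, refs = self._objects[object_id]
        if refs <= 1:
            del self._objects[object_id]
        else:
            self._objects[object_id] = (data, refs - 1)


store = ObjectStore()
# Separate metadata layer: titles are arbitrary strings that point at object ids.
metadata = {
    "doc-1": ("C/D/E", store.put(b"hello")),
    "doc-2": ("C_D_E", store.put(b"hello")),
}
```

Both metadata entries share one stored object, and neither title is constrained by file system naming rules, which is the mismatch gsync runs into.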

iwonbigbro commented 10 years ago

Can't the system be extended to deal only with URL-encoded filenames? That would solve the problem and require no transcoding or tracking. It would require an upgrade step, though, but that would be fairly trivial.

quadrater commented 10 years ago

Sounds like an excellent idea, escaping with perfect gsync to Drive mapping. URL encoding seems to be a good route considering that you would get rid of a number of other characters causing grief on the command line as well and it's a very well known encoding.

!   *   '   (   )   ;   :   @   &   =   +   $   ,   /   ?   #   [   ]

As long as it's not interfering with Unicode encoded characters it sounds like a nice way forward.
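For reference, a small sketch of what full URL encoding does to a title, using Python's standard `urllib.parse`. The example title is made up. Note that `quote` also percent-encodes non-ASCII characters as UTF-8 bytes by default, which is the Unicode interaction mentioned above:

```python
from urllib.parse import quote, unquote

title = "C/D/E: résumé?"

# safe="" makes quote() escape '/' as well; by default '/' is left alone.
encoded = quote(title, safe="")
decoded = unquote(encoded)

print(encoded)  # C%2FD%2FE%3A%20r%C3%A9sum%C3%A9%3F
print(decoded == title)  # True
```

The round trip is lossless, but the local name becomes hard to read for titles with many non-ASCII characters, so encoding only the problematic characters may be preferable.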

quadrater commented 10 years ago

I did some investigation to find out more about the real-life use cases for forward slashes in file names. In my Drive about 25 of the 22k-ish files have a forward slash in the file name. They are either Google Docs, whose names I may or may not control, or pages printed from Google Chrome with the "Save to Google Drive" feature where the page lacks an HTML title, so the title displayed is part of the URL.

Another part mentioned in the issue is the file extension. In my Drive I have a handful of files without extensions. Most of them look like they were printed with Chrome using the HTML document title as the title, without the extension, so they end up being PDFs lacking the .pdf extension. That's more of an application bug on the Chrome side IMHO.

There's actually a third part to this: Google Docs themselves have no object representation; they are essentially just a link for opening them on the web. The Mac Google Drive implementation is a JSON stub that uses specific registered file extensions like gsheet so that clicking on the file opens it. Realizing this was an eye opener: I can't actually sync my Google Docs at all, not even with the native client. Oh, and gsync creates a zero-length file without an extension. Not sure what the right thing to do here is?

{"url": "https://docs.google.com/a/.../spreadsheets/d/10ucf2KMq...W7GiYkMfw/edit?usp=docslist_api",
 "resource_id": "spreadsheet:10ucf2KMq...W7GiYkMfw"}

iwonbigbro commented 10 years ago

I could implement partial url encoding, encoding only back and forward slashes in files when storing locally, which would preserve everything else. That way, url decode would work, decoding only those characters that are url encoded.

That said, it would still require some changes to software that deals with filenames where slashes are expected. Always url decoding a filename would have performance issues, but it would make it more robust.

quadrater commented 10 years ago

You could do partial URL encoding like the maps below. Beware of the not entirely obvious dual-case decoding of hexadecimals; not that it should occur, but for completeness. Otherwise, using URL encoding/decoding on the reserved characters should work as well.

drive_to_posix = {'%': '%25', '/': '%2f'}
posix_to_drive = {'%25': '%', '%2f': '/', '%2F': '/'}
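A minimal sketch of how these two maps could be applied, assuming single-pass regex substitution. The function names are hypothetical and this is not code from gsync. Decoding in one pass matters: replacing `%25` and `%2f` in two separate steps would turn an escaped literal `%2f` into a slash.

```python
import re

DRIVE_TO_POSIX = {"%": "%25", "/": "%2f"}
POSIX_TO_DRIVE = {"%25": "%", "%2f": "/", "%2F": "/"}


def drive_to_posix_name(title: str) -> str:
    """Escape only '%' and '/' so the title fits in a single path component."""
    return re.sub(r"[%/]", lambda m: DRIVE_TO_POSIX[m.group(0)], title)


def posix_to_drive_name(name: str) -> str:
    """Undo the escaping in one left-to-right pass, accepting either hex case."""
    return re.sub(r"%(?:25|2[fF])", lambda m: POSIX_TO_DRIVE[m.group(0)], name)


print(drive_to_posix_name("C/D/E"))     # C%2fD%2fE
print(posix_to_drive_name("C%2fD%2fE")) # C/D/E
```

This sketch does not handle NUL characters or the special names "." and "..", and a file created locally with a literal "%2f" in its name would be uploaded with a slash in its title.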

quadrater commented 10 years ago

For completeness: Google Drive also allows forward slashes in folder names. To be super complete, POSIX/UNIX filenames may not include NUL characters either (see http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations).

Does this also open Pandora's box regarding the handling of files and folders named a/../../../../../b, traversing outside of the download folder?
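One common way to guard against this, sketched here under the assumption that the tool resolves every target path before writing to it (the function name is hypothetical and this is not gsync's code):

```python
import os


def safe_local_path(download_root: str, relative_path: str) -> str:
    """Return the resolved target path, or raise if it falls outside download_root."""
    root = os.path.realpath(download_root)
    # realpath collapses '..' components and resolves symlinks along the way.
    target = os.path.realpath(os.path.join(root, relative_path))
    if os.path.commonpath([root, target]) != root:
        raise ValueError(f"refusing path outside download folder: {relative_path!r}")
    return target
```

With this check, a name such as a/../../../../../b raises an error instead of being written outside the download folder, while ordinary nested names pass through unchanged.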

iwonbigbro commented 10 years ago

Google Drive is not a file system; it is a database system that has a filesystem user interface. This means there is no restriction on filenames, because they are just document titles. Folders don't exist in Google Drive; parents do. A single file can have many parents. Folders are just metadata files for documents to reference as parents, and they can have parents of their own. Any file with no parents appears in the Google Drive root. There is also no trash folder. Files have a trash state. When this state is true, they are only visible as a child of the trash parent. Restoring from trash just removes this flag. Also, filenames or document titles are not unique identifiers. This means multiple documents with the same name can coexist, because their IDs are unique.
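A small sketch of the model as described in this comment, with a helper that finds titles a file system could not represent because they collide under the same parent. The field names and the "root" placeholder are illustrative assumptions, not the Drive API's actual schema:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DriveFile:
    file_id: str                                  # unique identifier
    title: str                                    # free-form, not unique
    parents: list = field(default_factory=list)   # parent ids; a file may have several
    is_folder: bool = False                       # folders are just files used as parents
    trashed: bool = False                         # trash is a state, not a location


def duplicate_titles(files):
    """Return {(parent id, title): [file ids]} for titles that occur more than once."""
    groups = defaultdict(list)
    for f in files:
        if f.trashed:
            continue
        # Per the description above, a file without parents shows up in the root.
        for parent in (f.parents or ["root"]):
            groups[(parent, f.title)].append(f.file_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


files = [
    DriveFile("id1", "report"),
    DriveFile("id2", "report"),
    DriveFile("id3", "notes", parents=["f1", "f2"]),
    DriveFile("id4", "notes", parents=["f1"], trashed=True),
]
print(duplicate_titles(files))  # {('root', 'report'): ['id1', 'id2']}
```

Any group reported here would have to be renamed, skipped, or synced one way only when mapped onto a local directory.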

With this in mind, filesystem restrictions are not imposed by Google Drive, and synchronisation with filesystems will fail. In the event of duplicate document titles or characters in file names that are not file system compliant, you can synchronise one way only. Anyone using Google Drive to back a software application needs to enforce their own document title validation and duplicate key validation, because Google Drive will not provide it.

I don't intend on supporting non filesystem specific use of Google drive. If Gsync is intended to be used for backup purposes, then files will have originated from a filesystem, and therefore will naturally comply with filesystem rules, be it POSIX or NT.

jhkrischel commented 10 years ago

A few thoughts from the peanut gallery:

1) I like the fact that gsync seems to follow symlinks if they are specified explicitly, but otherwise seems to ignore them.

For example, on my dreamhost account, I created a folder ~/gdrive, and then created symlinks to every other folder in ~/ (except gdrive).

Using this command, I can create a backup of all these folders in a subfolder in google drive:

find ./ -maxdepth 1 | xargs -I {} gsync --progress -rc {}/ drive://\!dreamhost/{}/

2) Ideally, I'd like to be confident that if I had to sync in the other direction, from google drive back to my file system, they'd be effectively identical. I plan on testing this by grabbing all my files into a separate dir, then running an rsync between the originals and my downloaded copy just to check them.

3) My primary goals here are to backup files on my hosting service (dreamhost), my wife's iPhoto library, my Aperture library, and my iTunes folder.

That all being said, if there are moments of ambiguity after I test my sync back from google drive, I'll be sure to file issues relating to the specific instances I find. My guess is that symlinks won't restore cleanly - now that I think about it, I might just try to create a sample project including variations of symlinks, to see how they behave in the current version.

jhkrischel commented 10 years ago

Test script:

mkdir /Users/Shared/gsync-test
cd /Users/Shared/gsync-test
git init; mkdir targetdir
echo "test file" > test.txt
ln -s ./targetdir/ ./symlinkdir
ln -s ./test.txt ./symlinkfile.txt
git add .; git commit -am "initial commit"; cd ..
gsync --progress -rc ./gsync-test/ drive://gsync-test/
mkdir gsync-test-from
gsync --progress -rc drive://gsync-test/ /Users/Shared/gsync-test-from/

Test results:

rsync, on the other hand, handles both symlink file and directory properly.

Fun fact, though - since I created the file system as a git repository, I can actually restore the proper file system through git:

cd /Users/Shared/gsync-test-from/
git reset --hard HEAD