iterative / PyDrive2

Google Drive API Python wrapper library. Maintained fork of PyDrive.
https://docs.iterative.ai/PyDrive2
Other
581 stars 69 forks source link

Does GDriveFileSystem work with Google Docs or other google formats? #191

Closed simone-viozzi closed 2 years ago

simone-viozzi commented 2 years ago

I was playing around and i found out that a minimal example like:

fs = GDriveFileSystem("root/tmp", auth)
print(fs.ls("root/tmp"))

only works if in the folder there are other folders and binary files. Otherwise it will raise:

Traceback (most recent call last):
  File "[...]/PyDrive2/pydrive2/files.py", line 224, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'md5Checksum'

because Docs/Sheet files doesn't have a checksum given their nature.

shcheklein commented 2 years ago

@simone-viozzi could you put the full stack trace please?

simone-viozzi commented 2 years ago

yes sorry, didn't think it was needed

Traceback (most recent call last):
  File "/home/simone/workspace/drive-transfere-ownership/PyDrive2/pydrive2/files.py", line 224, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'md5Checksum'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/simone/workspace/drive-transfere-ownership/tests/gdrivefs.py", line 85, in <module>
    print(fs.ls("root/tmp"))
  File "/home/simone/workspace/drive-transfere-ownership/PyDrive2/pydrive2/fs/spec.py", line 286, in ls
    "checksum": item["md5Checksum"],
  File "/home/simone/workspace/drive-transfere-ownership/PyDrive2/pydrive2/files.py", line 227, in __getitem__
    raise KeyError(e)
KeyError: KeyError('md5Checksum')

the relative code is:

https://github.com/iterative/PyDrive2/blob/64198b4c90b8074879c93f0d1eb53063b8b1c843/pydrive2/fs/spec.py#L273-L282

and than it doesn't find the md5Checksum into the metadata and fail

shcheklein commented 2 years ago

I think it should be fine for us to make it optional. Would you make a PR for this @simone-viozzi ?

simone-viozzi commented 2 years ago

Yes, but the checksum won't be the only issue here.

like the cp_file method on a gdrive file will produce an unreadable file:

fs = GDriveFileSystem("root/tmp", auth)
fs.cp_file('root/tmp/file1', 'root/tmp/file2')

image

and also a warning on the stdout:

[WARNING] googleapiclient.http -> _should_retry_response: Encountered 403 Forbidden with reason "fileNotDownloadable"

I suppose this is because right now to copy a file is downloaded and re-uploaded, and to do that with gdrive files you need to export them. Or do the copy server side.

Anyway i think i will draft a PR to discuss there, WDYT?

shcheklein commented 2 years ago

Yes, but the checksum won't be the only issue here.

@simone-viozzi we can go one by one, wdyt?

simone-viozzi commented 2 years ago

yes, of course let's start with the checksum

shcheklein commented 2 years ago

I think it should be fixed now.

simone-viozzi commented 2 years ago

There are many other methods that work unexpectedly with GDrive files.

I think the main issue is in open, because to download / read those files you have to export them in one of the supported formats. Also by doing so, some information is lost in the conversion (some spreadsheet formulas and App scrips are lost for example).

Anyway right now any method that internally use open (copy, move, cat) will read the file as plaintext plus some bytes:

For example a Google Docs file with content:

This is a test file
Some more string

id read by cat as:

b'\xef\xbb\xbfThis is a test file\r\nSome more string'

and also there is a warning from the logger googleapiclient.http:

Encountered 403 Forbidden with reason "fileNotDownloadable"

Obviously this render the file unreadable by google drive and any format information is lost.

There are multiple things we can do to fix this, like:

@shcheklein what do you think?

To export a file we need to pass the mime type to GetContentFile of GoogleDriveFile, this is used in GDriveFileSystem:

https://github.com/iterative/PyDrive2/blob/64a50a1580b2d276f8c9563f5ba620f7e7706119/pydrive2/fs/spec.py#L551-L570

and right now there is no way to propagate the mimetype until GetContentFile

shcheklein commented 2 years ago

Makes sense @simone-viozzi . Thanks. I think it's better to create a new ticket for this and fix it separately. We can go one by one. And, yes, we can pass additional parameters to open. I would create a ticket and list the methods first to review.

Methods like copy can be now potentially rewritten to utilize the new Copy implementation (and mime is not needed then).