omarryhan / aiogoogle

Async Google API Client + Async Google Auth
https://aiogoogle.readthedocs.io/en/latest/
MIT License

feature request: support for in-memory upload/download media #67

Closed sanzoghenzo closed 3 years ago

sanzoghenzo commented 3 years ago

Thanks a lot for this great library!

I would like to use it in a tool that performs automated backups of Autodesk BIM 360 files (but it could be any other storage service) to a Google Drive folder.

As of now, the only way to use the Drive method files.create() is to pass the upload_file parameter as a file path, but I would like to use a file object (the one downloaded from the other service) without writing it to memory first.

I see that the fire_request coroutine in AiohttpSession.send reads the file via aiofiles.open and then passes the file object (or a MultipartWriter object) to self.request.

Do you think it could be a good addition to support my use case?

If so, I could try to make a PR for that, but I would like to know your thoughts on this (and tips on how to accomplish this in a clean way).

omarryhan commented 3 years ago

Hey Andrea, thanks!

I'm not sure I fully understand your request. Can you elaborate on this please:

but I would like to use a file object (the one downloaded from the other service) without writing it to memory first.

What kind of file object? Also, why do you not want to write to memory first?

In all cases, you must use memory to upload anything. What the MultipartWriter does is upload a given file in chunks instead of loading the entire file (which can be gigabytes in size) into memory and uploading it in one go.
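Roughly, the chunked reading works like this (a minimal sketch, not aiogoogle's exact internals):

import aiofiles

async def aiter_file(file_path, chunk_size):
    # Yield the file in fixed-size chunks so only one chunk is in memory at a time
    async with aiofiles.open(file_path, "rb") as f:
        while True:
            chunk = await f.read(chunk_size)
            if not chunk:
                break
            yield chunk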

sanzoghenzo commented 3 years ago

Hi Omar, thanks for the quick response!

I used the wrong terms, forgive my poor English! I wanted to say exactly the opposite πŸ˜…

What I meant is that I don't want to write the file to disk, but rather use the content of a variable (the result of the download from the other service), to skip some extra steps (writing to a temporary file, passing it to files.create(), and deleting the file).

omarryhan commented 3 years ago

Don't worry about it!

Gotcha. I can see how that can be useful. I'll be happy to accept a PR for that.

Please correct me if I'm wrong, essentially what we'll be doing is instead of opening a file and passing the read bytes in the request's body, we want to pass bytes to the request directly.

I'm thinking a good place to pass it is when creating the request. So instead of passing upload_file we pass upload_body and this parameter should accept bytes.
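For illustration, a call under that proposal might look like this (a sketch only; upload_body is the proposed kwarg, and user_creds/downloaded_bytes are assumed to exist):

from aiogoogle import Aiogoogle

async def upload(user_creds, downloaded_bytes):
    async with Aiogoogle(user_creds=user_creds) as aiogoogle:
        drive = await aiogoogle.discover("drive", "v3")
        await aiogoogle.as_user(
            # upload_body is hypothetical here; it's the kwarg proposed above
            drive.files.create(upload_body=downloaded_bytes, json={"name": "backup.zip"})
        )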

I'm open to different suggestions.

sanzoghenzo commented 3 years ago

Please correct me if I'm wrong, essentially what we'll be doing is instead of opening a file and passing the read bytes in the request's body, we want to pass bytes to the request directly.

Exactly that!

I'm thinking a good place to pass it is when creating the request. So instead of passing upload_file we pass upload_body and this parameter should accept bytes.

I'm open to different suggestions.

So the plan should be:

Maybe the last point would be better accomplished by moving the validation and the reading of the file into methods on MediaUpload/ResumableUpload (and the new class, i.e. MediaUploadBytes), so that there's better separation of concerns.

Do you have anything against using dataclasses for declaring the models? I'm (over)using them (along with their cousin, pydantic's BaseModel) to get automatic dunder methods (and validation, in the case of BaseModel).
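As a toy example of what the generated dunders buy (illustrative only, not aiogoogle code):

from dataclasses import dataclass

@dataclass
class Chunk:
    data: bytes
    index: int

# __init__, __repr__ and __eq__ come for free:
assert Chunk(b"abc", 0) == Chunk(b"abc", 0)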

Let me know what you think

omarryhan commented 3 years ago

We can actually use upload_file for both bytes and file names. I'm always hesitant to add new kwargs to __call__() because they might conflict with the URL parameters being passed in **kwargs. If I could go back in time, I would replace **kwargs with a simple url_parameters kwarg.

About creating a new media_upload object: why not use the same one and add a file_body property instead? Then in Resource.__call__ we check the type of the passed upload_file: if it's bytes, we pass it to file_body, and if it's a string, we pass it to file_path.
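Roughly like this (a sketch of the dispatch only; file_body is the proposed kwarg, and the real __call__ builds the full MediaUpload from the discovery doc):

if isinstance(upload_file, bytes):
    media_upload = MediaUpload(file_body=upload_file)  # file_body: proposed kwarg
elif isinstance(upload_file, str):
    media_upload = MediaUpload(file_path=upload_file)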

ResumableUpload is actually passed to MediaUpload. Think of MediaUpload as the parent upload object; everything else should be passed to it to avoid duplication.

About dataclasses: to me it was just easier to use what I already knew worked than the (back then, new) dataclass feature, which I knew little about. In case you're confused about why some of the models inherit from dict: I did that because, initially, I only wanted to keep some of the models as dicts to enforce simplicity, but then I wasn't able to auto-document them using Sphinx (our docs lib). So I made this funky class-dict hybrid.

sanzoghenzo commented 3 years ago

Well, I've fully embarked on the typing/type-hinting boat, so I try to avoid giving a parameter multiple types, especially if different types mean different behaviors. For that reason I prefer to have separate, smaller classes, so it's clear what each object does, and to use composition to build my app.

This is what I thought of doing:

# Assumes aiogoogle-internal names: ValidationError, DEFAULT_UPLOAD_CHUNK_SIZE,
# ResumableUpload and _aiter_file.
from dataclasses import dataclass
from typing import Any, Optional

import aiofiles
from aiofiles import os as async_os


class MediaUpload:
    def __init__(
        self,
        file_path,
        upload_path=None,
        mime_range=None,
        max_size=None,
        multipart=False,
        chunk_size=None,
        resumable=None,
        validate=True,
    ):
        self.file_path = file_path
        self.upload_path = upload_path
        self.mime_range = mime_range
        self.max_size = max_size
        self.multipart = multipart
        self.chunk_size = chunk_size or DEFAULT_UPLOAD_CHUNK_SIZE
        self.resumable = resumable
        self.validate = validate

    async def run_validation(self):
        if self.validate and self.max_size:
            size = await self.get_size()
            if size > self.max_size:
                raise ValidationError(
                    f'"{self.file_path}" has a size of {size / 1000}KB. '
                    f'Max upload size for this endpoint is: '
                    f'{self.max_size / 1000}KB.'
                )

    async def aiter_file(self):
        return _aiter_file(self.file_path, self.chunk_size)

    async def read_file(self):
        async with aiofiles.open(self.file_path, "rb") as file:
            return await file.read()

    async def get_size(self):
        stat = await async_os.stat(self.file_path)
        return stat.st_size

@dataclass
class MediaUploadBytes:
    body: bytes
    upload_path: Optional[str] = None
    mime_range: Any = None
    max_size: Optional[int] = None
    multipart: bool = False
    chunk_size: int = DEFAULT_UPLOAD_CHUNK_SIZE
    resumable: Optional[ResumableUpload] = None
    validate: bool = True

    async def run_validation(self):
        if self.validate and self.max_size:
            size = await self.get_size()
            if size > self.max_size:
                raise ValidationError(
                    f'"{self}" has a size of {size / 1000}KB. '
                    f'Max upload size for this endpoint is: '
                    f'{self.max_size / 1000}KB.'
                )

    async def aiter_file(self):
        # return an async generator (instead of being one) to match MediaUpload's interface
        return self._aiter_body()

    async def _aiter_body(self):
        for x in range(0, len(self.body), self.chunk_size):
            yield self.body[x:x + self.chunk_size]

    async def read_file(self):
        return self.body

    async def get_size(self):
        return len(self.body)

This way, we delegate the handling of the upload media to the class itself, and each class has its own logic instead of relying on multiple ifs.

Well, I still have to add a conditional inside _build_upload_media to instantiate the right class, but after that there's no more type/parameter checking involved.
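Something like this (a sketch only; _build_upload_media's real signature takes more arguments from the discovery doc):

def _build_upload_media(file_path_or_bytes, **upload_kwargs):
    # pick the media class based on the input type
    if isinstance(file_path_or_bytes, bytes):
        return MediaUploadBytes(file_path_or_bytes, **upload_kwargs)
    return MediaUpload(file_path_or_bytes, **upload_kwargs)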

As for the ResumableUpload object, where is it used? Could I remove the file_path and make it compatible with both classes?

omarryhan commented 3 years ago

First of all, thank you so much!

I have a couple of issues with this.

We can't use aiofiles and async_os in here because they are asyncio-only, i.e. they work on neither Trio nor Curio. I kind of regret making the guarantee to support these libs, because I don't think people actually use them. But it's a decision we have to live with, especially since the cost of supporting them at this point isn't that big.

I get your point about preferring composition over inheritance. But having a separate class with only one different property (body vs. file_path) doesn't sit well with me, especially since the difference between them is minor. If we'll be making a separate class, can you please use inheritance for that?

For the sake of consistency, can we not use dataclasses? I don't mind converting all the models to dataclasses if we can make sure they'll be backward compatible and generally won't incur any unneeded costs, performance or otherwise.

About the resumable object. Yes, it's not being used. I decided to keep it to encourage future contributors to implement the network part of the resumable feature.

To answer your question, yes, you can remove the file_path attribute.

sanzoghenzo commented 3 years ago

Oh, I didn't think about the other async frameworks! And I appreciate that maybe two classes are too much for this.

Let's move the composition into the MediaUpload methods then, so each framework can pass in the right async functions:

class MediaUpload:
    def __init__(
        self,
        file_path_or_bytes,
        upload_path=None,
        mime_range=None,
        max_size=None,
        multipart=False,
        chunk_size=None,
        resumable=None,
        validate=True,
    ):
        if isinstance(file_path_or_bytes, bytes):
            self.file_body = file_path_or_bytes
            self.file_path = None
        else:
            self.file_body = None
            self.file_path = file_path_or_bytes
        self.upload_path = upload_path
        self.mime_range = mime_range
        self.max_size = max_size
        self.multipart = multipart
        self.chunk_size = chunk_size or DEFAULT_UPLOAD_CHUNK_SIZE
        self.resumable = resumable
        self.validate = validate

    async def run_validation(self, size_func):
        if self.validate and self.max_size:
            size = (
                await size_func(self.file_path)
                if self.file_path
                else len(self.file_body)
            )
            if size > self.max_size:
                raise ValidationError(
                    f'"{self}" has a size of {size / 1000}KB. '
                    f'Max upload size for this endpoint is: '
                    f'{self.max_size / 1000}KB.'
                )

    async def aiter_file(self, aiter_func):
        if self.file_path:
            return aiter_func(self.file_path, self.chunk_size)
        return self._aiter_body()

    async def _aiter_body(self):
        for x in range(0, len(self.file_body), self.chunk_size):
            yield self.file_body[x:x + self.chunk_size]

    async def read_file(self, read_func):
        # explicit None check: empty bytes (b"") is falsy but still a valid body
        return self.file_body if self.file_body is not None else await read_func(self.file_path)

    def __str__(self):
        return self.file_path or "File object"

This won't break the class initialization (well, in _build_upload_media you passed the path as a keyword argument, but it's also the first positional one, so it's safe to drop the argument name) and it should work in either case.
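From the end user's perspective (a sketch, assuming upload_file accepts bytes as agreed above, and that user_creds/backup_bytes already exist):

from aiogoogle import Aiogoogle

async def backup_to_drive(user_creds, backup_bytes):
    async with Aiogoogle(user_creds=user_creds) as aiogoogle:
        drive = await aiogoogle.discover("drive", "v3")
        # before: upload_file="/tmp/backup.zip" (a path on disk)
        # after this change: raw bytes work too, no temp file needed
        await aiogoogle.as_user(
            drive.files.create(upload_file=backup_bytes, json={"name": "backup.zip"})
        )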

I suspect I'll be the contributor for the resumable feature, since I have to back up large files πŸ˜‰

omarryhan commented 3 years ago

Looks pretty good to me πŸ‘

I suspect I'll be the contributor for the resumable feature, since I have to back up large files πŸ˜‰

Definitely looking forward to that! :)