gradio-app / gradio

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
http://www.gradio.app
Apache License 2.0
29.84k stars, 2.22k forks

after prediction, the input and output files remain in the temporary directory. #3950

Closed: falibabaei closed this issue 2 months ago

falibabaei commented 1 year ago

Hello, I am using Gradio to create a UI for my API. The problem is that after prediction, the input and output files remain in the temporary directory. This causes two problems:

  1. User privacy.
  2. I will run out of memory in the future.

Is it possible to remove these files automatically?

Frenziecodes commented 1 year ago

@falibabaei Yes, it's possible to remove the input and output files from the temporary directory automatically. You can use the tempfile module in Python to create temporary files and directories, and set them to delete automatically when the program ends.
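
A minimal sketch of that suggestion, using only the standard library. The helper name `process_upload` and the file name `input.txt` are hypothetical, and this shows the general tempfile pattern rather than anything Gradio does internally:

```python
import os
import tempfile


def process_upload(data: bytes) -> int:
    """Write data to a scratch file, process it, and leave nothing behind."""
    # TemporaryDirectory removes the directory and its contents on exit
    # from the with block, even if processing raises.
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "input.txt")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(data)
        # Stand-in for the real prediction step: report the file size.
        return os.path.getsize(path)
```

This only covers files that your own code creates; it does not touch copies that the framework makes on its own.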

falibabaei commented 1 year ago

I use tempfile to save the input files, and I know I can write code to remove them after the request is finished. What I mean is that Gradio should remove the files itself. I also have a Swagger UI, and there I do not need any code to remove the input and output files: the output file is just a stream, and once the user is done with the prediction everything is deleted without my writing any code. I had a memory problem, and when I investigated it I found that the input files remain in the temporary directory and cause it. I would not have known about this otherwise, and I could not find anything about it in the documentation. I think it is a serious problem and should be mentioned somewhere.

abidlabs commented 1 year ago

Hi @falibabaei the basic reason for this is that Gradio doesn't "know" when a user has stopped using the application and it is safe to clear the temporary files. I'll open this issue up for brainstorming if anyone has any suggestions

tomchang25 commented 1 year ago

There is an ugly solution for the Gradio Interface. I haven't started tracing the Blocks code yet, so I'm not sure whether a similar solution exists there.

import gradio as gr
import os

def ret(name):
    return name

demo = gr.Interface(
    ret,
    "video",
    "video",
)

if __name__ == "__main__":
    try:
        demo.launch()
    finally:
        print(len(demo.input_components[0].temp_files))
        print(len(demo.input_components))
        print(len(demo.output_components[0].temp_files))
        print(len(demo.output_components))
        for x in demo.input_components[0].temp_files:
            try:
                os.remove(x)
            except OSError as e:
                print(f"Error deleting file: {e}")
            finally:
                print(x)

        for x in demo.output_components[0].temp_files:
            try:
                os.remove(x)
            except OSError as e:
                print(f"Error deleting file: {e}")
            finally:
                print(x)

But there are two issues with it:

  1. It requires the user to write additional code
  2. The image will not be deleted cleanly when using gr.Image(type='filepath') because there are two temporary files.

falibabaei commented 1 year ago

Thank you very much for the quick reply. @tomchang25 I have the second problem with my input files, which are not images. They are just text files, but I still get two temporary files. I have the path of one file and can delete it, but I do not know the path of the second one. The same is true for the output files. Also, I am using Blocks.

tomchang25 commented 1 year ago

@falibabaei, yes, it seems that some resources are leaking from temp_files. Some temporary files are redundant or are never added to temp_files, so they remain in the temporary directory indefinitely. To solve this problem, we may need to go through every IOComponent to ensure that no temporary files are missed.

Anyway, I have packaged all this code so that users can use the deconstruct method to clear the temporary files after finishing the process.

# IOComponent
...
    def deconstruct(self):
        # Delete every temporary file this component has recorded.
        while self.temp_files:
            temp_file = self.temp_files.pop()
            os.remove(temp_file)
...

# Interface
...
    def deconstruct(self):
        # Clean up the temp files of all input and output components.
        for x in self.input_components:
            if isinstance(x, IOComponent):
                x.deconstruct()

        for x in self.output_components:
            if isinstance(x, IOComponent):
                x.deconstruct()
...

import gradio as gr
import os

def ret(name):
    return name

demo = gr.Interface(
    ret,
    gr.Image(type="filepath"),
    "image",
)

if __name__ == "__main__":
    try:
        demo.launch()
    finally:
        demo.deconstruct()

However, I'm not sure where the component is stored in Blocks. I would like to hear your opinion on this, @abidlabs

aliabid94 commented 1 year ago

For output files, it's hard to know when a usage has expired, since we return a link to the output files and not the file itself. I think the best approach would be to:

  1. Have a maximum total size limit for all generated temporary files. If creating a new file causes us to exceed this limit, we delete the oldest temporary files until we are back under the limit. This prevents the app from using up the whole disk and sets a maximum disk usage.
  2. When the app exits, clean up all the generated temporary files.
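
The two points above can be sketched with the standard library as follows. This is an illustration under stated assumptions, not Gradio's implementation: `enforce_size_limit`, `TEMP_DIR`, and the `app_tmp_` prefix are hypothetical names, and "oldest" is taken to mean oldest modification time.

```python
import atexit
import os
import shutil
import tempfile


def enforce_size_limit(directory: str, max_bytes: int) -> list:
    """Delete the oldest files in directory until its total size fits max_bytes."""
    entries = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            entries.append((info.st_mtime, info.st_size, path))
    entries.sort()  # oldest modification time first
    total = sum(size for _mtime, size, _path in entries)
    removed = []
    for _mtime, size, path in entries:
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed


# Hypothetical app-level temp directory, removed when the process exits.
TEMP_DIR = tempfile.mkdtemp(prefix="app_tmp_")
atexit.register(shutil.rmtree, TEMP_DIR, ignore_errors=True)
```

Calling `enforce_size_limit(TEMP_DIR, max_bytes)` right after each new file is written would keep disk usage bounded. Note that `atexit` handlers do not run if the process is killed abruptly.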

cowanAI commented 11 months ago

Hello @abidlabs, just confirming: is this still open? Do you know whether users can still access the temp files and, if so, is there any documentation on how to prevent this and protect users' privacy?

abidlabs commented 11 months ago

Hi @cowanAI yes this is still open. You can take a look at the current security policy here: https://gradio.app/sharing-your-app/#security-and-file-access

cowanAI commented 11 months ago

@abidlabs So basically, the highest level of security I can get is by creating a custom temporary directory with the most random name possible, is that correct? Even if I'm deploying from a Docker container, could hackers get access to my users' files? What if I use the 'chmod' command to set the appropriate permissions, for example making the folder accessible only by the owner with chmod 700 /path/to/temp/folder?
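
For reference, the chmod 700 idea can be expressed in Python roughly as follows. This is only a sketch: the `private_tmp_` prefix is a hypothetical name, and file permissions of this kind restrict other operating-system accounts only. They do not by themselves change what the server process, which owns the files, can read and serve.

```python
import os
import stat
import tempfile

# mkdtemp creates the directory readable, writable, and searchable only by
# the creating user (mode 0o700 on POSIX systems).
private_dir = tempfile.mkdtemp(prefix="private_tmp_")  # hypothetical prefix

# Equivalent of `chmod 700` for a directory that already exists.
os.chmod(private_dir, stat.S_IRWXU)

# Permission bits of the directory, for inspection.
mode = stat.S_IMODE(os.stat(private_dir).st_mode)
```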

cowanAI commented 11 months ago

Also, don't you think this exposes all Gradio users to a serious degree of legal liability?

cowanAI commented 11 months ago

Oh, I also forgot something very important: I encrypted the EBS volume of my EC2 instance. Do you think that helps prevent random people from accessing and eavesdropping on my users' files?

cowanAI commented 11 months ago

It seems that this doesn't apply to Docker containers on EC2 instances; Docker containers don't expose their temp files to users.

msis commented 11 months ago

Why not use tempfile within a context manager:

import tempfile

# audio_file: bytes
with tempfile.NamedTemporaryFile(suffix=".gradio") as temp:
    temp.write(audio_file)
    temp.flush()

    # do the business here

# The file is automatically removed now

It is safe to suppose that once the callback returns, the file isn't needed anymore and can be removed.

I'm thinking of a decorator maybe that can be added to callbacks that will use large files:

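import os
import tempfile
from functools import wraps

def temp_file_decorator(func):
    @wraps(func)
    def wrapper(large_file: bytes, *args, **kwargs):
        # delete=False lets func reopen the file by name on every platform;
        # the file is removed explicitly in the finally block below.
        with tempfile.NamedTemporaryFile(suffix=".tmp", delete=False) as temp_file:
            temp_file.write(large_file)
        try:
            return func(temp_file.name, *args, **kwargs)
        finally:
            os.remove(temp_file.name)
    return wrapper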

Then we can use it like the example in #4620 :

@temp_file_decorator
def get_duration_ms(audio_file):
    duration = mediainfo(audio_file)["duration"]  # a string in seconds
    duration_ms = int(float(duration) * 1000)
    print(f"{audio_file}: {duration_ms} ms")
    return duration_ms

cowanAI commented 9 months ago

@msis did this work for you?

cowanAI commented 9 months ago

Hello @abidlabs @freddyaboulton, do you have a solution by any chance, in case we want to generate bigger files and keep the app from failing in production due to RAM leakage?

abidlabs commented 9 months ago

Sorry @cowanAI we don't have a workaround at the moment, it's something we're going to look into.

henryruhs commented 8 months ago

> Hi @falibabaei the basic reason for this is that Gradio doesn't "know" when a user has stopped using the application and it is safe to clear the temporary files.

Sounds like a misconception in your application...

  1. no API to unload / destroy a component
  2. no lifetime or session id for temp files
  3. no cleanup of the session on exit / server shutdown
  4. no API to bypass temp files... allow us to refer directly to the origin source path
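
Points 2 and 3 could look roughly like the sketch below, which keeps one temporary directory per session so that it can be removed as a unit. The class `SessionTempDirs` and its methods are hypothetical names for illustration; nothing here is an existing Gradio API.

```python
import os
import shutil
import tempfile


class SessionTempDirs:
    """Track one temporary directory per session so it can be removed as a unit."""

    def __init__(self):
        self._dirs = {}

    def dir_for(self, session_id: str) -> str:
        # Create the session's directory on first use.
        if session_id not in self._dirs:
            self._dirs[session_id] = tempfile.mkdtemp(prefix="session_")
        return self._dirs[session_id]

    def close(self, session_id: str) -> None:
        # Remove the session's directory and everything inside it.
        path = self._dirs.pop(session_id, None)
        if path is not None:
            shutil.rmtree(path, ignore_errors=True)

    def close_all(self) -> None:
        # Intended for server shutdown.
        for session_id in list(self._dirs):
            self.close(session_id)
```

The hard part that the thread describes remains: something still has to decide when a session has ended and call `close`.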

jcheroske commented 7 months ago

I'm currently messing around with the UploadButton component. I've discovered that uploading a file actually creates two /tmp directories, each with a copy of the file. The .upload() event handler method only passes one of those files into the handler function, so the other directory and file are not easily deleted. I just figured that this should get a look when the tmpfile code gets an overhaul.

freddyaboulton commented 7 months ago

Thanks @jcheroske - I believe this is fixed in the v4 branch which will turn into gradio 4.0

Gauntlet173 commented 3 months ago

I think you need to update the documentation to make it explicit that the temporary files are available to "all (authenticated) users", not just "users".

Gauntlet173 commented 3 months ago

Here's my uneducated pitch: If auth is set in .launch(), generate a random key on startup, and encrypt and decrypt temporary files with a hash of that key and the uploading user's authentication token. Then even if you have a working password, and the filename, you would need the session key (which unlike the filename is never displayed on the screen) and the random key (which is server-side only, and lost when the server terminates), in order to overcome the encryption.
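
A sketch of just the key-derivation step of this idea, using the standard library. The names `SERVER_KEY` and `derive_file_key` are hypothetical, and HMAC-SHA256 is my own concrete choice for the "hash" mentioned above. The encryption of the files themselves is left out, since it would need a third-party cipher library.

```python
import hashlib
import hmac
import secrets

# Random key generated at startup; it lives only in server memory and is
# lost when the server terminates.
SERVER_KEY = secrets.token_bytes(32)


def derive_file_key(server_key: bytes, user_token: str) -> bytes:
    """Derive a 32-byte per-user key from the server key and the user's token."""
    return hmac.new(server_key, user_token.encode("utf-8"), hashlib.sha256).digest()
```

The same token always yields the same key while the server is running, and a different token or a restarted server yields a different one.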

TashaSkyUp commented 3 months ago

Gradio temp files grow without bound. This is problematic: every time I generate an image and pass it to an image component, another temp file is created, even though I'm passing a numpy array. Gradio should manage the total size of the temp files it generates. As it stands, Gradio seems to be eating up lots and lots of space on lots and lots of people's hard drives, for no good reason.

freddyaboulton commented 2 months ago

Hi @TashaSkyUp! I'm working on cleaning up the temp directory in #7447.

freddyaboulton commented 2 months ago

This will be possible in the next release of Gradio with the delete_cache parameter. It is a tuple of integers (frequency, age), both in seconds. Set it to (3600, 3660) to delete all files older than an hour every hour.
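
To illustrate what the age half of such a (frequency, age) pair means, here is a standard-library sketch. It is not Gradio's implementation: `delete_older_than` and its `now` parameter are hypothetical, and file age is assumed to be measured from the modification time.

```python
import os
import time


def delete_older_than(directory: str, age_seconds: float, now=None) -> list:
    """Remove files in directory whose modification time is older than age_seconds."""
    now = time.time() if now is None else now
    # Snapshot the listing first so that deletions do not disturb iteration.
    with os.scandir(directory) as it:
        entries = list(it)
    removed = []
    for entry in entries:
        if entry.is_file() and now - entry.stat().st_mtime > age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

A scheduler would call this once every `frequency` seconds, passing `age` as `age_seconds`.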

Runist commented 2 months ago

I think you should add a button that controls whether the cache is deleted.