googlecolab / colabtools

moving/renaming files caused data loss? #4184

Open danielthedifficult opened 11 months ago

danielthedifficult commented 11 months ago

Describe the current behavior: This afternoon, I used a bash script via Google Colab to move and rename files in our company Google Drive. After the move and rename, a substantial number of the files are missing, while the rest were moved and renamed correctly.

Describe the expected behavior: I expected it to perform the same as my local testing of the script.

What web browser you are using: Brave (Chromium)

Additional context: Here's the script I ran to perform the rename.

I wanted to rename from YEAR/CLIENT/PROJECT to CLIENT/YEAR_CLIENT/PROJECT

For folders (years) 2012-2016

%%bash
cd ./drive/MyDrive/MY_ARCHIVE_FOLDER &&
for year in {2012..2014}; do # NOTE I DID THIS IN TWO STEPS, 2012-2013, THEN 2014-2016
    for client in "$year"/*; do
        [ -d "$client" ] || continue
        client_name=$(basename "$client")
        mkdir -p "$client_name/$year"_"$client_name"
        for project in "$client"/*; do
            mv "$project" "$client_name/$year"_"$client_name"/
        done
        rmdir "$client"
    done
    rmdir "$year"
done

After doing this, I am missing tons of files that are neither in the new (renamed) location, nor in the file structure of the trashed folder.

What's also strange is that the folders were deleted in a way that causes them to appear in the Trash UI, but none of the deleted files are there.

Worth noting these are about 6-7 TB of files that I'm renaming/moving.

The Drive team advised me to use the restore feature to restore everything from today's session to try to get it back, but I'm not hopeful, as restoring from the Trash in the Drive UI was already unfruitful.

Even some of the folders that did get moved/renamed properly seem to be missing files. One folder was previously measured at 73 GB, but when I download it now I get only ~688 MB.
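A shortfall like this is easier to report when it is stated as a file count and a list of missing paths rather than a size estimate. The following is a minimal sketch, assuming both the moved folder and some reference copy (for example a restored or downloaded one) are reachable as ordinary directories. The helper names `count_files` and `manifest` are hypothetical and are not part of Colab or Drive.

```shell
# Hypothetical helpers for auditing a folder; names are illustrative only.

# Print the number of regular files under the directory given as $1.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Print a sorted list of file paths relative to the directory given as $1.
# Relative paths stay comparable even after the parent folder was renamed.
manifest() {
    ( cd "$1" && find . -type f | sort )
}

# Example usage (paths are placeholders):
#   count_files ClientA/2012_ClientA/ProjectX
#   manifest /path/to/reference/ProjectX > reference.txt
#   manifest ClientA/2012_ClientA/ProjectX > moved.txt
#   diff reference.txt moved.txt
```

Lines that `diff` reports only for the reference side would be the files that did not arrive. This only compares names, not contents or sizes.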

Help me Colab team, you're my only hope!

cperry-goog commented 11 months ago

I'm chasing folks internally here but not having luck so far :(

danielthedifficult commented 11 months ago

@cperry-goog, I really appreciate the follow up, even if it's just to say that there isn't much of an update :)

For what it's worth, the case number Product Engineering is supposed to be using is 48302562.

I have a suspicion... I'm guessing that the Linux interpreter Colab uses to run bash scripts sits on top of some sort of virtualized Linux filesystem provided by your Drive service. If so, I imagine that virtualization is some sort of 'invisible' middleware, and perhaps it does not execute filesystem instructions synchronously?

i.e. if I have a for loop to mv files, then rmdir after it's done, is there a chance the rmdir ran before the mv commands finished?

Let me be the first to say I'm a total idiot for including the rmdir in the original script; I should have done the mv's first and deleted the directories afterwards, as a separate step.
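For anyone attempting a similar reorganization, here is a rough sketch of what "move first, delete later" could look like. It is an illustration under assumptions, not a confirmed fix: whether the Drive mount applies operations asynchronously is only my guess above. The function names `move_year` and `leftover_files` are hypothetical. The move phase never deletes anything, refuses to overwrite an existing destination, and checks that each project is visible at its new path.

```shell
# Phase 1: move projects from YEAR/CLIENT/PROJECT to CLIENT/YEAR_CLIENT/PROJECT.
# Nothing is deleted here; source folders are left in place, even when empty.
move_year() {
    year=$1
    for client in "$year"/*/; do
        [ -d "$client" ] || continue
        client_name=$(basename "$client")
        dest="$client_name/${year}_${client_name}"
        mkdir -p "$dest"
        for project in "$client"*; do
            # Skip the unexpanded pattern when the client folder is empty.
            [ -e "$project" ] || continue
            target="$dest/$(basename "$project")"
            if [ -e "$target" ]; then
                echo "SKIPPED (destination exists): $project" >&2
                continue
            fi
            mv "$project" "$dest"/ || echo "FAILED: $project" >&2
            # Confirm the project is visible at its new location.
            [ -e "$target" ] || echo "MISSING after move: $target" >&2
        done
    done
}

# Phase 2 check: print how many regular files remain under the old tree.
# Old folders should only be removed by hand once this prints 0 and the
# new locations have been checked in the Drive UI.
leftover_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example usage, run from the archive folder:
#   move_year 2012
#   leftover_files 2012
```

Like the original loop, the `*` pattern does not match hidden files, so `leftover_files` may report a non-zero count for folders that contain dotfiles.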

Anyway, these represent 5 years of company data, please allow me to squeeze every last drop of empathy I can from you and the Product team to try to get this resolved and the data restored before it's lost forever 🙏