microsoft / azdo-databricks

A set of Build and Release tasks for Building, Deploying and Testing Databricks notebooks
MIT License

The deploy task doesn't delete files when they are deleted from the repo #41

Open JulsGranados opened 3 years ago

JulsGranados commented 3 years ago

The deployment notebook task only adds artifacts; it doesn't delete files in the destination workspace when they are deleted from the repo with a commit.

darrenpricedcww commented 3 years ago

Can we add the ability to complete a full sync (including deletes) as an urgent feature request, if it's not already possible?

Thanks

peterbarber commented 3 years ago

+1 - Also require this feature. Thanks

opaessens commented 3 years ago

+1

qiuyujx commented 3 years ago

+1

I'm currently looking at alternatives, such as cleaning the directory and then redeploying. That isn't ideal. However, using PowerShell to compare and delete files manually might be too complex.

sabacherli commented 3 years ago

+1

As a quick fix, you can add the following bash script as a task that cleans the workspace before the notebooks are deployed. But I agree that this should be an option in the "Deploy Notebook" task itself.

# Replace </workspace/folder> and </notebook/folder> with your own paths.
workspaces=$(databricks workspace ls --absolute --profile AZDO "</workspace/folder>")

# Delete the notebook folder only if it is listed under the parent folder.
if echo "$workspaces" | grep -w -q "</notebook/folder>"
then
    databricks workspace delete --recursive --profile AZDO "</notebook/folder>"
else
    echo "Workspace folder does not exist yet and thus cannot be deleted."
fi

qiuyujx commented 3 years ago

Thanks a lot, @sabacherli, for the script. However, I've just realised there is a small chance that scheduled jobs using these notebooks will fail if the referenced notebooks have been deleted but not yet redeployed. In our case there is a gap of about 30 seconds between the two actions. In other words, if a scheduled job kicks off during those 30 seconds, it will fail because the notebook doesn't exist.

I may still have to compare the files one by one and delete the specific files... Hope this feature can be added in the future.
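The file-by-file comparison mentioned above could be sketched roughly as follows. This is only a minimal sketch: `list_stale` and both directory arguments are hypothetical names. It assumes the workspace has already been exported to a local directory, and it only lists candidates without deleting anything.

```shell
# Sketch: print the files that exist in a local export of the workspace
# (first argument) but not in the repository checkout (second argument).
# list_stale and both directory names are hypothetical; nothing is deleted.
list_stale() {
  tgt=$1
  src=$2
  # Walk the export and keep only paths with no counterpart in the source tree.
  ( cd "$tgt" && find . -type f ) | LC_ALL=C sort | while IFS= read -r rel; do
    [ -e "$src/$rel" ] || printf '%s\n' "${rel#./}"
  done
}

# Usage (hypothetical paths):
#   list_stale ./workspace-export ./deployment/notebooks
```

Each listed path would still need to be mapped to its workspace path before being passed to `databricks workspace delete`. Deleting only these paths, rather than the whole folder, should avoid the window in which every notebook is missing.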

qiuyujx commented 3 years ago

In the end, I wrote my own bash script to delete the notebooks and folders that are not in source.

I'm posting it here in case it helps someone.

- script: |
    SRC=$(Build.SourcesDirectory)/deployment/notebooks
    TGT=.${{ parameters.databricksWorkspaceFolder }}  # a local copy of the notebooks for comparing purposes
    ADB=${{ parameters.databricksWorkspaceFolder }}

    IFS=$'\n'

    if [ -d $TGT ];
    then
      rm -r $TGT;
    fi

    databricks workspace export_dir $ADB $TGT --profile AZDO

    echo "Deleting the files not in source code... (no print means no delete)"
    for FILE in `diff -rq $TGT $SRC | grep -E "^Only in $TGT" | sed "s|Only in $TGT\(.*\): \(.*\)|\1/\2|"`
    do
      if [[ $FILE == *.py ]]
      then
        echo "Deleting notebook from workspace -> \"$ADB${FILE%.py}\""
        databricks workspace delete "$ADB${FILE%.py}" --profile AZDO
      else
        echo "Deleting folder from workspace -> \"$ADB$FILE\""
        databricks workspace delete -r "$ADB$FILE" --profile AZDO
      fi

      echo "Also delete from local buffer -> \"$TGT$FILE\""
      rm -r "$TGT$FILE"
      echo
    done

    echo "Deleting empty directories if there are any..."
    while [[ `find $TGT -type d -empty -print` ]];
    do
      find $TGT -type d -empty -print | sed "s|$TGT/\(.*\)|Deleting -> \"$ADB/\1\"|";
      {
        find $TGT -type d -empty -print | sed "s|$TGT/\(.*\)|databricks workspace delete \"$ADB/\1\" --profile AZDO|";
        find $TGT -type d -empty -print | sed "s|$TGT/\(.*\)|rmdir \"$TGT/\1\"|";
      } | bash;
    done
    echo "Done"
    echo 

    echo "Remove the directory for temp buffer..."
    if [ -d $TGT ];
    then
      rm -r $TGT;
      echo "$TGT has been Removed"
    fi
  displayName: 'Delete notebooks and folders not in source'
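To make the path mapping in the script above easier to follow, here is a small worked example of how one `diff -rq` line becomes a workspace path. The folder `/Shared/nb` and the file `etl/old_job.py` are hypothetical stand-ins for the pipeline parameter and a real notebook. Since the script only strips `.py`, the example assumes a Python notebook.

```shell
# Hypothetical values standing in for the pipeline parameters.
ADB="/Shared/nb"
TGT=".$ADB"

# A line as `diff -rq` reports it for a file that exists only in the export.
line="Only in $TGT/etl: old_job.py"

# Same sed expression as in the script: keep the sub-folder and the file name.
FILE=$(printf '%s\n' "$line" | sed "s|Only in $TGT\(.*\): \(.*\)|\1/\2|")

# Strip the .py extension of the exported file, then prefix the workspace folder.
WORKSPACE_PATH="$ADB${FILE%.py}"
echo "$WORKSPACE_PATH"   # prints /Shared/nb/etl/old_job
```

The printed value is the path the script passes to `databricks workspace delete`.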

michelekorell commented 2 years ago

A fast and easy way to clear the folder is to remove everything recursively with the databricks workspace command, in a bash script that runs after the Databricks CLI configuration task:

steps:
- bash: 'databricks workspace rm -r --profile AZDO "/Target/Path"'
  displayName: 'Clean Destination'

Mac-delValle commented 9 months ago

Please add the functionality described above. I would like the deployment to have an option to delete resources that no longer exist in the build.