inspera / blackbricks

Black for Databricks notebooks
MIT License
44 stars 9 forks source link

Troubleshooting failed parse #16

Closed asears closed 2 years ago

asears commented 2 years ago

When running blackbricks, version 0.6.7 on a file that cannot be parsed, instead of handling the error I am getting this error.

Traceback (most recent call last):
  File "C:\Users\xyz\.conda\envs\abc123\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\xyz\.conda\envs\abc123\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\xyz\.conda\envs\abc123\Scripts\blackbricks.exe\__main__.py", line 7, in <module>
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\typer\main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\blackbricks\cli.py", line 183, in main
    n_changed_files = process_files(
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\blackbricks\cli.py", line 25, in process_files
    content = file_.content
  File "C:\Users\xyz\.conda\envs\abc123\lib\site-packages\blackbricks\files.py", line 31, in content
    return f.read()
  File "C:\Users\xyz\.conda\envs\abc123\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

Is there some verbose command that can be used to troubleshoot which file is causing the issue in the repo? Can this error be handled safely as part of returning the reader?

bsamseth commented 2 years ago

The partial error message you posted shows that the last blackbricks-related file involved is blackbricks/files.py, line 31. Seemingly it fails to read the file. Other than that, I can't say much without more details. Please include a minimal reproducing example (i.e. a file which when running blackbricks on it, causes this error).

asears commented 2 years ago

Oddly enough, it was caused by a pycache folder being in the folder. Running blackbricks on individual files works. Deleting this folder solved the issue, though it has to be deleted in each subfolder it's located in. Probably some error handling around the readfile or increased verbosity to print out the file in question could help, or ignoring pycache folder. Not sure why it is acting on this folder. I'm running on Windows in Powershell.

bsamseth commented 2 years ago

Aha, that explains it. Like black, blackbricks will add files in folders recursively, meaning blackbricks . should try to format everything in the current folder. I guess what black does that blackbricks doesn't, is filtering out files that can't be parsed, such as binary files. It will already skip any files that don't have the Databricks header in it, but I did not consider binary files. It currently tries to read the files as text in order to check for the header line, but of course, that fails for binary files. The __pychace__ folders contain .pyc files, which are binary, hence the bug you see.

I'll see if I can't add a fix in not too long, should be a matter of filtering out Python files instead of adding any file. If you want I'm happy to review a PR :)

asears commented 2 years ago

Investigating the PR, I see blackbricks uses os rather than pathlib. I've added a PR to cover the files.py content before doing any changes. Any preference on using os rather than pathlib? I think upgrading to pathlib might help with reconciling to black's code and improve compatibility.

Based on https://github.com/github/gitignore/blob/main/Python.gitignore there are a bunch of files to ignore.

So far I've only encountered pycache folder being an issue, so perhaps excluding these makes sense:

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

An option to handle and skip the corrupt folders/file(s) might help with further inclusions, exclusions, and these could be exposed as some parameter.

Black does this "out of the box" so perhaps there's some means of removing the blackbricks code and implementing the required black api, or re-implementing a portion of what black does.

https://black.readthedocs.io/en/stable/usage_and_configuration/file_collection_and_discovery.html

Here's some relevant files searching for .gitignore https://github.com/psf/black/blob/521d1b8129c2d83b4ab49270fe7473802259c2a2/src/black/files.py https://github.com/psf/black/blob/cae7ae3a4d32dc51e0752d4a4e885a7792a0286d/src/black/__init__.py https://github.com/psf/black/blob/fb9fe6b565ce8a9beeebb51c23f384d1865d0ee8/tests/test_black.py

Just looking at the code, should be a few lines to change in resolve_filepaths to extend parameter(s) to ignore files and folders. A few possible implementations here...

bsamseth commented 2 years ago

This will be fixed in #25. Nothing complex, just simply skipping files that can't be read as Unicode text. There are many more complicated ways to do this, but in order to keep this project small and manageable, I think this enables 99% of use cases.