vocalpy / vak

A neural network framework for researchers studying acoustic communication
https://vak.readthedocs.io
BSD 3-Clause "New" or "Revised" License

vak prep crashes because of annotation file encoding #382

Closed marichard123 closed 2 years ago

marichard123 commented 3 years ago

When running the prep stage, at the point after which I believe the spectrograms are created, I get the following error message (I have attached the full error traceback at the end of the message):

  File "c:\users\richard\anaconda3\envs\vak-env\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 164: character maps to <undefined>

Initially I thought the problem was that Python was not set to the correct encoding, so inside the cp1252.py file I tried manually setting the encoding to ANSI and to UTF-8 (the encodings the text files were created in), with no success. I then noticed that the second half of the file contains a decoding table, a tuple in which the character codes are all listed by hand (Screenshot (106)). Among them are codes mapping to "undefined", and byte 0x90 is indeed one of them. In other words, it seems to me that this is not a case of 0x90 being undefined because of an encoding mismatch in whatever Python is using. Rather, 0x90 is hard-coded to map to "undefined", and the problem lies in whatever file the program is reading from.

I'm not sure how to identify the file that's causing the problem, or how to pinpoint what is causing 0x90 to appear in it. Have you run into a similar error during development, or would you have any insight into the nature of the problem?
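For reference, the behavior described above can be reproduced without vak. The sketch below is only an illustration, not code from vak or crowsetta. It assumes nothing beyond the standard library and shows that cp1252 leaves byte 0x90 unassigned, so any input containing that byte fails to decode whenever Python uses cp1252, which is a common default for `open()` on Windows when no `encoding` argument is given.

```python
import locale

# The encoding open() falls back to when no `encoding` argument is passed.
# On many Windows setups this reports 'cp1252'; on Linux it is usually 'UTF-8'.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# cp1252 has no character assigned to byte 0x90, so decoding it raises.
try:
    b"\x90".decode("cp1252")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("cp1252 failed:", err)

# latin-1 assigns a character to every byte, so the same byte decodes there
# (to a control character). The byte is only "undefined" relative to cp1252.
as_latin1 = b"\x90".decode("latin-1")
```

A byte such as 0x90 usually points to a file that is either in a different text encoding or not a text file at all.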

Full error traceback:

(vak-env) PS C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline> vak prep gy6or6_train.toml
determined that purpose of config file is: train
will add 'csv_path' option to 'TRAIN' section
purpose for dataset: train
will split dataset
making array files containing spectrograms from audio files in: C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline
creating array files with spectrograms
[########################################] | 100% Completed | 10.5s
creating dataset from spectrogram files in: C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline\spectrograms_generated_211108_024036
validating set of spectrogram files
[########################################] | 100% Completed |  9.8s
creating pandas.DataFrame representing dataset from spectrogram files
[########################################] | 100% Completed | 10.4s
Traceback (most recent call last):
  File "c:\users\richard\anaconda3\envs\vak-env\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Richard\Anaconda3\envs\vak-env\Scripts\vak.exe\__main__.py", line 7, in <module>
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\vak\__main__.py", line 45, in main
    cli.cli(command=args.command, config_file=args.configfile)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\vak\cli\cli.py", line 30, in cli
    COMMAND_FUNCTION_MAP[command](toml_path=config_file)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\vak\cli\prep.py", line 146, in prep
    logger=logger,
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\vak\core\prep.py", line 226, in prep
    logger=logger,
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\vak\split\split.py", line 138, in dataframe
    labels = labels_from_df(vak_df)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\vak\labels.py", line 79, in from_df
    annots = annotation.from_df(vak_df)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\vak\annotation.py", line 107, in from_df
    scribe.from_file(annot_path) for annot_path in vak_df["annot_path"].values
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\vak\annotation.py", line 107, in <listcomp>
    scribe.from_file(annot_path) for annot_path in vak_df["annot_path"].values
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\crowsetta\csv.py", line 220, in csv2annot
    set_header = set(reader.fieldnames)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 164: character maps to <undefined>

NickleDave commented 3 years ago

Hi @marichard123 thank you for raising a clear detailed issue -- sorry you're having this problem.

We are very excited to work with a computer scientist that would actually think to look at the encoding used 😁 but I think your hunch is right, that a good place to start is hunting down the offending file.

Before doing that: are you able to provide a little bit more information about your annotation file(s)? Is there a single .csv file with all the annotations for every audio file, or is it one annotation file per audio file?

I'm wondering if a quick fix is to simply save the original file(s) in a different encoding. If you can tell me more about how you generated them, that might help us figure it out.

I can see that the crash occurred when crowsetta tried to open one of them.

  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\crowsetta\csv.py", line 220, in csv2annot
    set_header = set(reader.fieldnames)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

If you are okay with sharing annotation files either here or by email, I am also happy to run them through crowsetta to see if I can diagnose what's going on.

It might be the case that we need to do a little work on the crowsetta code to make csv loading more general -- I'd really appreciate if you can help me figure that out before we have a ton of users testing things and getting angry about encoding errors 😅

It's probably better to use something other than our own hand-rolled csv read/write code anyway (as discussed in this issue)

As far as finding the file: A quick way to troubleshoot might be making a dataset with a single audio file and see if it still crashes.

It looks like the crash happened right at the start of checking annotations, so it might not be worth going to all this trouble.

But if I was going to try and track down the file, I would set a breakpoint with pdb as far down the stack as I can get. In this case, that looks like it's inside c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\crowsetta\csv.py. Specifically this line: https://github.com/NickleDave/crowsetta/blob/3db98293b7babd1526dc0f28d7919181ec5b591a/src/crowsetta/csv.py#L207

    with open(csv_filename, 'r', newline='') as csv_file:
        reader = csv.DictReader(csv_file)

        # DictReader automatically uses first row (AKA 'header') as fieldnames
        # when no argument supplied for fieldnames parameter
        # so we use that default to check validity of csv fieldnames
        set_header = set(reader.fieldnames)
        if set_header != set(CSV_FIELDNAMES):

I would edit the file to look something like this:

    with open(csv_filename, 'r', newline='') as csv_file:
        reader = csv.DictReader(csv_file)

        # DictReader automatically uses first row (AKA 'header') as fieldnames
        # when no argument supplied for fieldnames parameter
        # so we use that default to check validity of csv fieldnames
        try:
            set_header = set(reader.fieldnames)
        except:
            import pdb;pdb.set_trace()
        if set_header != set(CSV_FIELDNAMES):

and then when you get an error, just show the filename from the pdb prompt, e.g.,

(Pdb) p 'the bane of our existence: ' + csv_filename
'the bane of our existence: ./data/some-annotation.csv'

Of course be careful with editing the files in site packages since you can't set them back to the originals with git checkout or anything. You might also want to double check that you fixed the encoding back inside the cp1252.py file so that's not causing some unexpected errors--you definitely shouldn't have to touch that file! If you're getting really weird errors it might be worth just re-creating the environment from scratch.
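A less invasive way to hunt for the offending file, without editing anything in site-packages, is to try decoding each candidate file yourself. The following is a minimal sketch under stated assumptions: `find_undecodable` is a hypothetical helper, not part of vak or crowsetta, and the two demo files are made up purely to show the output.

```python
import tempfile
from pathlib import Path


def find_undecodable(paths, encoding="cp1252"):
    """Return (path, error) pairs for files that cannot be decoded with `encoding`."""
    failures = []
    for path in paths:
        try:
            # Read the raw bytes and decode explicitly, so the result does not
            # depend on the platform's default encoding.
            Path(path).read_bytes().decode(encoding)
        except UnicodeDecodeError as err:
            failures.append((Path(path), err))
    return failures


# Demo with two made-up files: one plain-text CSV, one containing byte 0x90.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.csv"
    good.write_bytes(b"label,onset_s\n")
    bad = Path(tmp) / "bad.csv"
    bad.write_bytes(b"MATLAB\x90binary")

    failures = find_undecodable([good, bad])
    bad_names = [path.name for path, _ in failures]
    print(bad_names)
```

In practice you would pass it every path the dataset refers to, including the annotation paths listed inside the CSV, rather than the demo files.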

Please do let me know what you can about the file types and we can take it from there.

marichard123 commented 3 years ago

Hi David! Thank you for the quick response. I haven't had time so far to poke around in the files some more, but for now I can upload my annotation files along with the corresponding audio files. The annotations are structured as a single long CSV file covering many different short audio files. I will try working with some of your suggestions for fixing the issue and get back to you ASAP!

Attachment: CSV File and Audio Files.zip

NickleDave commented 3 years ago

Thank you @marichard123 for sharing these files!!! I have it on my to-do list to see if I can replicate the bug with just crowsetta. Will do by the end of this weekend at the latest

NickleDave commented 3 years ago

Hi @marichard123 I am able to open the annotation file with just crowsetta alone.

I am starting to wonder if you are right, that it's literally just because of how the character encoding is set up in the env you're using.

Can you see if you still get the bug if you do the following in your environment?

(vak-env) PS C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline> ipython
In [1]: import crowsetta

In [2]: scribe = crowsetta.Transcriber(format='csv')

In [3]: annots = scribe.from_file('PipelineCSVOutput.csv')

I think it should happen when you execute that third line, if we are right about the encoding.

When I run it (on an Ubuntu-type OS) that line runs without error and I am able to do:

In [7]: annots
Out[7]: 
[Annotation(annot_path=PosixPath('C:\\Users\\Richard\\Documents\\Fall_2021\\Bat_Stuff\\TweetynetPipeline\\Logger16_16_200123_0958_VocExtractData1_mat_annotation.mat'), audio_path=PosixPath('C:\\Users\\Richard\\Documents\\Fall_2021\\Bat_Stuff\\TweetynetPipeline\\Logger16_16_200123_0958_VocExtractData1.wav'), seq=<Sequence with 2 segments>),
 Annotation(annot_path=PosixPath('C:\\Users\\Richard\\Documents\\Fall_2021\\Bat_Stuff\\TweetynetPipeline\\Logger16_17_200123_0958_VocExtractData1_mat_annotation.mat'), audio_path=PosixPath('C:\\Users\\Richard\\Documents\\Fall_2021\\Bat_Stuff\\TweetynetPipeline\\Logger16_17_200123_0958_VocExtractData1.wav'), seq=<Sequence with 2 segments>),
 ...

(which is what vak is trying to get under the hood when it crashes for you)

If you get that crash using just vak then could you please also share your environment? E.g. by creating an environment.yml file with conda and pasting the raw file into a comment, as well as attaching the file itself (in a zip, because github) as a reply?

I can try on a Windows machine and see if I can replicate.
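One way to test the encoding hypothesis on any platform is to force the encoding when opening a CSV, instead of relying on the platform default. The sketch below uses only the standard library and a made-up two-column file; it does not claim that crowsetta exposes an encoding option. The character U+0110 is chosen because its UTF-8 encoding is the byte pair C4 90, so the file contains byte 0x90.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "annot.csv"
    # U+0110 is stored in UTF-8 as the bytes C4 90, so the file contains 0x90.
    csv_path.write_text("label,onset_s\n\u0110,0.5\n", encoding="utf-8")

    # Explicit UTF-8: behaves the same on Windows and Linux.
    with open(csv_path, "r", newline="", encoding="utf-8") as csv_file:
        rows = list(csv.DictReader(csv_file))

    # Forcing cp1252 mimics a typical Windows default and reproduces
    # the same type of error as in the traceback.
    try:
        with open(csv_path, "r", newline="", encoding="cp1252") as csv_file:
            list(csv.DictReader(csv_file))
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

print(rows, cp1252_failed)
```

If a file opens cleanly as UTF-8 but fails as cp1252, the platform default is the likely culprit; if it fails under both, the file is probably not text.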

NickleDave commented 3 years ago

Wondering if it's something like this: https://github.com/quantumblacklabs/kedro/issues/291

marichard123 commented 3 years ago

Good morning @NickleDave! Unfortunately I was unable to run even the first few lines:

In [1]: import crowsetta

In [2]: scribe = crowsetta.Transcriber(format='csv')

In the first case, I kept running into "so-and-so module not found" errors, even though I often had the module installed. I tried massaging the code a bit to direct it towards the correct paths, but eventually I found that simply copy-pasting all the "missing" modules into the current directory at least seemed to fix the issue for now. When running the

In [2]: scribe = crowsetta.Transcriber(format='csv')

line, I receive the error "ValueError: specified vocal annotation format, csv, not installed, and noconfiguration was specified. Either install format, or specify configuration by passing as the 'config' argument to Transcriber"

Can you give some insight into what exactly installing a vocal annotation format would entail? I went into the Python source file and added a print statement

print(formats._INSTALLED)

in an attempt to see what the program considered valid vocal annotation formats, but I cannot see the output of the print statement on the console. ipython also does not seem to react at all to any changes made to the source file. As a secondary issue, do you know how I could overcome this problem? I'm assuming the changes are not being reloaded and registered, although my attempts at setting up automatic reloading have had no effect.

Additionally, thank you so much for taking the time to help me with these issues! In spite of the various unexpected difficulties encountered while trying to run the program on my end, I am very grateful that you are actively helping me through them :)

NickleDave commented 3 years ago

Hi @marichard123 -- glad to help, @yardencsGitHub and I are happy that people are using the software, and we're excited about what you're working on

I'm a little bit confused about why you wouldn't be able to import crowsetta though -- it's a dependency of vak so it should be installed in your conda env.

Before you spend a bunch of time hacking crowsetta, let's figure out why it's not working as expected.

Below is a checklist with things we can do to troubleshoot. Can you please try each item, and for each item reply with a separate comment?
Please include in the replies the exact commands you enter, and the entire output in the console including full stack traces, verbatim.

Technically we are now on 0.4.0.dev4 but please don't install that one yet, just use 0.4.0.dev1 so we can get to the root of the bug. If you just can't get enough troubleshooting, you could try to create a new env with the latest dev version installed (call it, say, "vak040dev4") and then see if you still get the errors. But I really doubt that's the source of the issue--we were running 0.4.0.dev1 just fine.

If this doesn't help us work out what's going on, maybe we can have a quick Zoom meeting. But let's see what you find out.

NickleDave commented 3 years ago

Hi again @marichard123 -- just following up to say it occurred to me that I should be able to use the files you shared to test whether I can replicate the error on Windows

I will do that in the next couple of days

That won't help us figure out quite what's going on with your set-up though. Not trying to rush you but please do go ahead and reply as I asked above whenever you have time.

marichard123 commented 2 years ago

Hello! Here are the steps that I have taken. I confirmed that I was already working inside the conda environment; the command "conda init powershell" gave:

no change     C:\Users\Richard\Anaconda3\Scripts\conda.exe
no change     C:\Users\Richard\Anaconda3\Scripts\conda-env.exe
no change     C:\Users\Richard\Anaconda3\Scripts\conda-script.py
no change     C:\Users\Richard\Anaconda3\Scripts\conda-env-script.py
no change     C:\Users\Richard\Anaconda3\condabin\conda.bat
no change     C:\Users\Richard\Anaconda3\Library\bin\conda.bat
no change     C:\Users\Richard\Anaconda3\condabin\_conda_activate.bat
no change     C:\Users\Richard\Anaconda3\condabin\rename_tmp.bat
no change     C:\Users\Richard\Anaconda3\condabin\conda_auto_activate.bat
no change     C:\Users\Richard\Anaconda3\condabin\conda_hook.bat
no change     C:\Users\Richard\Anaconda3\Scripts\activate.bat
no change     C:\Users\Richard\Anaconda3\condabin\activate.bat
no change     C:\Users\Richard\Anaconda3\condabin\deactivate.bat
no change     C:\Users\Richard\Anaconda3\Scripts\activate
no change     C:\Users\Richard\Anaconda3\Scripts\deactivate
no change     C:\Users\Richard\Anaconda3\etc\profile.d\conda.sh
no change     C:\Users\Richard\Anaconda3\etc\fish\conf.d\conda.fish
no change     C:\Users\Richard\Anaconda3\shell\condabin\Conda.psm1
no change     C:\Users\Richard\Anaconda3\shell\condabin\conda-hook.ps1
no change     C:\Users\Richard\Anaconda3\Lib\site-packages\xontrib\conda.xsh
no change     C:\Users\Richard\Anaconda3\etc\profile.d\conda.csh
no change     C:\Users\Richard\Documents\WindowsPowerShell\profile.ps1
No action taken.

marichard123 commented 2 years ago

The exact commands I entered + the error traceback:

(base) PS C:\Users\Richard> conda activate vak-env
(vak-env) PS C:\Users\Richard> cd Documents\Fall_2021\Bat_Stuff\TweetynetPipeline
(vak-env) PS C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline> ipython
Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import crowsetta
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-af422012dd85> in <module>
----> 1 import crowsetta

ModuleNotFoundError: No module named 'crowsetta'

In [2]:

marichard123 commented 2 years ago

The full conda environment, with the zip file attached at the bottom: name: vak-env channels:

marichard123 commented 2 years ago

Trying a new virtual environment with a different name (vak-env-test-bug), I get the exact same error:

In [1]: import crowsetta
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-af422012dd85> in <module>
----> 1 import crowsetta

ModuleNotFoundError: No module named 'crowsetta'

In [2]:

NickleDave commented 2 years ago

thank you for doing all that @marichard123, it helps me see what's going on

I think the first issue is that I assumed ipython would be installed in your environment, but it's not. Sorry! (It's a dev dependency for vak so I did have it installed 😬)

Somewhat confusingly, conda will happily start ipython from the base environment without telling you.
If you do

(vak-env) PS C:\Users\Richard> which ipython

then I think you will get some path that is not inside C:\Users\Richard\Anaconda3\envs\vak-env, probably it's the one in base instead.
That's why it can't "see" crowsetta.

I can also tell because it's a different Python (3.8) from the one you have installed in your env (3.6).
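If `which` is not available in your PowerShell session (the built-in equivalent is `Get-Command ipython`), the same check can be done from inside the Python or IPython prompt itself. This is a small standard-library sketch, with the package name `crowsetta` as the only input; it reports which interpreter is running and whether that interpreter can see the package.

```python
import importlib.util
import sys

# Path of the interpreter running this session. Inside an activated conda env
# it should sit under that env's directory (e.g. ...\envs\vak-env\).
print("interpreter:", sys.executable)
print("prefix:     ", sys.prefix)
print("version:    ", sys.version.split()[0])

# find_spec returns None when the package is not importable from this interpreter.
spec = importlib.util.find_spec("crowsetta")
crowsetta_location = spec.origin if spec is not None else None
print("crowsetta:  ", crowsetta_location or "not importable from this interpreter")
```

If the interpreter path points into the base Anaconda3 directory rather than envs\vak-env, the session is running the base environment's Python.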

To fix, please do:

(vak-env) PS C:\Users\Richard> conda install ipython

Then try this again:

(vak-env) PS C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline> ipython
In [1]: import crowsetta

In [2]: scribe = crowsetta.Transcriber(format='csv')

In [3]: annots = scribe.from_file('PipelineCSVOutput.csv')

and please let me know what you get in that case.

marichard123 commented 2 years ago

We have found the issue! The encoding error was a result of vak attempting to open a .notmat file as a .csv file. In our initial CSV annotation file, the "annotation file" column (the sixth column) contained the paths of .mat files. We changed our CSV annotation file so that this column instead contains the path of the CSV annotation file itself. For example, every row now has the following path string in the sixth column: "C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline\PipelineCSVOutput.csv" This simple change completely fixes the encoding problem described above.
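A quick check for this situation is to list the values in the annotation-path column that do not point to a .csv file. The sketch below is an assumption-laden illustration: `non_csv_annot_paths` is a hypothetical helper, the column name `annot_path` is assumed from the traceback above, and the demo file uses a made-up two-column header rather than the real crowsetta CSV layout.

```python
import csv
import tempfile
from pathlib import Path


def non_csv_annot_paths(csv_filename, column="annot_path", encoding="utf-8"):
    """Return the distinct values in `column` that do not end in .csv."""
    with open(csv_filename, "r", newline="", encoding=encoding) as csv_file:
        reader = csv.DictReader(csv_file)
        values = {row[column] for row in reader}
    return sorted(v for v in values if not v.lower().endswith(".csv"))


# Demo with a made-up annotation file: one row points at a .mat file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "PipelineCSVOutput.csv"
    demo.write_text(
        "label,annot_path\n"
        "a,C:\\data\\rec1_annotation.mat\n"
        "b,C:\\data\\PipelineCSVOutput.csv\n",
        encoding="utf-8",
    )
    suspicious = non_csv_annot_paths(demo)
    print(suspicious)
```

Any path it reports is a file that would be opened as CSV text even though it is likely binary, which is where a byte such as 0x90 can come from.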

NickleDave commented 2 years ago

Going to close this original issue as fixed -- others referenced above addressed the root issue