Closed marichard123 closed 2 years ago
Hi @marichard123 thank you for raising a clear detailed issue -- sorry you're having this problem.
We are very excited to work with a computer scientist that would actually think to look at the encoding used 😁 but I think your hunch is right, that a good place to start is hunting down the offending file.
Before doing that: are you able to provide a little bit more information about your annotation file(s)? Is there a single .csv file with all the annotations for every audio file, or is it one annotation file per audio file?
I'm wondering if a quick fix is to simply save the original file(s) in a different encoding. If you can tell me more about how you generated them, that might help us figure it out.
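In case it helps, here is a minimal sketch of what "saving in a different encoding" could look like from Python rather than from an editor. It assumes the file really is text in a known source encoding (cp1252 is only a guess here), and `reencode` is a hypothetical helper name, not part of crowsetta or vak.

```python
def reencode(src, dst, src_encoding="cp1252", dst_encoding="utf-8"):
    """Read `src` as `src_encoding` and write the same text to `dst` as `dst_encoding`.

    Hypothetical helper for illustration; the default encodings are assumptions.
    """
    # newline="" keeps the original line endings untouched, which matters for csv files
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as f:
        f.write(text)
```

If the source encoding is guessed wrong, the read step raises UnicodeDecodeError instead of silently writing bad text, which is a useful signal in itself.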
I can see that the crash occurred when crowsetta tried to open one of them.
  File "c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\crowsetta\csv.py", line 220, in csv2annot
    set_header = set(reader.fieldnames)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
  File "c:\users\richard\anaconda3\envs\vak-env\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
If you are okay with sharing annotation files either here or by email, I am also happy to run them through crowsetta to see if I can diagnose what's going on.
It might be the case that we need to do a little work on the crowsetta code to make csv loading more general -- I'd really appreciate it if you can help me figure that out before we have a ton of users testing things and getting angry about encoding errors 😅
It's probably better to use something other than our own hand-rolled csv read/write code anyway (as discussed in this issue)
As far as finding the file: A quick way to troubleshoot might be making a dataset with a single audio file and see if it still crashes.
It looks like the crash happened right at the start of checking annotations, so it might not be worth going to all this trouble.
But if I was going to try and track down the file, I would set a breakpoint with pdb as far down the stack as I can get.
In this case, that looks like it's inside c:\users\richard\anaconda3\envs\vak-env\lib\site-packages\crowsetta\csv.py.
Specifically this line:
https://github.com/NickleDave/crowsetta/blob/3db98293b7babd1526dc0f28d7919181ec5b591a/src/crowsetta/csv.py#L207
with open(csv_filename, 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    # DictReader automatically uses first row (AKA 'header') as fieldnames
    # when no argument supplied for fieldnames parameter
    # so we use that default to check validity of csv fieldnames
    set_header = set(reader.fieldnames)
    if set_header != set(CSV_FIELDNAMES):
I would edit the file to look something like this:
with open(csv_filename, 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    # DictReader automatically uses first row (AKA 'header') as fieldnames
    # when no argument supplied for fieldnames parameter
    # so we use that default to check validity of csv fieldnames
    try:
        set_header = set(reader.fieldnames)
    except Exception:
        # temporary debugging hook: drop into pdb with csv_filename in scope
        import pdb; pdb.set_trace()
    if set_header != set(CSV_FIELDNAMES):
and then when you get an error, just show the filename from the pdb prompt, e.g.,
(Pdb) p 'the bane of our existence: ' + csv_filename
'the bane of our existence: ./data/some-annotation.csv'
Of course be careful with editing the files in site packages since you can't set them back to the originals with git checkout or anything. You might also want to double check that you fixed the encoding back inside the cp1252.py file so that's not causing some unexpected errors -- you definitely shouldn't have to touch that file!
If you're getting really weird errors it might be worth just re-creating the environment from scratch.
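An alternative that avoids editing anything in site-packages: a rough sketch that walks a data directory and reports which files fail to decode under a given encoding. `find_undecodable` is a made-up name for illustration, and the default pattern and encoding are assumptions; passing pattern="*" would also catch files that are not csv at all.

```python
from pathlib import Path


def find_undecodable(root, pattern="*.csv", encoding="cp1252"):
    """Return (path, byte offset, byte value) for each file that fails to decode.

    Hypothetical helper; it only checks decodability, not whether the csv is valid.
    """
    bad = []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            path.read_bytes().decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first offending byte
            bad.append((path, err.start, err.object[err.start]))
    return bad
```

Something like `for path, pos, byte in find_undecodable("./data"): print(path, pos, hex(byte))` would then list the candidates, and the offset can be compared against the "position" reported in the traceback.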
Please do let me know what you can about the file types and we can take it from there.
Hi David! Thank you for the quick response. I haven't had time so far to poke around in the files some more, but for now I can upload my annotation files along with the corresponding audio files. The annotations are structured as a single long CSV file covering many different short audio files. I will try working through some of your suggestions for fixing the issue and get back to you ASAP!
Attachment: CSV File and Audio Files.zip
Thank you @marichard123 for sharing these files!!!
I have it on my to-do list to see if I can replicate the bug with just crowsetta.
Will do by the end of this weekend at the latest
Hi @marichard123 I am able to open the annotation file with crowsetta alone.
I am starting to wonder if you are right, that it's literally just because of how the character encoding is set up in the env you're using.
Can you see if you still get the bug if you do the following in your environment?
(vak-env) PS C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline> ipython
In [1]: import crowsetta
In [2]: scribe = crowsetta.Transcriber(format='csv')
In [3]: annots = scribe.from_file('PipelineCSVOutput.csv')
I think it should happen when you execute that third line, if we are right about the encoding.
When I run it (on an Ubuntu-type OS) that line runs without error and I am able to do:
In [7]: annots
Out[7]:
[Annotation(annot_path=PosixPath('C:\\Users\\Richard\\Documents\\Fall_2021\\Bat_Stuff\\TweetynetPipeline\\Logger16_16_200123_0958_VocExtractData1_mat_annotation.mat'), audio_path=PosixPath('C:\\Users\\Richard\\Documents\\Fall_2021\\Bat_Stuff\\TweetynetPipeline\\Logger16_16_200123_0958_VocExtractData1.wav'), seq=<Sequence with 2 segments>),
Annotation(annot_path=PosixPath('C:\\Users\\Richard\\Documents\\Fall_2021\\Bat_Stuff\\TweetynetPipeline\\Logger16_17_200123_0958_VocExtractData1_mat_annotation.mat'), audio_path=PosixPath('C:\\Users\\Richard\\Documents\\Fall_2021\\Bat_Stuff\\TweetynetPipeline\\Logger16_17_200123_0958_VocExtractData1.wav'), seq=<Sequence with 2 segments>),
...
(which is what vak is trying to get under the hood when it crashes for you)
If you get that crash using just vak then could you please also share your environment?
E.g. by creating an environment.yml file with conda and pasting the raw file into a comment, as well as attaching the file itself (in a zip, because github) as a reply?
I can try on a Windows machine and see if I can replicate.
Wondering if it's something like this: https://github.com/quantumblacklabs/kedro/issues/291
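For reference, a small sketch of how to check the "encoding of the env" hunch. When open() is called without an encoding argument, Python falls back to a platform-dependent default, which on many Windows setups is cp1252; that matches the cp1252.py frame in the traceback. `read_header` is a hypothetical helper, and UTF-8 is only an assumed default for the file.

```python
import csv
import locale

# The encoding that open() uses when none is passed explicitly (platform dependent)
print(locale.getpreferredencoding(False))


def read_header(csv_path, encoding="utf-8"):
    """Return the first row of a csv file, decoding with an explicit encoding."""
    with open(csv_path, "r", encoding=encoding, newline="") as f:
        return next(csv.reader(f))
```

Passing the encoding explicitly makes the result the same on Windows and on an Ubuntu-type OS, which would rule the environment in or out as the cause.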
Good morning @NickleDave! Unfortunately I was unable to run even the first few lines:
In [1]: import crowsetta
In [2]: scribe = crowsetta.Transcriber(format='csv')
In the first case, I kept running into errors of "so and so module not found", even though I often had the module in question installed. I tried massaging the code a bit to direct it towards the correct paths, but eventually I found that simply copy-pasting all the "missing" modules to the current directory at least seemed to fix the issue for now. When running the
In [2]: scribe = crowsetta.Transcriber(format='csv')
line, I receive the error "ValueError: specified vocal annotation format, csv, not installed, and no configuration was specified. Either install format, or specify configuration by passing as the 'config' argument to Transcriber"
Can you give some insight into what exactly installing a vocal annotation format would entail? I went into the Python source file and added a print statement
print(formats._INSTALLED)
in an attempt to see what the program considered valid vocal annotation formats, but I cannot see the print statement's output on the console. ipython also seems not to react at all to any changes made to the source file. As a secondary issue, do you know how I could overcome this problem? I'm assuming it has to do with the changes not being loaded and registered, although my attempts at setting up automatic reloading have not had an effect.
Additionally, thank you so much for taking the time to help me with these issues! In spite of the various unexpected difficulties encountered while trying to run the program on my end, I am very grateful that you are actively helping me through them :)
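On the reloading question, a minimal sketch of how an already-imported module can pick up edits made on disk, using importlib.reload from the standard library. The module here (`scratch_mod`) is a throwaway written to a temporary directory purely for illustration; nothing in it is crowsetta-specific.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_reload():
    """Import a throwaway module, edit it on disk, and reload it."""
    with tempfile.TemporaryDirectory() as tmp:
        mod_path = Path(tmp) / "scratch_mod.py"
        mod_path.write_text("VALUE = 1\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            import scratch_mod
            before = scratch_mod.VALUE
            # Edit the source on disk; the running session does not see this yet
            mod_path.write_text("VALUE = 2  # edited on disk\n")
            importlib.reload(scratch_mod)
            after = scratch_mod.VALUE
        finally:
            # Clean up so the throwaway module does not linger in the session
            sys.path.remove(tmp)
            sys.modules.pop("scratch_mod", None)
    return before, after
```

In IPython, `%load_ext autoreload` followed by `%autoreload 2` does something similar automatically. Either way, a reload only helps if the file being edited is the one actually imported, so it is worth printing the module's `__file__` attribute first.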
Hi @marichard123 -- glad to help, @yardencsGitHub and I are happy that people are using the software, and we're excited about what you're working on
I'm a little bit confused about why you wouldn't be able to import crowsetta though -- it's a dependency of vak so it should be installed in your conda env.
Before you spend a bunch of time hacking crowsetta, let's figure out why it's not working as expected.
Below is a checklist with things we can do to troubleshoot.
Can you please try each item, and for each item reply with a separate comment?
Please include in the replies the exact commands you enter, and the entire output in the console including full stack traces, verbatim.
crowsetta. You'll want to make sure you're not in that directory, so that you're not accidentally importing the local copies. This would prevent you from replicating the error
import crowsetta above
(vak-env) PS C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline> conda env export > environment.yml
vak-env-test-bug -- and verify that even in this new environment you get the exact same error
C:\You> conda create -n vak-env python==3.8
C:\You> conda activate vak-env
(vak-env) C:\You> pip install torch===1.7.1 torchvision===0.8.2 -f https://download.pytorch.org/whl/torch_stable.html
(vak-env) C:\You> pip install vak==0.4.0.dev1
(vak-env) C:\You> pip install tweetynet
Technically we are now on 0.4.0.dev4 but please don't install that one yet, just use 0.4.0.dev1 so we can get to the root of the bug.
If you just can't get enough troubleshooting, you could try to create a new env with the latest dev version installed (call it, say, "vak040dev4") and then see if you still get the errors. But I really doubt that's the source of the issue -- we were running 0.4.0.dev1 just fine.
If this doesn't help us work out what's going on, maybe we can have a quick Zoom meeting. But let's see what you find out.
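A quick way to check for accidentally imported local copies is to ask Python where it would load a module from. This is only a sketch; `module_origin` is a hypothetical helper built on importlib.util.find_spec from the standard library.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin
```

Calling `module_origin("crowsetta")` from the same session should point somewhere under the env's site-packages; a path in the current working directory would suggest a copied file is shadowing the installed package, and None would mean this interpreter cannot see the package at all.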
Hi again @marichard123 -- just following up to say it occurred to me that I should be able to use the files you shared to test whether I can replicate the error on Windows
I will do that in the next couple of days
That won't help us figure out quite what's going on with your set-up though. Not trying to rush you but please do go ahead and reply as I asked above whenever you have time.
Hello! Here is each of the steps that I have taken. I confirmed that I was already working inside the conda environment; the command "conda init powershell" gave:
no change C:\Users\Richard\Anaconda3\Scripts\conda.exe
no change C:\Users\Richard\Anaconda3\Scripts\conda-env.exe
no change C:\Users\Richard\Anaconda3\Scripts\conda-script.py
no change C:\Users\Richard\Anaconda3\Scripts\conda-env-script.py
no change C:\Users\Richard\Anaconda3\condabin\conda.bat
no change C:\Users\Richard\Anaconda3\Library\bin\conda.bat
no change C:\Users\Richard\Anaconda3\condabin\_conda_activate.bat
no change C:\Users\Richard\Anaconda3\condabin\rename_tmp.bat
no change C:\Users\Richard\Anaconda3\condabin\conda_auto_activate.bat
no change C:\Users\Richard\Anaconda3\condabin\conda_hook.bat
no change C:\Users\Richard\Anaconda3\Scripts\activate.bat
no change C:\Users\Richard\Anaconda3\condabin\activate.bat
no change C:\Users\Richard\Anaconda3\condabin\deactivate.bat
no change C:\Users\Richard\Anaconda3\Scripts\activate
no change C:\Users\Richard\Anaconda3\Scripts\deactivate
no change C:\Users\Richard\Anaconda3\etc\profile.d\conda.sh
no change C:\Users\Richard\Anaconda3\etc\fish\conf.d\conda.fish
no change C:\Users\Richard\Anaconda3\shell\condabin\Conda.psm1
no change C:\Users\Richard\Anaconda3\shell\condabin\conda-hook.ps1
no change C:\Users\Richard\Anaconda3\Lib\site-packages\xontrib\conda.xsh
no change C:\Users\Richard\Anaconda3\etc\profile.d\conda.csh
no change C:\Users\Richard\Documents\WindowsPowerShell\profile.ps1
No action taken.
The exact commands I entered + the error traceback:
(base) PS C:\Users\Richard> conda activate vak-env
(vak-env) PS C:\Users\Richard> cd Documents\Fall_2021\Bat_Stuff\TweetynetPipeline
(vak-env) PS C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline> ipython
Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import crowsetta
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-af422012dd85> in <module>
----> 1 import crowsetta
ModuleNotFoundError: No module named 'crowsetta'
In [2]:
The full conda environment, with the zip file attached at the bottom: name: vak-env channels:
Trying a new virtual environment with a different name (vak-env-test-bug), I get the exact same error:
In [1]: import crowsetta
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-af422012dd85> in <module>
----> 1 import crowsetta
ModuleNotFoundError: No module named 'crowsetta'
In [2]:
thank you for doing all that @marichard123, it helps me see what's going on
I think the first issue is that I assumed ipython would be installed in your environment, but it's not. Sorry!
(It's a dev dependency for vak so I did have it installed 😬)
Somewhat confusingly, conda will happily start ipython from the base environment without telling you.
If you do
(vak-env) PS C:\Users\Richard> which ipython
then I think you will get some path that is not inside C:\Users\Richard\Anaconda3\envs\vak-env, probably it's the one in base instead.
That's why it can't "see" crowsetta.
I can also tell because it's a different Python (3.8) from the one you have installed in your env (3.6).
To fix, please do:
(vak-env) PS C:\Users\Richard> conda install ipython
Then try this again:
(vak-env) PS C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline> ipython
In [1]: import crowsetta
In [2]: scribe = crowsetta.Transcriber(format='csv')
In [3]: annots = scribe.from_file('PipelineCSVOutput.csv')
and please let me know what you get in that case.
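If which is not available in your shell (PowerShell's own equivalent is Get-Command), the same check can be done from inside the Python session itself. A small sketch, assuming a conda env where sys.prefix is the env directory; `interpreter_in_env` is a hypothetical helper name.

```python
import sys
from pathlib import Path


def interpreter_in_env(env_dir):
    """True if the running interpreter's prefix is the given environment directory."""
    return Path(sys.prefix).resolve() == Path(env_dir).resolve()


# Where this session is actually running from
print(sys.executable)
print(sys.prefix)
```

If `interpreter_in_env(r"C:\Users\Richard\Anaconda3\envs\vak-env")` is False inside ipython, then ipython was started from a different environment, and packages installed in vak-env will not be importable from it.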
We have found the issue! The encoding error was a result of vak attempting to open a .notmat file as a .csv file. In our initial CSV annotation file, the "annotation file" column (the sixth column) contained the paths of .mat files. We changed our CSV annotation file so that this column instead contains the path of the CSV annotation file itself. As an example, every row in the sixth column now contains the path string: "C:\Users\Richard\Documents\Fall_2021\Bat_Stuff\TweetynetPipeline\PipelineCSVOutput.csv" This simple change completely fixes the aforementioned encoding problem.
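For anyone hitting the same thing, a rough sketch of a sanity check on that column before running prep. The column name "annot_path" is an assumption based on the Annotation attributes printed earlier in this thread, and `unexpected_annot_paths` is a hypothetical helper, not something crowsetta or vak provides.

```python
import csv


def unexpected_annot_paths(csv_path, column="annot_path", expected_suffix=".csv"):
    """Return the distinct values in `column` that do not end with `expected_suffix`.

    The column name and expected suffix are assumptions for illustration.
    """
    with open(csv_path, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        paths = {row[column] for row in reader}
    # Compare as plain strings so Windows-style paths work on any OS
    return sorted(p for p in paths if not p.lower().endswith(expected_suffix))
```

An empty list means every row points at a csv file; anything else (for example a .mat path) is a file that would be opened as text and could trigger the decode error.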
Going to close this original issue as fixed -- others referenced above addressed the root issue
When running the prep stage, at the point after which I believe the spectrograms are created, I get the following error message (I have attached the full error traceback at the end of the message):
  File "c:\users\richard\anaconda3\envs\vak-env\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 164: character maps to <undefined>
Initially I thought it was a problem of Python not being set to the correct encoding, so inside the cp1252.py file I tried manually setting the encoding to ANSI and then to UTF-8 (the encodings the text files were created in), with no success. I then noticed that the second half of the file contains a decoding table, a tuple in which the character codes are all listed by hand. Among them are entries mapping to "undefined", and byte 0x90 is indeed one of them. In other words, it seems to me that this is not a case of 0x90 being undefined because of an encoding mismatch; rather, 0x90 is hard-coded to map to "undefined", and the problem lies in whatever file the program is reading from. I'm not sure how to identify the file that's causing the problem, or how to pinpoint what exactly is causing the 0x90 to appear in it. Have you run into a similar error during development, or would you have any insight into the nature of the problem?
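The observation about the decoding table can be confirmed without opening cp1252.py, by trying to decode each possible byte value on its own. This is just an illustrative sketch; `undefined_bytes` is a made-up helper name.

```python
def undefined_bytes(encoding="cp1252"):
    """Return the single-byte values that the given codec refuses to decode."""
    undefined = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            undefined.append(value)
    return undefined


print([hex(b) for b in undefined_bytes()])
```

Byte 0x90 shows up in that list for cp1252, so any file containing it (binary files very often do) will raise this error when opened as cp1252 text.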
Full error traceback: