BUG: Issue converting raven.txt file to simple-seq

sfcooke96 commented 6 months ago

Hi there @NickleDave,

I'm running the following on a Mac with crowsetta v5.0.1.

I tried using the following script (suggested here: https://github.com/yardencsGitHub/tweetynet/issues/223) to convert my raven.txt files to simple-seq for use with vak and tweetynet.

import crowsetta
import numpy as np

example = crowsetta.data.get('raven')
raven = crowsetta.formats.bbox.Raven.from_file(example.annot_path, annot_col='Species')
annot = raven.to_annot()
onsets_s = []
offsets_s = []
labels = []
for bbox in annot.bboxes:
    onsets_s.append(bbox.onset)
    offsets_s.append(bbox.offset)
    labels.append(bbox.label)
onsets_s = np.array(onsets_s)
offsets_s = np.array(offsets_s)
labels = np.array(labels)
simpleseq = crowsetta.formats.seq.SimpleSeq(
    onsets_s=onsets_s,
    offsets_s=offsets_s, 
    labels=labels,
    annot_path='/dummy/path'
)
simpleseq.to_csv('example-data.csv')

After running this I got:

AttributeError: 'SimpleSeq' object has no attribute 'to_csv'
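When a method name from an older snippet no longer exists, one generic way to find the current name is to list the object's callable attributes. The sketch below is plain Python and uses a hypothetical stand-in class (`StandInSeq`) rather than the real crowsetta class, so the method names it prints are illustrative only.

```python
class StandInSeq:
    """Hypothetical stand-in for an annotation object; NOT the crowsetta class."""

    def to_annot(self):
        return "annot"

    def to_file(self, path):
        return path


def list_methods(obj, prefix="to_"):
    """Return the names of callable attributes on obj that start with prefix."""
    return [
        name
        for name in dir(obj)
        if name.startswith(prefix) and callable(getattr(obj, name))
    ]


methods = list_methods(StandInSeq())
print(methods)
```

Calling the same helper on a real `SimpleSeq` instance should show which `to_*` methods the installed version provides.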

I adjusted the script slightly (the raven = ... line and the simpleseq.to_file(...) call) to the following:

import crowsetta
import numpy as np

example = crowsetta.data.get('raven')
raven = crowsetta.formats.bbox.raven.Raven.from_file(example.annot_path, annot_col='Species')
annot = raven.to_annot()
onsets_s = []
offsets_s = []
labels = []
for bbox in annot.bboxes:
    onsets_s.append(bbox.onset)
    offsets_s.append(bbox.offset)
    labels.append(bbox.label)
onsets_s = np.array(onsets_s)
offsets_s = np.array(offsets_s)
labels = np.array(labels)
simpleseq = crowsetta.formats.seq.SimpleSeq(
    onsets_s=onsets_s,
    offsets_s=offsets_s, 
    labels=labels,
    annot_path='/Users/training_data'
)

simpleseq.to_file("data.csv")

I have 10 .txt files in my directory (more than 15 rows per file) to convert to simple-seq format, but the resulting output is only the following (this is the complete file):

onset_s,offset_s,label
154.387792767,154.911598217,EATO
167.526598245,168.17302044,EATO
183.609636834,184.097751553,EATO
250.527480604,251.160710509,EATO
277.88724277,278.480895806,EATO
295.52970757,296.110168316,EATO

I tried adjusting this line of the code above:

raven = crowsetta.formats.bbox.raven.Raven.from_file(example.annot_path, annot_col='Species')

by changing annot_col to 'Annotation' (the header for the annotation column in my .txt files) and received the following output:

(tweetynet) Stephens-MacBook-Pro:HAV_TN_Training stephencooke$ python test.py 
Traceback (most recent call last):
  File "/Users/stephencooke/Library/CloudStorage/OneDrive-UniversityofArizona/Tweetynet/HAV_TN_Training/test.py", line 6, in <module>
    raven = crowsetta.formats.bbox.raven.Raven.from_file(example.annot_path, annot_col='Annotations')
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/crowsetta/formats/bbox/raven.py", line 107, in from_file
    df = RavenSchema.validate(df)
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/api/pandas/model.py", line 306, in validate
    cls.to_schema().validate(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/api/pandas/container.py", line 375, in validate
    return self._validate(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/api/pandas/container.py", line 404, in _validate
    return self.get_backend(check_obj).validate(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/backends/pandas/container.py", line 97, in validate
    error_handler = self.run_checks_and_handle_errors(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/backends/pandas/container.py", line 172, in run_checks_and_handle_errors
    error_handler.collect_error(
  File "/Users/stephencooke/miniconda3/envs/tweetynet/lib/python3.9/site-packages/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: column 'annotation' not in dataframe
   Selection           View  Channel  begin_time_s  end_time_s  low_freq_hz  high_freq_hz Species
0          1  Spectrogram 1        1    154.387793  154.911598       2878.2        4049.0    EATO
1          2  Spectrogram 1        1    167.526598  168.173020       2731.9        3902.7    EATO
2          3  Spectrogram 1        1    183.609637  184.097752       2878.2        3975.8    EATO
3          4  Spectrogram 1        1    250.527481  251.160711       2756.2        3951.4    EATO
4          5  Spectrogram 1        1    277.887243  278.480896       2707.5        3975.8    EATO

I've attached example data, the Python script, and the output file here: troubleshooting.zip

Another question while we're here: will training the model on simple-seq annotations restrict the predicted annotations to onset - offset borders without including high and low frequency bounds? I'm interested because I was hoping to estimate frequency ranges with the output data. Apologies if I'm misunderstanding how prediction output will be formatted.

Thanks for your help!

NickleDave commented 6 months ago

Hi @sfcooke96!

Thank you for providing a detailed bug report and the zip with a couple samples to test with. :pray:

I think I might have confused you with my snippet on the other issue.

When you use your data, you'll want to specify the path to those files as the first argument to crowsetta.formats.bbox.Raven.from_file, like so:

crowsetta.formats.bbox.Raven.from_file(
    'troubleshooting/data1.txt'
)

I was able to do this and load the file without issue.
You don't need to specify annot_col, since the annotation column in your files has the default name for Raven (the example data we have is from a dataset that uses a different name for its annotations column). Seems like we handle extra columns gracefully (I guess I programmed the class better than I thought :smirk: ).
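To check which header a given file actually uses for its annotation column, it can help to print the header row before loading. This is a minimal sketch using only the standard library; it assumes the Raven selection table is tab-separated, and both the `raven_columns` helper and the sample file contents are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def raven_columns(path):
    """Return the column headers of a tab-separated Raven selection table."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        return next(reader)


# Hypothetical one-row file standing in for a real Raven .txt export.
sample = (
    "Selection\tView\tChannel\tBegin Time (s)\tEnd Time (s)\t"
    "Low Freq (Hz)\tHigh Freq (Hz)\tAnnotation\n"
    "1\tSpectrogram 1\t1\t154.39\t154.91\t2878.2\t4049.0\tEATO\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(sample)
    columns = raven_columns(path)

print(columns)
```

If the last header printed for your own file is something other than the default, that name is the value to pass as annot_col.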

You'll also need to loop over all your files and save each of them with a separate name, so you don't overwrite the previous one you saved.
Please try this short script and see if you get separate files, each with the appropriate number of rows.

import pathlib

import crowsetta
import numpy as np

# this is where we get our files from
src_dir = pathlib.Path('./troubleshooting')
# next line: sorted because glob order can differ between operating systems, see
# https://www.vice.com/en/article/zmjwda/a-code-glitch-may-have-caused-errors-in-more-than-100-published-studies
src_txt_files = sorted(src_dir.glob('*.txt'))

# this is where we save the files (so we don't overwrite the originals)
dst_dir = pathlib.Path('./annots-simple-seq')
dst_dir.mkdir(exist_ok=True)

# to save ourselves from a typo
assert dst_dir != src_dir

for txt_file in src_txt_files:
    print(
        f"Converting Raven file to simple-seq format: {txt_file}"
    )
    annot = crowsetta.formats.bbox.Raven.from_file(
        txt_file
    ).to_annot()

    onsets_s = []
    offsets_s = []
    labels = []
    for bbox in annot.bboxes:
        onsets_s.append(bbox.onset)
        offsets_s.append(bbox.offset)
        labels.append(bbox.label)
    onsets_s = np.array(onsets_s)
    offsets_s = np.array(offsets_s)
    labels = np.array(labels)
    simpleseq = crowsetta.formats.seq.SimpleSeq(
        onsets_s=onsets_s,
        offsets_s=offsets_s, 
        labels=labels,
        annot_path='/dummy/path/doesnt/matter/here'
    )
    dst_txt_file = dst_dir / txt_file.name
    print(
        f"Saving converted simple-seq file: {dst_txt_file}"
    )
    simpleseq.to_file(dst_txt_file)
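To confirm that each converted file has the expected number of rows, a quick check along these lines may help. It is a sketch under a few assumptions: the source files are tab-separated with one header line and one row per annotation, the converted files are comma-separated with one header line, and converted files keep the source file name as in the script above. The helper names `count_rows` and `compare_row_counts` and the sample data are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def count_rows(path, delimiter):
    """Count non-empty data rows (excluding the header) in a delimited file."""
    with open(path, newline="") as f:
        rows = [row for row in csv.reader(f, delimiter=delimiter) if row]
    return max(len(rows) - 1, 0)


def compare_row_counts(src_dir, dst_dir):
    """Map each source file name to (source rows, converted rows or None)."""
    report = {}
    for src in sorted(Path(src_dir).glob("*.txt")):
        dst = Path(dst_dir) / src.name
        n_dst = count_rows(dst, ",") if dst.exists() else None
        report[src.name] = (count_rows(src, "\t"), n_dst)
    return report


# Made-up data: data1.txt was converted, data2.txt was not.
raven_header = "Selection\tBegin Time (s)\tEnd Time (s)\tAnnotation\n"
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "troubleshooting"
    dst_dir = Path(tmp) / "annots-simple-seq"
    src_dir.mkdir()
    dst_dir.mkdir()
    (src_dir / "data1.txt").write_text(
        raven_header + "1\t1.0\t1.5\tEATO\n2\t2.0\t2.5\tEATO\n"
    )
    (src_dir / "data2.txt").write_text(raven_header + "1\t3.0\t3.5\tEATO\n")
    (dst_dir / "data1.txt").write_text(
        "onset_s,offset_s,label\n1.0,1.5,EATO\n2.0,2.5,EATO\n"
    )
    report = compare_row_counts(src_dir, dst_dir)

for name, (n_src, n_dst) in report.items():
    print(f"{name}: source rows={n_src}, converted rows={n_dst}")
```

A `None` in the second position means no converted file was found for that source file; a mismatch between the two counts is a hint that something was dropped or overwritten.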

Just let me know if you have any questions about what this is doing!
Happy to share the ~five things I've managed to learn about Python and just keep recycling :stuck_out_tongue_winking_eye:

> Another question while we're here

Re: the TweetyNet model, please see my reply on the issue on the TweetyNet repo: https://github.com/yardencsGitHub/tweetynet/issues/223#issuecomment-1905024896

sfcooke96 commented 6 months ago

@NickleDave, thank you - this solution seems to have worked! On to prepping, training, and predicting.

Thanks a lot for your active support here! 🙏

NickleDave commented 5 months ago

Of course, glad to hear it's working @sfcooke96!
I will go ahead and close this issue.

vocalpy / crowsetta

BUG: Issue converting raven.txt file to simple-seq #261