pratyushasharma / sw-combinatoriality

Dataset and Codebase for paper on "Contextual and Combinatorial Structure in Sperm Whale Vocalisations"

Adding csv with Tempo, Ornamentation, Rhythm, creating txt-based "dialogue" #2

Open morganrivers opened 1 month ago

morganrivers commented 1 month ago

Hi,

I wanted to use the data from the paper, but I saw that a lot of the useful categorization had not really been implemented on all the data. I didn't change any of your code, just added some code and datasets building on your stuff.

First, I created an augmented CSV that includes the Rhythm, Ornamentation, and Tempo for each coda, so future researchers can access those values directly.

Also, I like the book of whale PDFs, but I thought it would be nice to condense the existing plots you generate into a text-based format. Of course it's more useful for training LLMs, but it's also good for human readability in some ways. So I made a text-based dialogue that looks like this (I also attached a screenshot of the part of the book that this text corresponds to):

File: sw061b
Whale 1:  r3  r5  c3 \c3 -c3 -c3 \c3.
In chorus, whales 1, 2: -c3  a3.
In chorus, whales 1, 2: -c3  C4.
In chorus, whales 1, 2: /c3  a4.
Whale 2: /a4.
In chorus, whales 1, 2:  c4  a5.
In chorus, whales 1, 2: \c4  a4.
In chorus, whales 1, 2: -c4 \a4.
In chorus, whales 1, 2: \c4  E5.
Whale 2:  a4 /a4.
Whale 1: /c4.
Whale 2: -a4.
Whale 1: -c4.
Whale 2: /a4.
Whale 1: -c4.
In chorus, whales 1, 2: \c4 \a4.
In chorus, whales 1, 2: \c4 \a4.
Whale 2: /a4 \a4 -a4.
Whale 1:  r5.
In chorus, whales 1, 2: \R5 -a4.

(No vocalizations, 25 seconds)

Whale 2:  a3.
In chorus, whales 1, 2:  r5  a4.
In chorus, whales 1, 2:  r4  a3.
In chorus, whales 1, 2:  R5 -a3.
In chorus, whales 1, 2:  r4 -a3.
Whale 1: /r4  r5.
In chorus, whales 1, 2:  c3  r5.
Whale 1: -c3  Q4.

The leading /, -, or \ indicates rubato, the letter identifies the rhythm (a->0, ..., r->17, so 18 possible rhythms), capitalization indicates an ornament, and the number gives the tempo, 1 through 5.
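For anyone who wants to read the text format back into structured values, here is a minimal sketch of a parser for the notation described above. The names `Coda`, `parse_token`, and `parse_line` are hypothetical and not part of the files I added. The sketch keeps the rubato symbol as a raw character rather than interpreting it, treats a token with no leading symbol as having no rubato marker, and does not try to assign codas within a chorus line to individual whales.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Optional rubato symbol, one rhythm letter a-r (either case), one tempo digit 1-5.
TOKEN_RE = re.compile(r"([/\\-]?)([A-Ra-r])([1-5])")


class Coda(NamedTuple):
    """Hypothetical container for one token of the text format."""
    rubato: Optional[str]  # "/", "-", "\\", or None when no symbol is present
    rhythm: int            # 0..17, from the letters a..r
    ornamented: bool       # True when the rhythm letter is capitalized
    tempo: int             # 1..5


def parse_token(token: str) -> Coda:
    """Parse a single token such as '\\c3', '-a4', or 'E5.'."""
    cleaned = token.strip().rstrip(".")  # the last token on a line ends with '.'
    match = TOKEN_RE.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"not a coda token: {token!r}")
    rubato, letter, tempo = match.groups()
    return Coda(
        rubato=rubato or None,
        rhythm=ord(letter.lower()) - ord("a"),
        ornamented=letter.isupper(),
        tempo=int(tempo),
    )


def parse_line(line: str) -> Tuple[str, List[Coda]]:
    """Split 'Whale 1: r3 r5.' into the speaker label and its parsed tokens."""
    speaker, _, body = line.partition(":")
    return speaker.strip(), [parse_token(tok) for tok in body.split()]
```

For example, `parse_line("Whale 2: /a4 \\a4 -a4.")` gives the label `"Whale 2"` and three codas that all have rhythm 0 and tempo 4 and differ only in their rubato symbol.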

I converted the whole dataset into this format.

You can look at the two Python files and the CSV and txt files I added for more specifics.

morganrivers commented 1 month ago

Having looked at this in more detail, the sequence of the pickle files does not seem to perfectly match the chronological sequence of the timestamped whale data, so while the "script" generally matches the book of whale PDFs, it does have some subtle issues. A colleague and I have been working on re-interpreting the raw ICIs into rhythm and ornamentation categories, and have trained a very small transformer to predict ICIs. That work lives in a separate repository, whale-gpt, which should soon produce its own version of the script, but more accurately.
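The ordering problem can be checked with something like the sketch below. It assumes each coda has a numeric timestamp in seconds, listed in the order the codas come out of the pickle files. The function names are hypothetical, and this is not the code from either repository.

```python
from typing import List, Sequence


def out_of_order_positions(timestamps: Sequence[float]) -> List[int]:
    """Return every position whose timestamp is earlier than the one before it."""
    return [
        i for i in range(1, len(timestamps))
        if timestamps[i] < timestamps[i - 1]
    ]


def chronological_order(timestamps: Sequence[float]) -> List[int]:
    """Return the indices that put the codas in timestamp order.

    sorted() is stable, so codas with equal timestamps keep their original order.
    """
    return sorted(range(len(timestamps)), key=lambda i: timestamps[i])
```

An empty result from `out_of_order_positions` means the stored sequence is already chronological. Otherwise, the indices from `chronological_order` can be used to reorder the codas before writing the script.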

Incidentally, if you could provide any more data with timestamps, that would be really amazing! LLMs are of course very data hungry. We would love to have more click data (critically, tagged with the originating whale and the timestamp of each click or coda).