Real syllables being labeled as "noise" (point label = -1)

ArKornreich commented 7 months ago

Hi!

First of all, thank you so much for your earlier help!

I have gotten everything to run smoothly up to this point, but when I open my dataset in the app post-segmentation, I find a large number of syllables labelled as noise. Given that I did a careful and pretty thorough job of noise reduction before introducing my recordings into pykanto, there ought not to be much noise left to disregard. Is there anything I can do about this?

Thank you so much!

Ar K

nilomr commented 7 months ago

Hi Ar,

Great to hear!

In this context, 'Noise' just refers to data that hasn't been assigned to a cluster. Pykanto utilizes UMAP and HDBSCAN for a preliminary classification, which you can then adjust interactively. The number of data points without cluster membership depends on:

i) The nature of the data, ii) How you configure the dimensionality reduction algorithm, and iii) The parameters used for clustering.

There's no universal set of parameters that works well for all situations. This is because some datasets may defy the assumptions made by each algorithm, and in many cases, discrete population-wide categories might not exist.
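When trying different settings, it can help to put a number on how much of the data ends up unassigned. Below is a minimal sketch, not part of pykanto: `noise_summary` is a hypothetical helper, and the two label lists are made-up stand-ins for the output of two clustering runs. It only assumes that unassigned points carry the label -1, as in the issue title.

```python
from collections import Counter


def noise_summary(labels):
    """Summarise how many points a clustering run left unassigned.

    `labels` is any sequence of cluster labels. The integer -1 or the
    string "-1" is treated as "no cluster" (the HDBSCAN convention).
    """
    as_text = [str(label) for label in labels]
    counts = Counter(as_text)
    n_total = len(as_text)
    n_noise = counts.get("-1", 0)
    return {
        "n_total": n_total,
        "n_noise": n_noise,
        # Number of distinct labels, not counting the noise label.
        "n_clusters": len(counts) - (1 if n_noise else 0),
        "noise_fraction": n_noise / n_total if n_total else 0.0,
    }


# Hypothetical labels from two runs with different clustering parameters.
run_a = [0, 0, 1, -1, -1, -1, 2, -1]
run_b = [0, 0, 1, 1, -1, 2, 2, 2]

summary_a = noise_summary(run_a)
summary_b = noise_summary(run_b)
print(summary_a["noise_fraction"], summary_b["noise_fraction"])
```

A lower noise fraction is not automatically better: forcing every point into a cluster can produce spurious categories when variation in the data is continuous.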

See:

Also, see this bit from the app notes:

Limitation 2: [...] the clustering process will work increasingly poorly with those [species] that have a large number of very variable elements. This is true of any clustering method: they will fail or produce spurious results if variation in the data is continuous.

If you attach a couple of screenshots of the interactive app I can also try to give you more targeted advice.

Hope that helps — Nilo

ArKornreich commented 7 months ago

Thank you so much for the speedy response!

This helped considerably, deepest thanks!

ArKornreich commented 7 months ago

Shoot! One more question.

Once labeling occurs, is it possible to view/get data from songs as a sequence of these new labels? For instance, if I get syllable/unit clusters A, B, C, D, E, F, and G, is there a way I could see one of the songs in the dataset as CABGDEF or something like this?

Thank you again!

Best,

Ar K

nilomr commented 7 months ago

Yes - here you go! https://gist.github.com/nilomr/fd72373b7c2aaf0a717c151d7afa5244

There is no explicit way to do this in pykanto; the example above is a simple one. The idea is that you end up with common Python data structures (lists, pandas dataframes) so you can do whatever you need with the data while keeping the format standardised.
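For readers who cannot open the gist, here is a rough sketch of the general idea in plain Python. It is not the gist's code and does not use pykanto's API: the field names `song_id`, `onset` and `label`, the helper `label_sequences`, and the example records are all hypothetical. If your labels live in a pandas dataframe, `df.to_dict("records")` yields a list of dictionaries in this shape.

```python
from collections import defaultdict


def label_sequences(units, noise_label="-1", placeholder="?"):
    """Return {song_id: [label, label, ...]} with units ordered by onset.

    `units` is a list of dicts with the (hypothetical) keys
    "song_id", "onset" and "label". Unassigned units (label -1)
    are shown with `placeholder` so the sequence keeps its length.
    """
    by_song = defaultdict(list)
    for unit in units:
        by_song[unit["song_id"]].append(unit)

    sequences = {}
    for song_id, song_units in by_song.items():
        # Sort each song's units by when they start.
        ordered = sorted(song_units, key=lambda u: u["onset"])
        sequences[song_id] = [
            placeholder if str(u["label"]) == noise_label else str(u["label"])
            for u in ordered
        ]
    return sequences


# Made-up example records, deliberately out of order.
units = [
    {"song_id": "song_1", "onset": 0.40, "label": "A"},
    {"song_id": "song_1", "onset": 0.10, "label": "C"},
    {"song_id": "song_1", "onset": 0.75, "label": "B"},
    {"song_id": "song_2", "onset": 0.05, "label": "G"},
    {"song_id": "song_2", "onset": 0.30, "label": -1},
    {"song_id": "song_2", "onset": 0.55, "label": "D"},
]

sequences = label_sequences(units)
as_strings = {song: "".join(seq) for song, seq in sequences.items()}
print(as_strings)  # {'song_1': 'CAB', 'song_2': 'G?D'}
```

Keeping the sequences as lists rather than joined strings is safer when cluster labels are longer than one character.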

nilomr / pykanto

Real syllables being labeled as "noise" (point label = -1) #32