Metadata for sample packs

yaxu commented 3 years ago

Shall we agree a simple metadata format for samples? We discussed before under #15 but it didn't go anywhere, and it's come up again in the forum here: https://club.tidalcycles.org/t/what-if-there-was-a-sample-pack-format-with-metadata-for-superdirt/2746/6

General human-readable description of the sample set
License and copyright information
Per-sample dictionary. Could potentially follow something like this http://motools.sourceforge.net/doc/audio_features.html and/or just an ad hoc lookup table.

It could be json, like this (this is the first json I've ever hand-written)?

{"name": "gabba",
 "description": "Selection of samples taken from the gabba kit from the Techno xpansion board in the Roland JV1080",
 "author": "Alex McLean",
 "license": "CC0",
 "files": {
    "000_0.wav": {"description":"Distorted Kick", "centroid":4000, "keywords": ["kick"]}
 }
}

Would something like that work for SuperDirt @telephon ?
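To make the proposed structure concrete, here is a minimal sketch of how a client could read such a file in Python. The pack is the example above, embedded as a string; `find_by_keyword` is a hypothetical helper for illustration only, not part of SuperDirt or any existing tool.

```python
import json

# The example pack from above, embedded as a string for illustration.
PACK_JSON = """
{"name": "gabba",
 "description": "Selection of samples taken from the gabba kit from the Techno xpansion board in the Roland JV1080",
 "author": "Alex McLean",
 "license": "CC0",
 "files": {
    "000_0.wav": {"description": "Distorted Kick", "centroid": 4000, "keywords": ["kick"]}
 }
}
"""


def find_by_keyword(pack, keyword):
    """Return the file names whose "keywords" list contains `keyword`.

    Hypothetical helper: files without a "keywords" entry never match.
    """
    return [name for name, info in pack["files"].items()
            if keyword in info.get("keywords", [])]


pack = json.loads(PACK_JSON)
print(pack["name"], pack["license"])
print(find_by_keyword(pack, "kick"))
```

Pack-level keys describe the whole set, and everything per-sample lives in the "files" subdictionary, so a reader can ignore any per-sample keys it doesn't understand.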

cleary commented 3 years ago

This looks fine to me, I have no idea what centroid is, but a google will fix that ;)

Is it worth including normalisation gain info as well (per track, or per set)?

yaxu commented 3 years ago

The spectral centroid is a measurement of 'brightness'. I was mainly thinking about structure rather than content: a single file with a simple key-value structure for the sample pack, with a subdictionary for individual files.

But yes, it would be great to include a soft gain adjustment for SuperDirt in there.

yaxu commented 3 years ago

Perhaps the samples could (optionally?) be specified by URL, e.g. to individual samples on https://freesound.org

telephon commented 3 years ago

> Would something like that work for SuperDirt @telephon ?

Yes, it would work very well.

Would the metadata influence the order of indices (the # n parameter) for samples?
Something like a centroid is light and easy. How would we deal with larger datasets, like waveset onsets, for example? Would one supply a path to the data?

yaxu commented 3 years ago

> Would the metadata influence the order of indices (the # n parameter) for samples?

I hadn't thought about it, but nice idea - I think that would make more sense than alphabetical order. Still, metadata could be optional and then alphabetical order would be the default. I don't know what should happen if there is metadata for some samples, and not others - maybe best to ignore the ones without metadata in that case.
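A rough sketch of that ordering rule in Python, under some assumptions: the function name `order_samples` is made up for illustration, the order of keys in the "files" dictionary is taken as the intended `n` order (Python's `json` module preserves key order, although JSON itself does not promise one), and an empty or missing dictionary counts as "no metadata".

```python
def order_samples(filenames, metadata_files=None):
    """Return the list of samples in the order that `n` would index them.

    filenames: sample names found on disk, in any order.
    metadata_files: the "files" dict from the metadata, or None.
    """
    if not metadata_files:
        # No metadata (None or empty): default to alphabetical order.
        return sorted(filenames)
    on_disk = set(filenames)
    # Metadata present: follow its key order. Files on disk that the
    # metadata does not list are ignored, as are listed files that are
    # missing on disk.
    return [name for name in metadata_files if name in on_disk]


print(order_samples(["b.wav", "a.wav"]))
print(order_samples(["a.wav", "b.wav", "c.wav"], {"c.wav": {}, "a.wav": {}}))
```

If relying on key order feels too fragile, an explicit list or a per-file index field would be an alternative; that is a design question this sketch does not settle.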

That makes me think that it could be nice to put the whole thing in a list, so you could have more than one sampleset in one subfolder. Then people could organise things in the filesystem how they want, and a file could have a path gabba/kicks/fruity_3.wav

> Something like a centroid is light and easy. How would we deal with larger datasets, like waveset onsets, for example? Would one supply a path to the data?

Yes perhaps a path relative to the sample folder, encouraging a loose standard like a data/ folder?

telephon commented 3 years ago

> Still, metadata could be optional and then alphabetical order would be the default.

Currently, the default is what pathMatch returns from the file system. That is probably an ambiguous order that may well depend on the file system itself rather than only the file names. Ordering alphabetically might change things for some users, but I am not sure.

> That makes me think that it could be nice to put the whole thing in a list, so you could have more than one sampleset in one subfolder. Then people could organise things in the filesystem how they want, and a file could have a path gabba/kicks/fruity_3.wav

Do you mean a list of json metadata entries?

Currently, using loadSoundFiles you can load any kind of folder structure: you can just pass in a path with wildcards or make the list of paths yourself. It is only loadSoundFileFolder which is at the lowest level.

Should we try to keep the metadata as close as possible to the samples?

cleary commented 3 years ago

I'm sitting down to begin this, and I think we need to make some stronger decisions, in particular surrounding copyright and licensing accuracy (which is what this endeavour is primarily in aid of, if I'm reading/understanding correctly?)

Debian has a decent manual on its machine-readable copyright specification, and the things I think we should take away from it are:

1. Different files may come from different sources, with different copyright and licenses - therefore we should be able to specify by file or groups of files.

2. All samples must be accounted for by the copyright stanzas

Here's my first attempt at doing my flbass samples, which have the samples in the root directory (which is a little uncommon as I'm finding out, but makes it easier to create collections in a git repo using git-submodules):

{"name": "flbass",
  "description": "Fretless electric bass samples - tuned to C with various attack styles, releases, plucking hand positions and harmonics",
  "folders": {
    "./": {
      "copyright": "2021 Bernard Gray <bernard.gray@gmail.com>",
      "license": "CC0",
      "website": "https://github.com/cleary/samples-flbass",
      "files": {
        "00_c2_finger_long_neck.wav": {
          "description":"fingerstyle neck position, 8s natural release", 
          "pitch":"C2",                                            
          "keywords": ["bass", "guitar", "tuned"] 
        },
        "01_c2_finger_short_neck.wav": {
          "description":"fingerstyle neck position, 0.5s short release", 
          "pitch":"C2", 
          "keywords": ["bass", "guitar", "tuned"] 
        }
      }
    }
  }
}

In the case of them being in one or more subfolders, "./" could be modified appropriately, for example "./jv1080", but is flexible enough that the path could also be left as "./*" and the file names could be prefixed with folder paths, eg "./jv1080/sample1.wav"
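The "all samples must be accounted for" rule could be checked mechanically. Here is a minimal sketch in Python, assuming the "folders" layout from the example above; `unaccounted_samples` is a hypothetical name, a stanza only counts if it carries both "copyright" and "license", and the "./*" wildcard form is not handled.

```python
import posixpath


def unaccounted_samples(paths_on_disk, metadata):
    """Return sample paths that no complete folder stanza lists.

    paths_on_disk: sample paths relative to the pack root, e.g. "a.wav"
                   or "jv1080/sample1.wav".
    metadata: the parsed metadata dict with a "folders" entry.
    """
    covered = set()
    for folder, stanza in metadata.get("folders", {}).items():
        # Assumption: a stanza without copyright AND license covers nothing.
        if "copyright" not in stanza or "license" not in stanza:
            continue
        for name in stanza.get("files", {}):
            # Normalise "./" + name and "./jv1080" + name to plain
            # relative paths so they compare equal to the paths on disk.
            covered.add(posixpath.normpath(posixpath.join(folder, name)))
    normalised = (posixpath.normpath(p) for p in paths_on_disk)
    return sorted(p for p in normalised if p not in covered)


example = {
    "folders": {
        "./": {
            "copyright": "2021 Bernard Gray <bernard.gray@gmail.com>",
            "license": "CC0",
            "files": {"00_c2_finger_long_neck.wav": {}},
        }
    }
}
print(unaccounted_samples(
    ["00_c2_finger_long_neck.wav", "01_c2_finger_short_neck.wav"], example))
```

An empty result would mean every sample on disk is listed under some stanza with licensing information.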

It'd also be nice to have the keywords be inheritable so I can define them at the folder level AND at the file level.
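One possible reading of "inheritable", sketched in Python: the folder-level keywords come first, then any file-level ones that aren't already present. The folder-level "keywords" key and the `effective_keywords` helper are both assumptions for illustration; neither appears in the example above.

```python
def effective_keywords(folder_stanza, filename):
    """Folder-level keywords followed by file-level ones, without duplicates."""
    file_info = folder_stanza.get("files", {}).get(filename, {})
    merged = []
    for kw in folder_stanza.get("keywords", []) + file_info.get("keywords", []):
        if kw not in merged:
            merged.append(kw)
    return merged


folder = {
    "keywords": ["bass", "guitar"],  # hypothetical folder-level keywords
    "files": {
        "00_c2_finger_long_neck.wav": {"keywords": ["tuned", "bass"]},
        "01_c2_finger_short_neck.wav": {},
    },
}
print(effective_keywords(folder, "00_c2_finger_long_neck.wav"))
print(effective_keywords(folder, "01_c2_finger_short_neck.wav"))
```

With this merge rule a file can add to what it inherits but not remove from it; whether that is desirable is open.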

What are your thoughts? I'm pretty new to designing specs (I use them a bit) so I'm going to be coming from a very novice perspective.

> Would the metadata influence the order of indices (the # n parameter) for samples?

I really like this idea - I've raised/contributed to related issue discussions on a couple of plugins re file ordering inconsistencies: https://github.com/musikinformatik/SuperDirt/issues/131

yaxu commented 3 years ago

@MrMebelMan contacted me asking about dirt-samples licensing for a foxdot oriented linux distro project that looks very nice. They have been collecting samples for this, it would be great to pool resources and maybe collaborate on a cross-language samplepack format.

iShapeNoise commented 3 years ago

Hi, I am working on the samples of @MrMebelMan 's FoxDot branch "killa_features". The idea to collaborate on a sample pack with a certain naming convention and meta-data standard is appealing.

How about having a loose conversation via Telegram or similar?

We can brainstorm about it and see where it goes.

e.g. at the moment I use 000_name_drumkit/author/category.wav >> 001_HiHatOpen_Sndauthor.wav or 002_BassDist_DnB.wav or 003_SnareClosed_PearlV1.wav. Is there a better version of a naming convention that makes it human-readable when skipping through samples?

What do you think about it?

Cheers

charlieroberts commented 3 years ago

just pinging this as I'm really interested for gibber as well

yaxu commented 3 years ago

Ok I've made a new (currently empty) repository that will ultimately become the default samples for Dirt/SuperDirt: https://github.com/tidalcycles/Clean-Samples/

It would be awesome if this could be a nexus for collaboration with other systems like gibber and foxdot, I'd be happy to move it to a shared org then. But this stuff has been hanging around for a while and I think best to make a start!

So let's start collecting some choice sample banks there, and then discuss things like file structure, metadata, normalisation etc. in the issues on that repository. Sound good?

telephon commented 3 years ago

Excellent! I can add the corresponding reader methods in SuperDirt.

yaxu commented 3 years ago

Thinking a bit more, it would be nice to have a utility that reads a folder of samples and writes a metadata file that a passing human can then edit with license info and so on.

I was thinking of making this in a ubiquitous scripting language (python?) but then thought maybe it could also be done in SuperDirt. It could read a file if it's present, default any missing info, then write the structure back.

Then again if we want this format shared with gibber, then a python utility would be a good move as a reference implementation that doesn't require a sclang installation. I think I'll start something in python..
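As a starting point, the read / default / write-back loop could look something like this. It is only a sketch under assumptions: the file name `metadata.json`, the function name `scaffold_metadata`, and the set of default keys are placeholders, and it uses the flat "files" layout from the first example rather than the later "folders" variant.

```python
import json
from pathlib import Path

# Placeholder defaults for a human to fill in afterwards (assumption).
DEFAULTS = {"description": "", "author": "", "license": ""}


def scaffold_metadata(folder, filename="metadata.json"):
    """Read the metadata file if present, default missing info, write it back."""
    folder = Path(folder).resolve()
    meta_path = folder / filename
    # Start from the existing file so that human edits are preserved.
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    meta.setdefault("name", folder.name)
    for key, value in DEFAULTS.items():
        meta.setdefault(key, value)
    files = meta.setdefault("files", {})
    # Add an empty entry for every .wav that is not listed yet.
    for wav in sorted(folder.glob("*.wav")):
        files.setdefault(wav.name, {"description": "", "keywords": []})
    meta_path.write_text(json.dumps(meta, indent=2) + "\n")
    return meta
```

Calling `scaffold_metadata("path/to/gabba")` twice should be harmless: the second run only adds entries for samples that appeared since the first one and leaves edited fields alone.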

yaxu commented 3 years ago

Well, there are a lot of related issues around, so let's try to continue the conversation on the Clean-Samples issue tracker. I see Charlie's already there: https://github.com/tidalcycles/Clean-Samples/issues

tidalcycles / Dirt-Samples

Metadata for sample packs #20