Repository scope? - Githubissues

mhpob commented 8 months ago

After recent discussions with @jdpye, @chrisholbrook, @benjaminhlina, @franksmithxyz, and others regarding surimi (#1), glatos, VR2 VRL and binary formats, VR3 formats, and others, it seems like there is appetite to create some sort of open repository of characteristic types. As this repository is still in the concept stage, I want to put forth a few ideas and discussion points related to its intent and scope to see if it can fill this space.

This should be viewed as a draft to be edited, added to, knocked back, or refuted outright -- nothing is critical and things noted here could already be in place elsewhere or merit separate, dedicated support. It will be clear to anyone reading that I am blatantly inserting my own hopes and dreams within the myopic view of my own experiences. I'll update everything below as comments come in if this becomes a worthwhile forum.

Statement of problem

Multiple acoustic telemetry data formats exist across vendors, networks, and investigators
There is currently no open clearinghouse of these data formats
Data formats have been shown to be ephemeral and to disappear from common knowledge following industry-standard updates
Data can be considerable in size, so including them within a package precludes that package from some repositories
An open repository of data formats is necessary for reproducible package and workflow testing and development

Intended scope

From #1: "this ... will serve as a repo for general industry-standard file types and ensuring that we or anyone who cares to can handle them properly"
Language agnostic: Provide examples of current and legacy data formats, forms, and schema; and associated file metadata
R-specific: Provide a CI/CD-controlled branch for use as a drat/R-Universe repository (see #1)

Data types to be included (high-level)

Raw
- Vemco/Innovasea (Current: VRL, VDAT; Legacy: binary, text)
- Lotek
- ThelmaBiotel
- Sonotronics
Derived
- Vemco/Innovasea ("non-truth" VRL, CSV)
- OTN matched/unmatched/qualified/unqualified/other networks
- GLATOS and glatos
- ETN and etn
- IMOS
- Actel
- Deployment data, various forms
Forms
- OTN/FACT/MATOS/ETN/GLATOS metadata forms (multiple versions)
Schema
- OTN (multiple across Geoserver and exports)
- GLATOS

Possible structure (v0.2, mirroring the above)

demo-data/
  |-- Raw/
     |-- Vendor1/
          |-- InstrumentA/
               |-- version 1.0/
                    |-- file123.ext
                    |-- file123.md (metadata: markdown? XML?)
                    |-- CITATION.cff (how to cite the data source)
          |-- ...
     |-- Vendor2/
     |-- ...
  |-- Derived/
     |-- Network1/
     |-- Network2/
     |-- ...
  |-- Schema/
  |-- Forms/

Is it better to organize according to network rather than data type?
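
If each data file is meant to travel with a sibling metadata file, as in the `file123.ext` / `file123.md` pairing of the v0.2 layout, that convention could be checked mechanically. The following is only a minimal sketch in Python: the function name, the `.md` choice (still an open question above), and the list of non-data files are assumptions for illustration, not anything the repo currently provides.

```python
from pathlib import Path

# Assumption: these describe the data rather than being data themselves.
NON_DATA_NAMES = {"CITATION.cff"}
NON_DATA_SUFFIXES = {".md"}


def find_missing_metadata(root):
    """Return data files under `root` that lack a sibling `<stem>.md` file."""
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.name in NON_DATA_NAMES or path.suffix in NON_DATA_SUFFIXES:
            continue
        # file123.ext is expected to sit next to file123.md
        if not path.with_suffix(".md").exists():
            missing.append(path)
    return missing
```

Calling `find_missing_metadata("demo-data")` would return the paths of any data files without a metadata sibling, which a CI job could then report on a pull request.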

jdpye commented 8 months ago

I endorse the structure for the filesystem that you're laying out here. Network is probably correct as the higher-order folder for derived dataset formats: groups like GLATOS combine data in a single omnibus workbook, while other networks split them into separate files, all needing their own examples.

mhpob commented 8 months ago

@jdpye the first draft didn't respect the spacing of the outlined repo structure, so I put it into a code block. I also went into some sub-directories and shifted note to "v0.2". Is it still agreeable as it's now outlined?

benjaminhlina commented 8 months ago

This looks great @mhpob and the overall repo structure makes sense to me. The little telemetry-workflow repo I made to help students in the Cooke lab follows a similar structure to the one you've put forth here, allowing the user to find things based on vendor or network, which I like.

chrisholbrook commented 7 months ago

+1 for overall structure (structure around vendor/network). Our early scoping of a "characteristic" raw file set for GLATOS was rather daunting due to the number of possible combinations of receiver model, firmware, code map, transmitter options (e.g., various sensors), receiver options (e.g., internal transmitter settings, various receiver sensors), offload software, preference for small files, and desire to include files with errors/issues. Our next step (WIP) is to create a table/list of desired characteristics, then go out in search of files that meet each.
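
A table of desired characteristics like the one described here could double as a coverage check: enumerate the wanted combinations, then list those that no catalogued file satisfies. The sketch below is illustrative only; the characteristic names, their values, and the file names are all made up, and a real table would have more dimensions.

```python
from itertools import product

# Hypothetical characteristics and values, for illustration only.
DESIRED = {
    "receiver_model": ["model_a", "model_b"],
    "firmware": ["fw_1", "fw_2"],
    "has_sensor_tags": [True, False],
}


def uncovered_combinations(desired, catalog):
    """Return desired combinations that no catalogued file satisfies."""
    keys = list(desired)
    have = {tuple(entry.get(k) for k in keys) for entry in catalog}
    return [
        dict(zip(keys, combo))
        for combo in product(*(desired[k] for k in keys))
        if combo not in have
    ]


# Hypothetical catalogue of files already in hand.
catalog = [
    {"file": "example_1.vrl", "receiver_model": "model_a",
     "firmware": "fw_1", "has_sensor_tags": True},
    {"file": "example_2.vrl", "receiver_model": "model_b",
     "firmware": "fw_2", "has_sensor_tags": False},
]

wanted = uncovered_combinations(DESIRED, catalog)
```

Here `wanted` holds the combinations still to be found, which is the list one would go searching for, or publish as a request for contributions.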

mhpob commented 7 months ago

> ...number of possible combinations of receiver model, firmware, code map, transmitter options (e.g., various sensors), receiver options (e.g., internal transmitter settings, various receiver sensors), offload software, preference for small files, and desire to include files with errors/issues

Great idea to include various iterations. Does seem rather daunting as it immediately makes this kind of project quite large... possibly beyond the capability of a GitHub repo. The benefit of an open repo would be the ability to crowd-source some of these files via pull requests while maintaining an open record of the transaction.

> Our next step (WIP) is to create a table/list of desired characteristics, then go out in search for files that meet each.

Currently the "searching" may be the bear that leans a lot on your time and energy. Might there be merit to putting the list out there via something like this and seeing what is submitted to you?

mhpob commented 7 months ago

GitHub recommends keeping repos ideally under 1 GB, and strongly recommends staying under 5 GB, for performance reasons. Going beyond that may also wind up requiring some hands-on management and git-fu: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github
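
To keep an eye on those thresholds as files are contributed, something like the following Python sketch could run locally or in CI. It is only an approximation under stated assumptions: it sums the size of files in a directory tree, which is not the same as the size of the Git history, and it treats 1 GB as 10**9 bytes. The function names and status labels are made up for illustration.

```python
from pathlib import Path

GB = 10 ** 9            # assumption: decimal gigabytes
IDEAL_LIMIT = 1 * GB    # "ideally under" threshold
STRONG_LIMIT = 5 * GB   # "strongly recommended" threshold


def directory_size(root):
    """Total size in bytes of all regular files under `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def size_status(n_bytes):
    """Classify a byte count against the two thresholds."""
    if n_bytes < IDEAL_LIMIT:
        return "ok"
    if n_bytes < STRONG_LIMIT:
        return "large"
    return "too large"
```

For example, `size_status(directory_size("demo-data"))` would flag when the example data alone starts to approach the recommended limits.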

chrisholbrook commented 7 months ago

So I guess my question is which files, specifically, do you/we really want here? E.g., the GLATOS data system has >20,000 VRL files. So identifying an optimal set will require first identifying desired characteristics/features.

mhpob commented 7 months ago

That's a really critical question, and highlights that I completely left any transmitter examples off of that repo structure. My initial thought is that a complete data example library (individual files for receiver x transmitter x software options) is the holy grail, but the possibility of that is questionable at best.

Highly desirable on "history of science" principles
Having one representative file each already puts us over 5K files using combinations of the off-the-cuff iterations you noted above
- Based on size alone, may be beyond the scope of something hosted on GitHub, and so beyond the scope of this repo
However, since transmitters are really only viewed through the lens of the receiver and decoder that logged them, receiver x firmware combinations should drastically reduce the number of files needed as long as a transmitter type was detected
- May put us back in the size/scope of a GitHub repo

What I think we're getting at is a metadata question -- can we design applicable metadata to not only log what we do have, but log its deficiencies? I.e., can we stand up something that's good-enough, but "not let the perfect be the enemy of the good"?

data-repo v.0.0.1 has a VRL that contains X and Y but not Z as noted in the metadata.
Scientist A provides a PR that has a VRL with X, Y, and Z.
Cue v0.1.0 or whatever the appropriate versioning is that has the new VRL and metadata
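
One way to picture metadata that logs deficiencies as well as contents is a record with both a "contains" and a "lacks" set, so that a submitted file can be compared against what is already held. This is a minimal sketch only; the class, its field names, and the file names are hypothetical, and X, Y, and Z stand in for whatever features end up mattering.

```python
from dataclasses import dataclass, field


@dataclass
class FileMetadata:
    """Hypothetical metadata record for one example file."""
    path: str
    contains: set = field(default_factory=set)
    lacks: set = field(default_factory=set)


def gaps_filled(existing, submitted):
    """Features the submitted file has that the existing file is logged as lacking."""
    return existing.lacks & submitted.contains


# The scenario above: the current VRL has X and Y but not Z.
current = FileMetadata("example.vrl", contains={"X", "Y"}, lacks={"Z"})
proposed = FileMetadata("example_from_pr.vrl", contains={"X", "Y", "Z"})

filled = gaps_filled(current, proposed)
```

A non-empty `filled` would be the signal that a PR adds something new and merits a version bump.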

jdpye commented 7 months ago

I swear i was authoring a reply that featured the phrase 'don't let the perfect be the enemy of the good' and i let it languish in a tab.

I agree that we should take the files we're currently leaning on for testing glatos / surimi / remora / TelemetryWorkflow / etc... and then we allow users to supply extra files that don't yet exist on a needs basis and roll them into the mix. If we outgrow a regular repo there's a GitHub LFS option or there are other WAF-ish things we could try. But for now this has the right mix of 'others can suggest updates' and 'we know how it versions things and generally how it works' to be a reasonable solve.

mhpob commented 7 months ago

@jdpye I know I've already stepped all over your toes here, but would you accept PRs following the guidance above to start fleshing this thing out?

mhpob commented 7 months ago

Also, since the OTNDC is basically one big metadata factory -- any views on what that structure should be? Tabular is human readable but maybe not the most efficient; XML is all over the place but it's not super approachable (at least to me); JSON might be a compromise that would also slot into an API; some CI/CD that takes one and creates the others?
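
As a rough illustration of the "takes one and creates the others" idea, the sketch below treats JSON as the source of truth and derives a tabular (CSV) view from it using only the Python standard library. The record fields are invented for the example, and it assumes flat records with no nesting.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a JSON array of flat objects to CSV text."""
    records = json.loads(json_text)
    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


# Hypothetical metadata records.
example = '[{"file": "a.vrl", "format": "VRL"}, {"file": "b.csv", "notes": "n"}]'
table = json_to_csv(example)
```

A CI step could regenerate the tabular file whenever the JSON changes, so only one format is ever edited by hand.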

jdpye commented 7 months ago

I don't mind a PR one bit, I just picked a poor week to take vacation. :)

I'm also a big yaml fan.

mhpob commented 7 months ago

Possibly useful reference; R-centric: https://music.dataobservatory.eu/documents/open_music_europe/dataset-development/dataset-working-paper.html

mhpob commented 7 months ago

Re: building a package in another repo based on changes in this one https://medium.com/hostspaceng/triggering-workflows-in-another-repository-with-github-actions-4f581f8e0ceb
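
One mechanism GitHub provides for this is the `repository_dispatch` event, sent with a POST to the REST endpoint `/repos/{owner}/{repo}/dispatches`; whether the linked article uses exactly this approach is not confirmed here. The Python sketch below only builds the request and does not send it. The owner, repo, event name, and payload are placeholders, and the receiving repo would need a workflow with `on: repository_dispatch`.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, token, event_type, payload=None):
    """Build (but do not send) a repository_dispatch request."""
    url = f"https://api.github.com/repos/{owner}/{repo}/dispatches"
    body = json.dumps(
        {"event_type": event_type, "client_payload": payload or {}}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder values; nothing here refers to a real repo or token.
request = build_dispatch_request(
    "some-owner", "some-package-repo", "TOKEN", "data-updated",
    {"ref": "main"},
)
```

Passing `request` to `urllib.request.urlopen` from a workflow in the data repo would then notify the package repo to rebuild.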

ocean-tracking-network / latitudes

Repository scope? #2
