Archive metadata and licensing --> js discussion

eminence commented 9 years ago

For each archive, we need a standard way to record some metadata with the archive. At the moment, the most important thing to include is licensing information, but we may find other information that we would like to require.

This issue is to track the discussion on this topic. Below is a draft proposal, with two examples. All aspects of this proposal are open for discussion.

Metadata should be stored in a file called _Metadata.json. The name is designed so that it'll appear near the top of directory listings.
The JSON object is a dictionary with the following keys:
- title -- Provides a name for the archive
- description -- A more verbose description, if needed
- source -- A list of URLs where this data came from
- license -- An array of dictionaries listing the relevant licenses. Each has the following keys:
  - summary -- a brief summary of the license
  - source -- Where to find the license/legal terms in full
- last_synched -- an ISO 8601 timestamp indicating the last time this archive was updated
I think that, to start, "license" and "title" should be required; the others can be optional.
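
A minimal sketch of how a tool might check a file against this draft, assuming Python and only the standard library. The function name `check_metadata` and the example archive are hypothetical, and treating both `summary` and `source` as expected in every license entry is one reading of the draft, not something the proposal settles.

```python
import json

REQUIRED_KEYS = ("title", "license")   # required in the draft above
LICENSE_KEYS = ("summary", "source")   # keys listed for each license entry


def check_metadata(text):
    """Return a list of problems found in the text of a _Metadata.json file."""
    meta = json.loads(text)
    if not isinstance(meta, dict):
        return ["top-level value must be a dictionary"]
    problems = [f"missing required key: {key}"
                for key in REQUIRED_KEYS if key not in meta]
    licenses = meta.get("license", [])
    if not isinstance(licenses, list):
        return problems + ["license must be an array of dictionaries"]
    for i, entry in enumerate(licenses):
        for key in LICENSE_KEYS:
            if not isinstance(entry, dict) or key not in entry:
                problems.append(f"license[{i}] is missing {key}")
    return problems


# Hypothetical archive, not one of the examples linked from this issue.
EXAMPLE = json.dumps({
    "title": "Example archive",
    "source": ["http://example.org/data"],
    "license": [{"summary": "CC BY-SA 3.0",
                 "source": "http://creativecommons.org/licenses/by-sa/3.0/"}],
    "last_synched": "2015-03-14T00:00:00Z",
})

print(check_metadata(EXAMPLE))
print(check_metadata('{"description": "x"}'))
```

Unknown keys are deliberately not reported, since the draft leaves room for more fields later.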

For two concrete examples, see the metadata for #23 and the metadata for #18

Other thoughts:

Should the metadata include maintainer information?
Should the metadata include the script/tool that was used to sync/update the archive? Might be useful if the current maintainer goes away.

CC #5 for related discussion

davidar commented 9 years ago

:+1:

However, instead of inventing our own format, ideally we could use an existing standard. For example:

davidar commented 9 years ago

I think we should separate metadata into two categories:

1. machine-readable only, such as timestamps and hashes, which human end-users aren't likely to care about
2. both machine- and human-readable, such as descriptions and licenses

For (1) I'm perfectly happy to just dump a (hidden) .metadata.json in the root directory, with whatever format is used by the tool used to update the archive.

For the second, I think we should use either (or both):

- human-readable HTML + machine-readable tags
- human-readable Markdown (or similar) + machine-readable YAML header (like Jekyll uses)

under the conventional README and LICENSE filenames. Personally I'm in favour of Markdown+YAML, and we can include a copy of the markdown viewer webapp

To answer your other questions:

Should the metadata include maintainer information?

Yes, I'd say to include this in the license: e.g. "Original source blah, processed and uploaded to IPFS by blah"

Should the metadata include the script/tool that was used to sync/update the archive? Might be useful if the current maintainer goes away.

Definitely, I think it's even been suggested to put a copy of the tool within the archive itself. IPFS de-duplication means this has no more overhead than a link.

eminence commented 9 years ago

What would be the purpose of including hashes? IPFS itself will ensure data integrity.

Something like Markdown or YAML sounds fine. I'd rather not use HTML, because HTML is not very friendly if you don't have a web browser to render it.

davidar commented 9 years ago

What would be the purpose of including hashes? IPFS itself will ensure data integrity.

Some protocols (like rsync) support checking the hash of a remote file to see if it has changed. I'm basically talking about any metadata that the update tool can use internally to make its job easier.
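
As a rough illustration of that kind of internal bookkeeping, here is a sketch in Python that assumes the update tool keeps a path-to-hash map from the previous sync (for instance inside the hidden .metadata.json) and can obtain the current hashes from the source. The function names and the choice of SHA-256 are assumptions for the example, not something rsync or IPFS prescribes.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest used as the recorded fingerprint of a file."""
    return hashlib.sha256(data).hexdigest()


def paths_to_update(recorded: dict, remote: dict) -> list:
    """Paths whose remote hash is new or differs from the recorded one."""
    return sorted(path for path, digest in remote.items()
                  if recorded.get(path) != digest)


# Hashes stored at the end of the previous sync (hypothetical paths).
recorded = {"a.txt": sha256_hex(b"one"), "b.txt": sha256_hex(b"two")}
# Hashes reported by the source now: b.txt changed and c.txt is new.
remote = {"a.txt": sha256_hex(b"one"),
          "b.txt": sha256_hex(b"two, revised"),
          "c.txt": sha256_hex(b"three")}

print(paths_to_update(recorded, remote))
```

Only the files returned here would need to be fetched again; the rest can be left untouched.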

Something like Markdown or YAML sounds fine. I'd rather not use HTML, because HTML is not very friendly if you don't have a web browser to render it.

Agreed. Specifically I'm proposing something like:

README.md:

---
title: arXiv
source: http://arxiv.org/
authors:
  - arXiv contributors
  - IPFS archivists
updated: 2015-03-14
---
This is a mirror of the [Creative Commons](http://creativecommons.org)
subset of [arXiv](http://arxiv.org).

Yada yada

LICENSE.md:

---
license: http://creativecommons.org/licenses/by-sa/3.0/
title: CC-BY-3.0
morePermissions: blah
attributionURL:
  - http://arxiv.org
  - http://ipfs.io
attributionName: arXiv, IPFS
---
You are free to:

    Share — copy and redistribute the material in any medium or format
    Adapt — remix, transform, and build upon the material 

Yada yada

We also need to account for the fact that archives may have different licenses for different parts, in which case I'd suggest placing a separate LICENSE file into the relevant directories.
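
One way a consumer could interpret per-directory licenses is "the closest LICENSE file at or above the file wins". The sketch below assumes that rule, which the thread does not spell out, and the function name `governing_license` is hypothetical.

```python
from pathlib import Path
from typing import Optional

LICENSE_NAMES = ("LICENSE", "LICENSE.md")


def governing_license(file_path: Path, archive_root: Path) -> Optional[Path]:
    """Return the closest license file at or above file_path, inside archive_root."""
    root = archive_root.resolve()
    current = file_path.resolve().parent
    if root != current and root not in current.parents:
        raise ValueError(f"{file_path} is not inside {archive_root}")
    while True:
        for name in LICENSE_NAMES:
            candidate = current / name
            if candidate.is_file():
                return candidate
        if current == root:
            return None  # reached the top of the archive without finding one
        current = current.parent
```

With a top-level LICENSE and a second one in `sub/`, a file under `sub/deep/` would resolve to the license in `sub/`, and everything else to the top-level one.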

@eminence @jbenet Thoughts?

jbenet commented 9 years ago

:-1: on frontmatter. i think it confuses most people.
a package.jsonld based on OKFN's or npm's would probably work well.
we should try to use existing formats here if possible
typical to include purely verbatim license files

davidar commented 9 years ago

:-1: on frontmatter. i think it confuses most people.

Fair enough. I meant it as a more readable alternative to the HTML microformats recommended by Creative Commons (and many others), which are even more confusing.

a package.jsonld based on OKFN's

:+1: Thanks, that looks even better.

or npm's would probably work well.

:-1: Yeah... I'm not drinking the NodeJS kool-aid ;)

we should try to use existing formats here if possible typical to include purely verbatim license files

Sorry, I should have provided a reference, as I'm not the first person to propose something like this:

http://blog.martinfenner.org/2013/06/29/metadata-in-scholarly-markdown/

but I agree that it's not exactly widespread (yet :).

jbenet commented 9 years ago

:-1: Yeah... I'm not drinking the NodeJS kool-aid ;)

Well, the OKFN data-package.json is directly derived from npm's package.json.

It turns out that node is one of the best programming systems out there, thanks to npm. npm got so much extremely right. The assumption that "it's js, it has to be bad" is so absurdly wrong. It beats go get/vendor, cabal, gem, and so on. cargo promises to be in the same ballpark, mostly because it copied npm in all the important things.

http://blog.martinfenner.org/2013/06/29/metadata-in-scholarly-markdown/

The problem with frontmatter is that it makes processing the files very annoying, particularly in APIs. I like it as a writer, but not a programmer.
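
To make the processing cost concrete, this is roughly the extra step every consumer needs before it can read the Markdown body. It is a standard-library Python sketch with a hypothetical function name; the header it returns is still raw YAML, so a real tool would also need a YAML parser, which Python's standard library does not include.

```python
def split_frontmatter(text: str):
    """Split a document into (header, body); header is None without frontmatter."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return None, text  # no closing delimiter: treat the whole file as body


doc = "---\ntitle: arXiv\nupdated: 2015-03-14\n---\nThis is a mirror.\n"
header, body = split_frontmatter(doc)
print(repr(header))
print(repr(body))
```

A plain README with a separate JSON file avoids this step entirely, which is the trade-off being argued here.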

davidar commented 9 years ago

I know this isn't the right place for this discussion, but I'll bite. I haven't used npm much, so I may be missing something, but looking at the spec, nothing particularly novel jumps out at me. It just looks like all the standard package fields, but in JSON.

Don't get me wrong, JavaScript actually ranks reasonably highly on my list compared to a lot of alternatives. But this current trend that JavaScript is the solution to every problem, and somehow solves it better than every other programming language, is frankly ridiculous. People complain about Haskell monads being painful, and yet callback hell is the best thing since sliced bread. Green threads have been around for a long time, and other languages have done a lot more in getting concurrency right. Don't even get me started on atom (1GB+ of ram for a text editor, seriously?).

whyrusleeping commented 9 years ago

(1GB+ of ram for a text editor, seriously?).

~17KB baby! anything more is bloat.

whyrusl+  5126  4.6  0.2 198740 16996 pts/3    S+   10:37   0:00 vim repo/fsrepo/fsrepo.go

eminence commented 9 years ago

+1 on the data-packages format. My original proposal is fairly similar to this, so it matches up pretty well with what I had in mind

davidar commented 9 years ago

@eminence @jbenet Ok, so I'm thinking we should have:

an OKFN datapackage.json file,
a verbatim LICENSE file (either in the top-level directory, or sub-directories in the case of multiple licenses), and
a standard README(.md) file containing any lengthy descriptions, etc.
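
A small sketch of how that layout could be checked mechanically, assuming Python. The function name is hypothetical, and accepting a LICENSE file anywhere below the root is a loose reading of "top-level directory, or sub-directories".

```python
from pathlib import Path


def missing_archive_files(root: Path) -> list:
    """List the parts of the proposed layout that are absent under root."""
    missing = []
    if not (root / "datapackage.json").is_file():
        missing.append("datapackage.json")
    # A verbatim LICENSE may sit at the top level or in sub-directories.
    if not any(p.is_file() for p in root.rglob("LICENSE")):
        missing.append("LICENSE")
    if not any((root / name).is_file() for name in ("README", "README.md")):
        missing.append("README(.md)")
    return missing
```

An empty list means the archive has all three pieces; it says nothing about whether their contents are valid.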

jbenet commented 9 years ago

@davidar SGTM.

And, not talking about javascript. Talking about npm. This is inconsistent:

:-1: Yeah... I'm not drinking the NodeJS kool-aid ;)
:+1: Thanks, that looks even better.
the OKFN data-package.json is directly derived from npm's package.json.

The point is that the statement "not drinking the <THING> kool-aid" is typical of actively ignoring whatever <THING> is, including anything that may be good and valuable, instead of studying <THING> and dismissing the provably bad parts. I'm really tired of the js-hate, particularly when people make inconsistent or uninformed claims, like dismissing npm without even trying it, or understanding why it is well designed. It is similar to the dismissal that haskell gets from the "hardcore C/C++ systems people" (i.e. because they've not taken the time to understand it).

anyway, yep. not really worth discussing here.

davidar commented 9 years ago

The point is that the statement "not drinking the <THING> kool-aid" is typical of actively ignoring whatever <THING> is, including anything that may be good and valuable, instead of studying <THING> and dismissing the provably bad parts.

@jbenet Alright, I apologise for my wording, I could have phrased it better. For the record, I would have been equally as dismissive about using Python's packaging format for this, or Debian's, or whatever, simply because software packaging and data packaging are different problems. In any case, I can approve of a data packaging format which happens to be derived from a small part of NPM without necessarily approving of NPM as a whole. It's not that I dislike NPM in particular, I just don't see the relevance to data packaging in comparison to any other software packaging format.

Also, despite what people seem to think, I don't hate JS/NPM any more than I hate Python/PyPI (which I use quite often). What I do hate is when people try to apply them to things outside of the domain in which it makes sense to do so ("when all you have is a hammer, everything looks like a nail"). My motto is "all programming languages suck, but some suck less than others in specific circumstances". JS is a good choice for some problems, for others it sucks (e.g. atom, IMHO). Haskell is good for some things, for systems programming it sucks. Python ... you get the idea.

In terms of NodeJS (and last I checked NPM is an official subproject) in particular, it's kind of the embodiment of applying a language to a problem it was never meant to solve. If it were marketed as a scripting language (in the same category as Python), then I wouldn't have a problem with it, but a lot of people make far more overzealous claims about it (yes, it's better than PHP, but there's a lot of other languages that are better still). It would be like trying to run Perl or Fortran in a web browser (please tell me nobody has tried that :). As a result, it makes me automatically skeptical of assertions about its superiority without any supporting facts. You keep telling me NPM is well-designed, and I've used it a little and tried researching myself to understand what you mean, but I'm not seeing anything all that special TBH. Like I said, please elaborate if you think I'm missing something.

Anyway, those were the thoughts I was trying to convey in my somewhat flippant remark :)

rht commented 9 years ago

~17KB baby! anything more is bloat.

neovim ~11KB

rht commented 9 years ago

Consider using spdx for license parsing, see https://github.com/ipfs/go-ipfs/issues/337.
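
For a sense of what SPDX identifiers would buy, here is a sketch in Python that maps a Creative Commons license URL to its SPDX short identifier. The lookup table is a small hand-written sample, not the SPDX license list itself, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Hand-picked sample; a real tool would load the full SPDX license list.
SPDX_BY_CC_PATH = {
    "/licenses/by/3.0": "CC-BY-3.0",
    "/licenses/by-sa/3.0": "CC-BY-SA-3.0",
    "/publicdomain/zero/1.0": "CC0-1.0",
}


def spdx_id_for(url: str):
    """Return the SPDX identifier for a known Creative Commons URL, else None."""
    parts = urlsplit(url)
    if parts.hostname not in ("creativecommons.org", "www.creativecommons.org"):
        return None
    return SPDX_BY_CC_PATH.get(parts.path.rstrip("/"))


print(spdx_id_for("http://creativecommons.org/licenses/by-sa/3.0/"))
```

A machine-checkable identifier like this makes it easy to notice when a license URL and a human-written title disagree.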

jbenet commented 8 years ago

I reopened as https://github.com/ipfs/archives/issues/45 since this turned into js ~~bikeshedding~~ discussion
