ipfs / notes

IPFS Collaborative Notebook for Research
MIT License

Clarify how the use of URI identifiers maps to IPLD graphs #155

Open flyingzumwalt opened 8 years ago

flyingzumwalt commented 8 years ago

If I'm accustomed to using a Linked Data Platform (LDP) approach to model my data, I mint lots of stable identifiers to represent objects in the system. For example, in order to describe a PhD Thesis, I need an identifier for the creative work that is the thesis itself and I also need identifiers for each of the individual files within the thesis. These identifiers need to stay stable over time even if I add, update or remove individual files or metadata -- they are separate from the links that point to the specific content of the files in my thesis. In an LDP context, I can use any algorithm to mint these identifiers as long as (due to the constraints of Linked Data) they are HTTP URIs.

How can I translate these notions to IPLD? Do I still mint those HTTP URI identifiers and then use the cryptographic hashes of those URIs to incorporate them into the IPLD graph? Do I put those hashes into IPNS? How does this work?

Given that, please explain the benefit of using the hash of a (URI) identifier to build links (IPLD) rather than simply using the identifier itself (RDF).

cc @nicola

nicola commented 8 years ago

Here is the way I see the problem. Is this a valid example, @flyingzumwalt?

http://nicola.com/nicola-identity.json
{
  type: "ProfileDocument"
}
http://matt.com/matt-identity.json
{
  type: "ProfileDocument"
}

In this case we have two documents that are identical. If we content-address them, they will have the same identifier (say hash0). Now we have to find a way to distinguish an identifier for nicola from one for matt.

So the problem is having unique identifiers for instances of the same class (in essence, two identical objects with two different names).
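A minimal sketch of the collision described above. It assumes SHA-256 over a canonical JSON serialization as a stand-in for a real content address (IPLD uses multihash-based identifiers over its own encodings), and the `content_address` helper and the `id` field are hypothetical names used only for illustration:

```python
import hashlib
import json


def content_address(doc: dict) -> str:
    """Hash a canonical JSON serialization of doc (a stand-in for a real IPLD hash)."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The two profile documents from the example above have identical content.
nicola_doc = {"type": "ProfileDocument"}
matt_doc = {"type": "ProfileDocument"}

hash_nicola = content_address(nicola_doc)
hash_matt = content_address(matt_doc)
print(hash_nicola == hash_matt)  # True: identical content, identical address

# Assumption: embedding the original URI in the document is one way to
# make the two addresses differ. Nothing in this thread prescribes it.
nicola_named = {"id": "http://nicola.com/nicola-identity.json", **nicola_doc}
matt_named = {"id": "http://matt.com/matt-identity.json", **matt_doc}
print(content_address(nicola_named) == content_address(matt_named))  # False
```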

Are we on the same page?

flyingzumwalt commented 8 years ago

I think I'm asking about something simpler than that. I'm uncertain about how I would map existing Linked Data, which relies heavily on URIs, into IPLD.

Example:

(Note: this dataset is not actually published as Linked Open Data but is, in fact, stored in a Fedora Repository as Linked Data. I couldn't think of a simple LOD data source to grab an example from.)

This URI https://purl.stanford.edu/mm346cv9110 is the unique identifier for a dataset that contains 4 items. Each of those items also has a URI to uniquely identify that item. For example, the URI for the Bioconductor Microbiome Work Files is https://purl.stanford.edu/wh250nn9648. These are identifiers for creative works, which are separate from the file content itself. In fact, there's even one more layer of URIs here because you need an identifier for the individual work files -- that's where you attach metadata like a description of the file.

So there is a whole hierarchy of URIs here:

This last item, the actual csv file, can be identified using the cryptographic hash of the file content itself. That part is obvious. How do all the other identifiers fit into IPLD and how do Linked Data predicates fit into an IPLD approach?

For example, a simplistic model of this dataset would look a bit like this:

'https://purl.stanford.edu/wh250nn9648':
{
  'dc:isPartOf': 'https://purl.stanford.edu/mm346cv9110',
  'dc:hasPart': 'https://stacks.stanford.edu/file/druid:wh250nn9648/metabolites.csv'
}
'https://stacks.stanford.edu/file/druid:wh250nn9648/metabolites.csv':
{
  'dc:description': 'Data for triplet analysis: metabolites.',
  'sha3': '{the hash of the file content}'
}

How does this info translate into IPLD? Do I take the hashes of each of the URIs and predicates and then use those hashes to incorporate the URIs and predicates into the IPLD graph?

Also (this part's a bit of a straw man so you can explain it): if that's the case, doesn't it make the graph a lot clunkier, since it adds an extra layer of lookups to every link in the graph?

e43324b622a0e7995e5778dae5146bc22785a8ee024831196c0027a3c9312471 > https://purl.stanford.edu/wh250nn9648 > dc:isPartOf > 6c926b4a114f99e706ebf0b34e362e65d91b6b6fd7d466b1d4feb9db6f680aee > https://purl.stanford.edu/mm346cv9110

instead of simply

https://purl.stanford.edu/wh250nn9648 > dc:isPartOf > https://purl.stanford.edu/mm346cv9110
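The extra indirection can be made concrete with a small sketch. It assumes SHA-256 of the URI string as the hash (the thread does not say how the hashes above were produced) and a hypothetical in-memory `hash_to_uri` table standing in for whatever store would resolve a hash back to its URI:

```python
import hashlib


def uri_hash(uri: str) -> str:
    """Hash a URI string (assumption: plain SHA-256 of the UTF-8 bytes)."""
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()


work = "https://purl.stanford.edu/wh250nn9648"
dataset = "https://purl.stanford.edu/mm346cv9110"

# RDF style: the triple names both resources directly by URI.
rdf_triple = (work, "dc:isPartOf", dataset)

# Hash style: the triple names both resources by hash, so a separate
# table is needed to get back to the URIs.
hash_to_uri = {uri_hash(u): u for u in (work, dataset)}
hashed_triple = (uri_hash(work), "dc:isPartOf", uri_hash(dataset))


def resolve(triple, table):
    """Turn a hashed triple back into a URI triple: two extra lookups."""
    subject, predicate, obj = triple
    return (table[subject], predicate, table[obj])


print(resolve(hashed_triple, hash_to_uri) == rdf_triple)  # True
```

Each hashed triple costs one additional lookup per resource before the familiar URI form is recovered, which is the clunkiness the question points at.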

sohkai commented 8 years ago

Likely to be of interest for COALA IP: cc @TimDaub. @nicola's problem is also something we'll need to consider.

flyingzumwalt commented 8 years ago

@nicola I think this ticket is mainly outlining a documentation need -- the IPLD spec needs to be accompanied by explanation and examples that show very clearly how URI identifiers from Linked Data get mapped into IPLD. Do you agree, @nicola and @sohkai? If so, could we close this ticket and link to it from an issue about documentation in the IPLD repo?

sohkai commented 8 years ago

@flyingzumwalt I get the impression @nicola will address this as part of the Authenticated RDF spec, after his update here: https://github.com/ipfs/notes/issues/152#issuecomment-239320116. Maybe leave it for now and then when there's a PR made to add the spec, it can close this issue?

flyingzumwalt commented 8 years ago

Sounds good to me.

nicola commented 8 years ago

Sorry, I must have missed the development of this issue. Ok, this issue is part of #152. The questions that you raise @flyingzumwalt, for example:

e43324b622a0e7995e5778dae5146bc22785a8ee024831196c0027a3c9312471 > https://purl.stanford.edu/wh250nn9648 > dc:isPartOf > 6c926b4a114f99e706ebf0b34e362e65d91b6b6fd7d466b1d4feb9db6f680aee > https://purl.stanford.edu/mm346cv9110

instead of simply

https://purl.stanford.edu/wh250nn9648 > dc:isPartOf > https://purl.stanford.edu/mm346cv9110

This is a design choice: you could totally replace the URLs with the hashes. I will write up the work that I have done very soon (with a simple implementation).
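One way to picture that design choice is the sketch below. It is an illustration under stated assumptions, not the mapping defined by the IPLD spec or by the work mentioned above: SHA-256 over canonical JSON stands in for a real content identifier, a link is written as `{"/": <hash>}` following the merkle-link notation used in IPLD examples, the `put` helper, the `store` dict and the `@id` and `content` fields are hypothetical, and the CSV bytes are a placeholder.

```python
import hashlib
import json


def put(store: dict, node: dict) -> str:
    """Store node under the hash of its canonical JSON form and return that hash."""
    data = json.dumps(node, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = node
    return digest


store = {}

# Placeholder bytes: the real metabolites.csv is not reproduced here.
csv_bytes = b"placeholder for the contents of metabolites.csv"
file_hash = hashlib.sha256(csv_bytes).hexdigest()

# The file description keeps its URI as a plain property and links to
# the file content by hash.
file_node = {
    "@id": "https://stacks.stanford.edu/file/druid:wh250nn9648/metabolites.csv",
    "dc:description": "Data for triplet analysis: metabolites.",
    "content": {"/": file_hash},
}
file_node_hash = put(store, file_node)

# The work links to the file description by hash instead of by URL.
# dc:isPartOf stays a URI here: hash links in both directions would
# form a cycle, which a content-addressed graph cannot express.
work_node = {
    "@id": "https://purl.stanford.edu/wh250nn9648",
    "dc:isPartOf": "https://purl.stanford.edu/mm346cv9110",
    "dc:hasPart": {"/": file_node_hash},
}
work_hash = put(store, work_node)

# Following the hash link from the work reaches the file description.
linked = store[store[work_hash]["dc:hasPart"]["/"]]
print(linked["dc:description"])  # Data for triplet analysis: metabolites.
```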

flyingzumwalt commented 8 years ago

Thanks @nicola. Also the new explanations and examples taking form on http://ipld.io/ do a lot to clarify this, but it will be good to have examples and explanation that specifically speak to translating existing Linked Data to IPLD (using JSON-LD).

barmintor commented 8 years ago

Hello,

I'm afraid I don't understand: if the hashes are stored as the identifiers, we have related resources as descriptions, but we frequently refer to the URI as an identifier to indicate that a relationship is independent of the state of the description. How do we mimic the web's provision of a synchronic identifier? That is, how do we look at two hashes and know that the described resource is the same?
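The problem can be shown in a few lines. The sketch assumes SHA-256 over canonical JSON as the hash, a hypothetical `@id` field carrying the stable URI, and made-up description text. It illustrates the question rather than a resolution proposed anywhere in this thread:

```python
import hashlib
import json


def content_address(doc: dict) -> str:
    """Hash a canonical JSON serialization of doc (a stand-in for a real IPLD hash)."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two states of the description of one resource (description text is invented).
v1 = {"@id": "https://purl.stanford.edu/wh250nn9648",
      "dc:description": "Work files"}
v2 = {"@id": "https://purl.stanford.edu/wh250nn9648",
      "dc:description": "Work files (revised)"}

h1 = content_address(v1)
h2 = content_address(v2)

# The hashes differ, and nothing in the hashes themselves says that
# they describe the same resource.
print(h1 == h2)  # False

# Only a stable identifier carried alongside the content ties them together.
print(v1["@id"] == v2["@id"])  # True
```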