incf-nidash / nidmresults-fsl

A python library to export FSL's feat results to NIDM-Results
http://nidm.nidash.org/specs/nidm-results.html
MIT License

How to link up NIDMFSL metadata with other sources #155

Open mih opened 5 years ago

mih commented 5 years ago

First a bit of background. Here is a document on a tstats volume that I get from nidmfsl (I have cut bits that are irrelevant for now):

            {
              "@id": "niiri:d83e9fb8-1e39-42f9-9c46-d408335a0150",
              "@type": [
                "nidm_StatisticMap",
                "prov:Entity"
              ],
              "crypto:sha512": "bbcfe23bf0d12500bf5db37ccca98c45ef6393912680552fb3646c5a50afc02ea023063e2171a0350db8f1a3921fc2c5dc327e0c21e38b123d7df8dd7f673c5b",
              "dct:format": "image/nifti",
              "nfo:fileName": "TStatistic.nii.gz",
              "prov:atLocation": {
                "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
                "@value": "TStatistic.nii.gz"
              },
              ...
            }

This looks great. However, I have more information on this file from other sources:

{
    "@context": {...},
    "@graph": [
      {
        "@id": "6398025b6f926d7d31202e29020331a80093ee89",
        "@type": "Dataset",
        "contentbytesize": 11061996,
        "dateCreated": "2019-05-09T10:10:11+02:00",
        "dateModified": "2019-05-09T10:11:40+02:00",
        "hasContributor": {
          "@id": "40395047096243tr832094023407237402304972"
        },
        "identifier": "dda42d6c-7231-11e9-a901-0050b6902ef0",
        "version": "0-2-g6398025",
        "hasPart": [
          {
            "@id": "MD5E-s384103--fbc66513e45fd3319e17ba74421fb484.nii.gz",
            "@type": "DigitalDocument",
            "name": "down/cope1.feat/stats/tstat1.nii.gz"
            "contentbytesize": 384103,
         },
   ...
}

And I get even more from extractors for the NIfTI data type, from a provenance extractor that tells me which command generated this file, etc. Naturally, I want to connect all of this information. The question is: how can I do this with minimal effort and interference? Without any changes to nidmfsl, I would have to compute SHA512 for every file in my dataset, fish out the documents with a matching crypto:sha512 property, and add some kind of sameAs property pointing to the alternative @id for the same file... Doable, but not nice, given the context (light pun intended ;-).
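A minimal sketch of that brute-force matching, assuming the NIDM documents are available as a flat list of dicts like the one above; nidm_docs and dataset_root are made-up names, not nidmfsl or datalad API:

    import hashlib
    from pathlib import Path

    def sha512_of(path):
        """Hex SHA512 of a file, read in 1 MiB chunks."""
        h = hashlib.sha512()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def link_by_sha512(nidm_docs, dataset_root):
        """Yield (NIDM @id, local file path) pairs with matching content."""
        by_hash = {d["crypto:sha512"]: d["@id"]
                   for d in nidm_docs if "crypto:sha512" in d}
        for path in Path(dataset_root).rglob("*"):
            if path.is_file():
                nidm_id = by_hash.get(sha512_of(path))
                if nidm_id is not None:
                    # here one would emit something like
                    # {"@id": <datalad id for path>, "sameAs": nidm_id}
                    yield nidm_id, path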

The context is: this is a DataLad dataset. All files and their versions, as well as the datasets that contain them, are already uniquely identified. I want DataLad to be able to report, for a given file, the information that I can get from nidmfsl.

I see two ways of making this happen: 1) teach nidmfsl about DataLad and its IDs so it can use them without having to come up with new random ones (I don't think this is the most feasible approach, but it would certainly be an amazing one!); 2) teach nidmfsl not to build a metadata pack, but to leave all files exactly where they are and use relative paths (relative to a given root -- the dataset location in my case) as nfo:fileName. As the metadata extraction is guaranteed to happen on the same content, I can then easily match files by relative path.

This second approach would also solve another issue: I am not interested in the result package with the files renamed to a standardized form. I actually throw it away. DataLad already gives me the means to obtain exactly the file content I need -- once I know the file key -- hence my interest is focused on connecting information to this file key (among other desires). With this information I could have a simple helper generate NIDM result packages purely from the pre-extracted metadata. So at present, properties like "nfo:fileName": "TStatistic.nii.gz" are actually invalid as a description of dataset content.

So it would be nice to have a "just describe" mode, in addition to the "pack and describe" mode that I can currently use.

Does that make sense? I'd love to get pointers on the direction in which I should be looking/moving. It would be great if I could have everything implemented by OHBM.

Thx much!

satra commented 5 years ago

@mih - this is what the hope was for nidm-w to cover, the link between output and input via a process.

draft here: https://docs.google.com/document/d/1OjdvKyjSuLXoPrmH18SPj2Fe1bvkomQjowY7TG-F8MQ/edit -- the example use cases and model components should provide pointers. we are hoping to have a draft out by summer, but feel free to use/refine anything at this point.

in terms of matching IDs, there are two lines of thought:

  1. the id of a file can be its sha, but that's not useful because: a) one would either have to restrict it to a specific sha, or compute all possible shas; b) blank files are a nuisance - they can mean different things even though the object is the same.
  2. give a random id, and merge at the database level, which is our current approach. so yes, we would match crypto-based and use sameAs, except for blank files. i don't know what to do with those yet.

the more complex setup we discussed in nidm-w was that you would have a blob descriptor and a file descriptor. thus you can have the same binary content, but pointed at by different filenames or paths. this is closer to the semantics of content addressable/dedup stores.
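As a rough illustration of that blob/file split (Python dicts with placeholder property names, not the actual nidm-w vocabulary):

    # One content-addressed blob record, referenced by any number of
    # file records that merely add a name/path.
    blob = {
        "@id": "blob:sha512-bbcfe23bf0d12500...",  # identity derived from content only
        "contentbytesize": 384103,
    }
    files = [
        {"@id": "file:1", "name": "TStatistic.nii.gz",
         "describesBlob": blob["@id"]},
        {"@id": "file:2", "name": "down/cope1.feat/stats/tstat1.nii.gz",
         "describesBlob": blob["@id"]},
    ]
    # Two different filenames/paths, one and the same binary content.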

in the ideal world (or even in the datalad world), we would only be running commands on databases shared worldwide and everybody would use the same input id, but in the current world people download datasets, orphan any database info, and just use them.

if you think there is a way to connect the graph without a match operation, i would be interested in hearing that path (but do make the universe slightly larger than datalad :) ).

mih commented 5 years ago

Cool, looking forward to the poster!

if you think there is a way to connect the graph without a match operation, i would be interested in hearing that path (but do make the universe slightly larger than datalad :) ).

I just need one of these two:

  • a checksum that I already have for each file anyway, or
  • the original (non-normalized) file paths to match against.

ATM I have neither: the checksum would need to match the NIDM choice of SHA512 (fine in principle, but relatively slow, as I would have to checksum many files in a dataset just to find a few entries in the metadata report).

And I don't have filenames to match against, because they get forcefully normalized. If the normalization could simply be turned off, I could cheaply transform the metadata report into something that is immediately useful in the DataLad context.

satra commented 5 years ago

@mih - no poster this time around, but do hope to get the draft done.

on the nidm side, i would be completely fine with adding multiple hashes if that helps in the short term. @cmaumet - what do you think?

cmaumet commented 5 years ago

@mih: Exciting to think about combining all those metadata! Here are additional thoughts for your two proposed approaches:

  1. Can you give me more details on how unique IDs are computed in datalad? We are currently using random UUIDs in the NIDM exporters, but we could instead use any alternative that can easily be computed from the file itself.

  2. Some questions/possible issues that come to my mind (there might be more, I'd have to think about this some more :) ):

    • how do we deal with files that are generated by the NIDM exporter but are not natively available in the software output (e.g. the standard error map in SPM has to be computed from the available data)?
    • Is the plan to completely remove the nidm.json file and keep only the "digested" metadata inside datalad?
      • If we do, then it gets harder for people to use those results to perform a meta-analysis (as we basically lose the standardisation aspect of NIDM, which existed through file naming or could be retrieved from nidm.json).
      • If we keep nidm.json, then we have to think about how to deal with files being moved around and/or renamed (as those operations will break the relative paths).

And re satra's comment: multiple hashes indeed seem like a good compromise :) Which one is your favourite flavour?

Let me know what you think!

mih commented 5 years ago

Can you give me more details on how unique IDs are computed in datalad?

We are (re-)using the git annex key format: https://git-annex.branchable.com/internals/key_format/

For any file that is not annex'ed, we use the Git SHA1, but format the ID as if git-annex had made it. The catch is that git-annex allows one to set any custom key backend (git config option annex.backend); by default datalad uses the MD5 hash backend. So, taken together, it looks like this (by default):
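    MD5E-s384103--fbc66513e45fd3319e17ba74421fb484.nii.gz

That is the annex key of the tstat1.nii.gz file from the dataset record above: backend name, size in bytes, MD5 digest, original extension. As a rough sketch of how such a key could be assembled -- md5e_key is a made-up helper, and git-annex's real extension-handling rules are more involved:

    import hashlib
    from pathlib import Path

    def md5e_key(path):
        """Approximate a git-annex MD5E key from a file's content and name."""
        p = Path(path)
        data = p.read_bytes()
        return "MD5E-s{}--{}{}".format(
            len(data), hashlib.md5(data).hexdigest(), "".join(p.suffixes))

    # md5e_key("down/cope1.feat/stats/tstat1.nii.gz")
    #   -> "MD5E-s384103--fbc66513e45fd3319e17ba74421fb484.nii.gz"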

Re-using the hashes that are already used for identification in the version control system makes things fast, and immediately connects to any information known about any file in any part of the system.

Regarding the other aspects:

I think it would be unwise to bend nidmfsl to primarily match the use case of datalad. As far as I understand it, its purpose is to compose a standalone, comprehensive description of an analysis result. The choice of IDs, file name normalization etc all make perfect sense for this use case, hence that should stay as-is. Adding a swarm of additional hashes just makes things heavier, but adds no immediate advantage.

With datalad in the picture, there are two additional use cases.

  1. There is a nidm result pack (zip file) that is part of a datalad dataset. In this case, we would want to report the contained metadata to aid discovery of the NIDM result pack and the files it contains. There is no need to map anything: datalad knows an ID for the zip file, we can associate the metadata in the zip file with the zip file, done. This use case can be supported without any changes, as far as I can see. And it is an important use case (think about datasets with just NIDM result packs...).

  2. The other use case (the one my whining originates from): I am not interested in a NIDM result pack. I want to use the functionality of nidmfsl to describe an analysis output directory tree that I have right in front of me. I want to attach meaningful metadata to myfirstanalysis.feat/stats/zstat1.nii.gz, so that people can find, not a comprehensive description of an analysis, but the dataset that has the actual output -- maybe because I am interested in something completely different in that dataset, but use a NIDM result property to compose a list of interesting datasets.

For this second use case, I am not concerned about the absence of a standard error map, because it is not in the dataset anyway. I am not concerned about not finding a nidm.json file, because I could build one myself with the latest nidmfsl version, given that I found the entire dataset with all its data. This last aspect is absolutely critical IMHO: metadata description standards evolve, and we want to be able to go back to the original data and get a better description of them.

Of course these two use cases are not mutually exclusive. As a datalad user that puts a FEAT output in a dataset, I can decide to include a standardized description of my analysis by running nidmfsl on it. I would do that through datalad run, such that datalad records the provenance of this description generation, and associates it with the state of the dataset.

However, as far as datalad as a tool is concerned, I would not want to enforce the presence of a normalized description as part of the dataset content. Here is the reason: say I am interested in doing some analysis on a large number of FSL analyses. I find them all over the web, and they are all in datalad datasets (because that is what people do ;-). I can link them all as sub-datasets into a new (super) dataset that will have the code and results of my analysis. I do not want to add any content to those 3rd-party datasets. If I did, I would have a unique version of these datasets that I would either need to host myself, push back to 2163 different original locations (after becoming friends with their respective maintainers), or keep to myself (because it is just too complicated otherwise), breaking transparency and reproducibility in some way. What datalad can do is extract nidmfsl metadata on those original datasets and aggregate it into my new super-dataset. This dataset then has all the information needed to perform any data discovery task, and only this dataset needs to be hosted to share ALL of what I did with the target audience.

I anticipate that you might be saying: "why not simply create NIDM result packs for all these analyses and put those into the super dataset?" The reason is that those packs are standalone, and their metadata is self-contained (for good reasons), and that implies that I cannot use them to find the actual analysis dataset that they have been created from -- and this is what datalad needs/wants to do.

How can this be done? I think the cheapest approach is to add an option to nidmfsl that keeps the original path names of the files being described (i.e. no filename normalization). Alternatively, a separate file (outside the result pack) could be generated that contains the mapping original path -> normalized filename. Based on either of those outputs, one should be able to post-process the metadata graph to map its random file IDs to the common ones already used within datalad. I know this is a side-show for nidmfsl, but it would enable cool stuff, I think.
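A sketch of the post-processing either output would enable, assuming such a path map is available; path_map, key_for_path and remap_file_ids are hypothetical names, not nidmfsl or datalad API:

    def remap_file_ids(nidm_graph, path_map, key_for_path):
        """Replace nidmfsl's random file @ids with keys DataLad already knows.

        nidm_graph   -- list of JSON-LD documents produced by nidmfsl
        path_map     -- dict: normalized filename -> original relative path
        key_for_path -- callable: relative path -> DataLad/git-annex key
        """
        for doc in nidm_graph:
            fname = doc.get("nfo:fileName")
            if fname in path_map:
                doc["@id"] = key_for_path(path_map[fname])
        return nidm_graph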

satra commented 5 years ago

@mih - we could add filenames, but i think in general that does not solve the problem. here is an example. let's say i have N bids datasets and i want to run my fsl workflow together with nidmfsl in a container.

singularity/docker -<mountflag> /path/to/dataset:/dataset <container_image>

all the nidm results will be stored inside with paths like /dataset/<sub-id>/path/to/file. given bids, these paths will be the same across datasets, except perhaps for the dataset with the largest number of subjects and for variations in task names. in general, there is no way to uniquely identify things just from the filename, even if it is not normalized. of course, you could sequence the processing so as to know which dataset is being used, but you cannot tell from the result packs alone.

thus cryptographic hashes, i think, are still the most relevant bits here to help connect things.

the reason we chose sha512 is that it is faster to compute than sha256 while addressing some levels of paranoia.

mih commented 5 years ago

all the nidm results will be stored inside

Ah, I think we have found the misunderstanding. You are talking about the use case of nidmfsl creating a description for storage -- that is fine as-is and should not change, including no change to filenames.

I am talking about calling the Python API of nidmfsl "myself", inside a DataLad metadata extractor, in order to suck in the JSON-LD output and ingest it into DataLad's metadata handling. Wherever this runs, it runs inside a DataLad dataset that has the original analysis output directory (no nidmfsl files). And any paths it would report would be inside that dataset (whose location I know).

satra commented 5 years ago

@mih - i see. thanks for clarifying. so you would be running the converter explicitly on a dataset from datalad. yes, for that, generating the filename-to-uuid map would work.

i would still like to solve the general connectivity issue :)

satra commented 5 years ago

@mih - the other way to do this through the API is to explicitly set the hash function when you call it from datalad (or extend it to generate multiple hashes). the increase in time should be ok.

cmaumet commented 5 years ago

Discussed w/ @mih today, I should:

  • be able to set the ids using an input JSON file with key-value pairs being full_path-ID

mih commented 5 years ago

be able to set the ids using an input JSON file with key-value pairs being full_path-ID

Here is a mapping file that I can easily generate for any analysis path that I would then feed to nidmfsl. Would this work for you?

demo_map.json.txt (for the same dataset and 2nd-level analysis as the demo in #148)
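Purely as an illustration of the idea (hypothetical content, not the actual attachment, reusing the annex key from the dataset record earlier in the thread), such a full_path-to-ID map could look something like:

    {
      "down/cope1.feat/stats/tstat1.nii.gz": "MD5E-s384103--fbc66513e45fd3319e17ba74421fb484.nii.gz",
      ...
    }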