mih opened 5 years ago
@mih - this is what the hope was for nidm-w to cover: the link between output and input via a process.
draft here: https://docs.google.com/document/d/1OjdvKyjSuLXoPrmH18SPj2Fe1bvkomQjowY7TG-F8MQ/edit

the example use cases and model components would provide pointers. we are hoping to have a draft out by summer, but feel free to use/refine anything at this point.
in terms of matching IDs, there are two lines of thought:

- sameAs, except for blank files. i don't know what to do with those yet.
- the more complex setup we discussed in nidm-w was that you would have a blob descriptor and a file descriptor. thus you can have the same binary content, but pointed at by different filenames or paths. this is closer to the semantics of content addressable/dedup stores (see the sketch below).
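a minimal sketch of that two-level idea (made-up class names for illustration, not actual nidm-w vocabulary):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlobDescriptor:
    """Identifies binary content only, independent of any name or location."""
    sha512: str
    size: int

@dataclass(frozen=True)
class FileDescriptor:
    """A named occurrence of a blob; the same content can appear under many paths."""
    path: str
    blob: BlobDescriptor
```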
in the ideal world (or even in the datalad world), we would only be running commands on databases shared worldwide and everybody would use the same input id, but in the current world people download datasets, orphan any database info, and just use them.
if you think there is a way to connect the graph without a match operation, i would be interested in hearing that path (but do make the universe slightly larger than datalad :) ).
Cool, looking forward to the poster!
> if you think there is a way to connect the graph without a match operation, i would be interested in hearing that path (but do make the universe slightly larger than datalad :) ).
I just need one of these two: a checksum I can match against, or file names/paths I can match against.

ATM I don't have either. The checksum would need to match the NIDM choice of SHA512 (fine in principle, but relatively slow, as I have to checksum many files in a dataset in order to find a few entries in the metadata report). And I don't have filenames to match against, because they get forcefully normalized. If the latter could just be turned off, I could cheaply transform the metadata report into something that is immediately useful in the datalad context.
@mih - no poster this time around, but do hope to get the draft done.
on the nidm side, i would be completely fine with adding multiple hashes if that helps in the short term. @cmaumet - what do you think?
@mih: Exciting to think about combining all those metadata! Here are additional thoughts for your two proposed approaches:
Can you give me more details on how unique IDs are computed in datalad? We are currently using random UUIDs in the NIDM exporters, but we could use instead any alternative that can easily be computed from the file itself.
Some questions/possible issues that come to my mind (there might be more, I'd have to think more about this :) ):

- Would you get rid of the nidm.json file to keep only the "digested" metadata inside datalad?
- If we point at files by relative path (rather than through the nidm.json), then we have to think about how we deal with files being moved around and/or renamed (as the relative paths will be broken by those operations).

And re: satra: multiple hashes indeed seems like a good compromise :) Which one is your favourite flavour?
Let me know what you think!
> Can you give me more details on how unique IDs are computed in datalad?
We are (re-)using the git annex key format: https://git-annex.branchable.com/internals/key_format/
For any file that is not annex'ed, we use the Git SHA1, but format the ID as if git-annex had made it. The issue is that git-annex allows one to set any custom key backend (git config option annex.backend); by default datalad uses the MD5 hash backend. So taken together, it looks like this (by default):
```
MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f
SHA1-s4--7b5813c6a7ebef887d9dc34812413e64603bc838
```
Re-using the hashes that are already used for identification in the version control system makes things fast, and immediately connects to any information known about any file in any part of the system.
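For illustration, a rough sketch of how such a key can be computed for an annexed file (simplified; real git-annex applies additional rules, e.g. for how much of the extension is kept):

```python
import hashlib
import os

def md5e_key(path):
    """Approximate a git-annex MD5E key: MD5E-s<size>--<md5hex><ext>."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    ext = os.path.splitext(path)[1]  # the MD5E backend keeps the file extension
    return f"MD5E-s{os.path.getsize(path)}--{md5.hexdigest()}{ext}"
```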
Regarding the other aspects:
I think it would be unwise to bend nidmfsl to primarily match the use case of datalad. As far as I understand it, its purpose is to compose a standalone, comprehensive description of an analysis result. The choice of IDs, file name normalization etc all make perfect sense for this use case, hence that should stay as-is. Adding a swarm of additional hashes just makes things heavier, but adds no immediate advantage.
With datalad in the picture, there are two additional use cases.
There is a nidm result pack (zip file) that is part of a datalad dataset. In this case, we would want to report the contained metadata to aid discovery of the NIDM result pack and the files it contains. There is no need to map anything: datalad knows an ID for the zip file, we can associate the metadata in the zip file with the zip file, done. This use case can be supported without any changes, as far as I can see. And it is an important use case (think about datasets with just NIDM result packs...).
The other use case (where my whining originates): I am not interested in a NIDM result pack. I want to use the functionality of nidmfsl to describe an analysis output directory tree that I have right in front of me. I want to attach meaningful metadata to myfirstanalysis.feat/stats/zstat1.nii.gz, so that people can find, not a comprehensive description of an analysis, but the dataset that has the actual output -- maybe because I am interested in something completely different in that dataset, but I use a nidm result property to compose a list of interesting datasets.
For this second use case I am not concerned about the absence of a standard error map, because it is not in the dataset anyway. I am not concerned about not finding a nidm.json file, because I could build one myself with the latest nidmfsl version, given that I found the entire dataset with all its data. This last aspect is absolutely critical IMHO: metadata description standards evolve, and we want to be able to go back to the original data and get a better description for them.
Of course these two use cases are not mutually exclusive. As a datalad user who puts a FEAT output in a dataset, I can decide to include a standardized description of my analysis by running nidmfsl on it. I would do that through datalad run, such that datalad records the provenance of this description generation and associates it with the state of the dataset.
However, as far as datalad as a tool is concerned, I would not want to enforce the presence of a normalized description as part of the dataset content. Here is the reason: say I am interested in doing some analysis across a large number of FSL analyses. I find them all over the web, and they are all in datalad datasets (because that is what people do ;-). I can link them all as subdatasets into a new (super) dataset that will have the code and results of my analysis. I do not want to add any content to those 3rd-party datasets. If I did, I would have a unique version of these datasets that I either need to host myself, push back to 2163 different original locations (after becoming friends with their respective maintainers), or keep for myself (because it is just too complicated otherwise) and thereby break transparency and reproducibility in some way. What datalad can do is extract nidmfsl metadata from those original datasets and aggregate it into my new super-dataset. This dataset then has all the information needed to perform any data discovery task, and only this dataset needs to be hosted to share ALL of what I did with the target audience.
I anticipate that you might say: "why not simply create NIDM result packs for all these analyses and put those into the super-dataset?" The reason is that those packs are standalone and their metadata is self-contained (for good reasons), which implies that I cannot use them to find the actual analysis dataset they were created from -- and this is what datalad needs/wants to do.
How can this be done? I think the cheapest approach is to add an option to nidmfsl that keeps the original path names of the files being described (i.e. no filename normalization). Alternatively, a separate file (outside the result pack) could be generated that contains the mapping: original path -> normalized filename. Based on either of those outputs one should be able to post-process the metadata graph to map its random file IDs to the common ones already used within datalad. I know this is a side-show for nidmfsl, but it would enable cool stuff, I think.
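To make that concrete, the post-processing could look roughly like this (a sketch only: path_map is the proposed "normalized filename -> original path" mapping, and datalad_key is a hypothetical helper returning the key DataLad already knows for a path):

```python
def remap_file_ids(jsonld_docs, path_map, datalad_key):
    """Map nidmfsl's random file IDs onto the identifiers DataLad already uses.

    jsonld_docs: JSON-LD documents produced by nidmfsl (one dict per entity)
    path_map:    normalized filename -> original path inside the dataset
    datalad_key: callable returning the DataLad/git-annex key for a dataset path
    """
    for doc in jsonld_docs:
        name = doc.get("nfo:fileName")
        if name in path_map:
            # re-point the entity at the identifier DataLad already knows
            doc["@id"] = datalad_key(path_map[name])
    return jsonld_docs
```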
@mih - we could add filenames, but i think in general that does not solve the problem. here is an example. let's say i have N bids datasets and i want to run my fsl workflow together with nidmfsl in a container.
```
singularity/docker -<mountflag> /path/to/dataset:/dataset <container_image>
```
all the nidm results will be stored inside with paths like /dataset/<sub-id>/path/to/file. given bids, these paths will be the same across datasets, except perhaps for the dataset with the largest number of subjects, and for variations in task names. in general, there would not be a way to uniquely identify things just from the filename, even if it is not normalized. of course, you could sequence the processing in a way that tells you which dataset is being used, but not from the result packs alone.
thus cryptographic hashes, i think, are still the most relevant bits here to help connect things.
the reason we chose sha512 is that it is faster to compute than sha256 while addressing some levels of paranoia.
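(for the curious, this is easy to check locally with a throwaway snippet; the outcome depends on the CPU and on whether it has dedicated SHA instructions:)

```python
import hashlib
import time

data = b"\0" * (256 * 1024 * 1024)  # 256 MiB of dummy data

for algo in ("sha256", "sha512"):
    start = time.perf_counter()
    hashlib.new(algo, data).hexdigest()
    print(f"{algo}: {time.perf_counter() - start:.2f} s")
```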
> all the nidm results will be stored inside
Ah, I think we have a misunderstanding. You are talking about the use case of nidmfsl creating a description for storage -- that is fine as-is and should not change, and no change in filenames is needed there either.
I am talking about calling the Python API of nidmfsl "myself", inside a DataLad metadata extractor, in order to suck in the JSON-LD output and ingest it into DataLad's metadata handling. Wherever this runs, it runs inside a DataLad dataset that has the original analysis output directory (no nidmfsl files). And any paths it would report would be inside the dataset, whose location I know.
@mih - i see. thanks for clarifying. so you would be running the converter explicitly on a dataset from datalad. yes, for that generating the filename to uuid map would work.
i would still like to solve the general connectivity issue :)
@mih - the other way to do this through the API is to explicitly set the hash function when you call it from datalad (or extend it to generate multiple hashes). the increase in time should be ok.
Discussed w/ @mih today, I should:
- be able to set the IDs using an input JSON file with key-value pairs of the form full_path: ID
Here is a mapping file that I can easily generate for any analysis path and that I would then feed to nidmfsl. Would this work for you?

demo_map.json.txt (for the same dataset and 2nd-level analysis as the demo in #148)
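For illustration, a mapping of that shape could be produced with something along these lines (a sketch; get_id is a hypothetical stand-in for however the DataLad-side IDs are obtained, e.g. the git-annex key):

```python
import json
import os

def write_id_map(analysis_dir, out_file, get_id):
    """Write a JSON file mapping each file's full path to an externally known ID."""
    mapping = {}
    for root, _, files in os.walk(analysis_dir):
        for name in files:
            path = os.path.join(root, name)
            mapping[path] = get_id(path)  # e.g. the DataLad/git-annex key
    with open(out_file, "w") as f:
        json.dump(mapping, f, indent=2)
```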
First a bit of background. Here is a document on a tstats volume that I get from nidmfsl (I have cut the bits that are irrelevant for now). This looks great. However, I have more information that I can get on this file from other sources, and even more from extractors for the NIfTI data type, from a provenance extractor that tells me which command generated this file, etc. Of course I have the desire to connect this information. The question is: how can I do this with minimal effort and interference?

Without considering any changes, I would have to compute SHA512 on all files in my dataset, then fish out documents with a matching crypto:sha512 property, and add some kind of sameAs property pointing to the alternative @id for the same file... Doable, but not nice, given the context (light pun intended ;-).

The context is: this is a DataLad dataset. All files and their versions are uniquely identified already, and so are the datasets that contain them. I want DataLad to be able to report the information about a file that I can get from nidmfsl.
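For concreteness, that checksum-based matching would amount to something like this (a sketch; dataset_files and datalad_key are hypothetical helpers, and owl:sameAs stands in for "some kind of sameAs property"):

```python
import hashlib

def sha512_of(path):
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def link_by_checksum(jsonld_docs, dataset_files, datalad_key):
    """Connect nidmfsl entities to dataset files by matching crypto:sha512."""
    by_checksum = {sha512_of(p): p for p in dataset_files}  # the expensive part
    for doc in jsonld_docs:
        path = by_checksum.get(doc.get("crypto:sha512"))
        if path is not None:
            # record the identifier DataLad uses as an alternative for the same file
            doc["owl:sameAs"] = datalad_key(path)
    return jsonld_docs
```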
I see two ways of making this happen: 1) teach nidmfsl about DataLad and its IDs so it can use them without having to come up with new random ones (I don't think this is the most feasible approach, but it would certainly be an amazing one!); 2) teach NIDMFSL not to build a metadata pack, but leave all files exactly where they are and make it use relative paths (relative to a given root -- the dataset location in my case) as nfo:fileName. As the metadata extraction is guaranteed to happen on the same content, I can easily match files by relative path.

This second approach would also solve another issue: I am not interested in the result package with the files renamed to a standardized form; I actually throw it away. DataLad already gives me the means to obtain exactly what I need in terms of file content -- once I know the file key -- hence my interest is focused on connecting information to this file key (among other desires). With this information I could have a simple helper generate NIDM result packages just from the pre-extracted metadata. So at present, properties like "nfo:fileName": "TStatistic.nii.gz" are actually invalid for a description of dataset content.

So it would be nice to have a "just describe" mode, in addition to the "pack and describe" mode that I can currently use.
Does that make sense? I'd love to get pointers on the direction in which I should be looking/moving. It would be great if I could have everything implemented by OHBM.
Thx much!