make non DOI data sets identifiable

theislab / sfaira

data and model repository for single-cell data

https://sfaira.readthedocs.io

BSD 3-Clause "New" or "Revised" License


make non DOI data sets identifiable #57

Closed davidsebfischer closed 3 years ago

davidsebfischer commented 3 years ago

This applies to all data sets currently in d_nan.

Zethson commented 3 years ago

Which solution do you envision? Should we just generate a unique hash, e.g. compute an MD5 for each file and then derive a single hash from those?

Alternatively, we could build a consistent, unique scheme from the names?
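For reference, the hash-based option could look roughly like the sketch below. It is only an illustration of the "MD5 per file, then a hash of those" idea; the function names `file_md5` and `dataset_hash` are hypothetical and not part of sfaira.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of one file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_hash(paths):
    """Combine per-file MD5 digests into one data set level digest.

    The per-file digests are sorted first, so the result depends only on
    file contents and not on file names or the order of `paths`.
    """
    per_file = sorted(file_md5(p) for p in paths)
    return hashlib.md5("".join(per_file).encode("ascii")).hexdigest()
```

Note that such a hash identifies the file contents only; it changes whenever a file is re-uploaded or re-compressed, and it is not human-readable.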

davidsebfischer commented 3 years ago

I'd prefer a unique scheme based on names. I would consider MD5 an orthogonal mechanism that could be applied to all data sets. Essentially, we can use the same scheme as before and just need to replace the DOI with a constant string, for example.

davidsebfischer commented 3 years ago

@Zethson let's also decide this now? How about we structure this by source, e.g. we could create "no_doi_10x_genomics" as a DOI equivalent, meaning that their website identifies these data sets? If we find more such sources, we could add them similarly. This would also make sense in terms of how these data files are then stored on disk: this way they will all lie together.

Zethson commented 3 years ago

Yeah, that sounds reasonable. I like that it will result in a nice grouping.
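The source-based scheme agreed on above could be sketched as follows. This is a minimal illustration, not sfaira's actual implementation: the helper names `source_key` and `dataset_id`, the `no_doi_` prefix handling, and the example name parts are all assumptions made for the sketch.

```python
import re

# Assumed constant prefix marking "DOI equivalents" for data sets without a DOI.
NO_DOI_PREFIX = "no_doi_"


def source_key(source_name):
    """Turn a source name such as "10x Genomics" into a DOI-equivalent key."""
    # Lowercase and collapse every run of non-alphanumeric characters into "_".
    slug = re.sub(r"[^a-z0-9]+", "_", source_name.lower()).strip("_")
    if not slug:
        raise ValueError("source name must contain at least one letter or digit")
    return NO_DOI_PREFIX + slug


def dataset_id(name_parts, doi=None, source=None):
    """Build an ID from name parts plus either the DOI or the source key.

    `name_parts` stands in for whatever fields the existing naming scheme
    uses; only the last component changes for data sets without a DOI.
    """
    if doi is None and source is None:
        raise ValueError("either doi or source is required")
    suffix = doi if doi is not None else source_key(source)
    return "_".join(list(name_parts) + [suffix])


print(source_key("10x Genomics"))
print(dataset_id(["human", "blood", "2019"], source="10x Genomics"))
```

Because every data set from one source shares the same key, using that key as a directory name would keep those files together on disk, which is the grouping mentioned above.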