Open nsuvarnaiari opened 2 years ago
Hmm, this didn't get hooked into the project planning and will not be addressed in the upcoming release.
Also, thinking about this a little more, it is unfortunately pretty complicated and nuanced. I think we will need further discussion to see if we can find consensus on export mode(s) that are of general use. I do not know right now which user expectations can be met and/or which export modes are easiest to explain.
But I think it is infeasible to say that we will walk transitive closures of the many paths in C2M2, because that would often distort a filtered export back into a much larger set of items due to all the interconnectivity. If too many paths effectively mean "full export", I think we might as well just offer a canonical full dump BDBag for those who want to spelunk all the data, while keeping a much slimmer, narrower export mode for dynamic filters so that people can ask for brief subsets directly focused on their search criteria.
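To make the concern concrete, here is a minimal sketch of how following every association path can inflate a narrow selection. The node names and edges are made-up toy data, not real C2M2 content, and `closure` is a hypothetical helper, not part of the portal's export code.

```python
from collections import deque

def closure(seeds, edges):
    """Return every node reachable from `seeds`, treating each
    association edge as walkable in both directions."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Toy association graph (hypothetical identifiers).
edges = [
    ("file:f1", "biosample:b1"),
    ("biosample:b1", "subject:s1"),
    ("subject:s1", "biosample:b2"),
    ("biosample:b2", "file:f2"),
    ("file:f2", "collection:c1"),
    ("collection:c1", "file:f3"),
    ("file:f4", "biosample:b3"),
]
all_nodes = {node for edge in edges for node in edge}

narrow = {"file:f1"}           # what the user's filter matched
wide = closure(narrow, edges)  # what a full transitive walk would export

print(f"matched {len(narrow)} item(s), closure pulls in {len(wide)} of {len(all_nodes)}")
```

In this toy graph a single matched file drags in most of the catalog, which is the "filtered export turns into a full export" effect described above.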
@RLC-DCPPC @lliming @abradyIGS @mikedarcy
As a general rule right now, the exports have a focus on the central table from which the user activates the export option.
This is a summary of the export modes in the portal as of 2022-06. Each subsection is named by the central focus table that the user is viewing when they activate an export. The list of exported paths describes what content is exported.
Collection

1. collection.csv: the exact collections matched by the search
2. file.csv: all files associated to (1) by file_in_collection records
3. biosample.csv: all biosamples associated to (1) by biosample_in_collection records
4. subject.csv: all subjects associated to (1) by subject_in_collection records
5. file_format.csv, data_type.csv, assay_type.csv, anatomy.csv, disease.csv, phenotype.csv, gene.csv, substance.csv, compound.csv, protein.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
6. collection_disease.csv, collection_phenotype.csv, collection_gene.csv, collection_compound.csv, collection_substance.csv, collection_taxonomy.csv, collection_anatomy.csv, collection_protein.csv: all associations referencing (1)
7. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (3)
8. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (4)
9. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
10. project_in_project.csv: all associations where child-project is in (9)

Notable gaps:
- file_describes_biosample and file_describes_subject do not seem to be dumped at all, inconsistently with biosample_from_subject
- biosample_from_subject.csv might reference subjects which are not included in subject.csv since the latter is dumped via the subject_in_collection path
- file_in_collection, biosample_in_collection, subject_in_collection, and collection_in_collection are not dumped at all

File

1. file.csv: the exact files matched by the user search
2. biosample.csv: all biosamples linked to (1) by direct file_describes_biosample associations
3. subject.csv: all subjects linked to (1) by direct file_describes_subject associations
4. file_format.csv, data_type.csv, assay_type.csv, anatomy.csv, disease.csv, gene.csv, substance.csv, compound.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
5. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (2)
6. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (3)
7. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
8. project_in_project.csv: all associations where child-project is in (7)

Notable gaps:
- collection and collection-level associations are not dumped at all
- file_describes_biosample and file_describes_subject do not seem to be dumped at all, inconsistently with biosample_from_subject
- biosample_from_subject.csv might reference subjects which are not included in subject.csv since the latter is dumped via the file_describes_subject path

Biosample

1. biosample.csv: the exact biosamples matched by the user search
2. subject.csv: all subjects linked to (1) by direct biosample_from_subject associations
3. assay_type.csv, anatomy.csv, disease.csv, gene.csv, substance.csv, compound.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
4. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (1)
5. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (2)
6. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
7. project_in_project.csv: all associations where child-project is in (6)

Notable gaps:
- collection and collection-level associations are not dumped at all
- file and file-level associations are not dumped at all

Subject

1. subject.csv: the exact subjects matched by the user search
2. disease.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact search classes referenced by (1)
3. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (1)
4. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
5. project_in_project.csv: all associations where child-project is in (4)

Notable gaps:
- collection and collection-level associations are not dumped at all
- file and file-level associations are not dumped at all
- biosample and biosample-level associations are not dumped at all
- substance is not dumped even though subject_substance is!
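One of the gaps noted above, biosample_from_subject.csv rows pointing at subjects that are missing from subject.csv, can be checked mechanically on an exported bag. The following is a rough sketch only: the column names (`id_namespace`, `local_id`, `subject_id_namespace`, `subject_local_id`) are assumed to follow the usual C2M2 key layout, and `dangling_subject_refs` is a hypothetical helper, not an existing tool. The CSV content is toy data.

```python
import csv
import io

def dangling_subject_refs(subject_csv, biosample_from_subject_csv):
    """Return association rows whose subject key is absent from subject.csv.

    Both arguments are open text file objects. Column names are assumptions
    about the export layout, not confirmed against the portal.
    """
    subjects = {
        (row["id_namespace"], row["local_id"])
        for row in csv.DictReader(subject_csv)
    }
    return [
        row
        for row in csv.DictReader(biosample_from_subject_csv)
        if (row["subject_id_namespace"], row["subject_local_id"]) not in subjects
    ]

# Toy stand-ins for two files from an export.
subject_csv = io.StringIO("id_namespace,local_id\nns,s1\n")
biosample_from_subject_csv = io.StringIO(
    "biosample_id_namespace,biosample_local_id,subject_id_namespace,subject_local_id\n"
    "ns,b1,ns,s1\n"
    "ns,b2,ns,s2\n"
)

missing = dangling_subject_refs(subject_csv, biosample_from_subject_csv)
for row in missing:
    print("dangling subject reference:", row["subject_local_id"])
```

The same pattern would apply to any association table whose target table is exported via a different path than the association itself.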
Hi Deriva team,
Question: I have a "superset_collection" with X "subset_collections", and the "xxxx_in_collection.tsv" files are filled in for each subset collection. If I download the bdbag for "superset_collection", will I get the files, subjects and samples associated with each of the "subset_collections", given that I listed them in "xxxx_in_collection.tsv" and the superset_collection<->subset_collection linking is in "collection_in_collection.tsv"?
Karl thinks files, subjects and samples from the "subset_collections" will not be included in the bdbag for "superset_collection". He thinks this could be fixed so that the export dumps the transitive closure of the collection plus the collections subordinate to it via collection-in-collection.
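For illustration, the difference between the current behavior and the suggested fix can be sketched as follows. This is not the portal's implementation: the identifiers are toy data, the association rows are simplified to plain pairs, and `subordinate_collections` and `members` are hypothetical helpers.

```python
def subordinate_collections(root, collection_in_collection):
    """Return `root` plus every collection nested under it, following
    (superset, subset) pairs transitively."""
    children = {}
    for superset, subset in collection_in_collection:
        children.setdefault(superset, []).append(subset)
    seen = {root}
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:  # also guards against accidental cycles
                seen.add(child)
                stack.append(child)
    return seen

def members(collections, x_in_collection):
    """Items whose (item, collection) row points at one of `collections`."""
    return {item for item, coll in x_in_collection if coll in collections}

# Toy rows standing in for collection_in_collection.tsv and file_in_collection.tsv.
collection_in_collection = [
    ("super", "sub_a"),
    ("super", "sub_b"),
    ("sub_a", "sub_a1"),
    ("other", "sub_z"),
]
file_in_collection = [
    ("f1", "super"),
    ("f2", "sub_a"),
    ("f3", "sub_a1"),
    ("f4", "other"),
]

closure = subordinate_collections("super", collection_in_collection)
direct_only = members({"super"}, file_in_collection)   # export without the closure
with_closure = members(closure, file_in_collection)    # export with the suggested fix

print(sorted(direct_only), sorted(with_closure))
```

The same `members` lookup would apply to the biosample_in_collection and subject_in_collection rows.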
@RLC-DCPPC @lliming @karlcz bringing this issue to your notice for future discussion.
Thanks, Suvvi