bdbag downloaded from portal for superset_collection including files, samples and subjects of subset_collection

nsuvarnaiari commented 2 years ago

Hi Deriva team,

Question: I have a "superset_collection" with X "subset_collections", and the "xxxx_in_collection.tsv" files are filled in for each subset collection. If I download the bdbag for "superset_collection", will I see the files, subjects and samples associated with each of the "subset_collections", given that I filled them in "xxxx_in_collection.tsv" and the superset_collection<->subset_collection linking is in "collection_in_collection.tsv"?

Karl thinks files, subjects and samples from "subset_collections" will not be included in the bdbag for "superset_collection". He thinks this could be fixed so that the export dumps the transitive closure of the collection plus the collections subordinate to it via collection_in_collection.

@RLC-DCPPC @lliming @karlcz bringing this issue to your notice for future discussion.

Thanks, Suvvi

karlcz commented 2 years ago

Hmm, this didn't get hooked into the project planning and will not be addressed in the upcoming release.

Also, thinking about this a little more, it is unfortunately pretty complicated and nuanced. I think we will need further discussion to see if we can find consensus on export mode(s) that are of general use. I do not know right now which user expectations can be met and/or which export modes are easiest to explain.

But, I think it is infeasible to say that we will walk transitive closures of the many paths in C2M2, because it would often distort a filtered export back into a much larger set of items due to all the interconnectivity. If too many paths effectively mean "full export", I think we might as well just offer a canonical full dump BDBag for those who want to spelunk all the data, while keeping a much slimmer, narrower export mode for dynamic filters so that people can ask for brief subsets directly focused on their search criteria...
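To make the "transitive closure via collection-in-collection" idea concrete, here is a minimal sketch of what such a walk would compute. It is not the portal's export code. It assumes the collection_in_collection rows have already been reduced to hypothetical (superset_id, subset_id) pairs of plain strings; the function name and sample IDs are made up for illustration.

```python
from collections import deque


def collection_closure(start, subset_edges):
    """Return `start` plus every collection reachable through subset links.

    `subset_edges` is an iterable of (superset_id, subset_id) pairs.
    The ID format is an assumption made for this sketch.
    """
    # Index the edges so each superset maps to its direct subsets.
    children = {}
    for superset_id, subset_id in subset_edges:
        children.setdefault(superset_id, set()).add(subset_id)

    # Breadth-first walk; `seen` also protects against cycles.
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical sample data: "super" has two subsets, one of which nests another.
edges = [
    ("super", "sub_a"),
    ("super", "sub_b"),
    ("sub_a", "sub_a1"),
    ("other", "sub_x"),
]
closure = collection_closure("super", edges)
```

Every collection in the result would then pull in its own files, biosamples and subjects, which is how a narrow export can grow once more than one kind of path is followed transitively.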

@RLC-DCPPC @lliming @abradyIGS @mikedarcy

karlcz commented 2 years ago

As a general rule right now, the exports have a focus on the central table from which the user activates the export option.

The central table should have only the C2M2 entities matched by their search criteria, or the single entity if they exported from a single record (detail) page.
Other tables are brought in via a connection/relevance to the entities exported in the central table. We can only follow one "path" for each export table, so we have chosen some reasonable heuristics for the most significant path. (See below)
Sometimes the extra connected tables may dump a superset where we could exploit some path through the portal model which we know brings all the relevant values but might also bring more irrelevant ones too. For example, we might dump some vocabulary terms even though they are not actually referenced by the core entities in the user query.

Export paths by focus

This is a summary of the export modes in the portal as of 2022-06. Each subsection is named by the central focus table that the user is viewing when they activate an export. The list of exported paths describes what content is exported.

Collection

1. collection.csv: the exact collections matched by the search
2. file.csv: all files associated to (1) by file_in_collection records
3. biosample.csv: all biosamples associated to (1) by biosample_in_collection records
4. subject.csv: all subjects associated to (1) by subject_in_collection records
5. file_format.csv, data_type.csv, assay_type.csv, anatomy.csv, disease.csv, phenotype.csv, gene.csv, substance.csv, compound.csv, protein.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
6. collection_disease.csv, collection_phenotype.csv, collection_gene.csv, collection_compound.csv, collection_substance.csv, collection_taxonomy.csv, collection_anatomy.csv, collection_protein.csv: all associations referencing (1)
7. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (3)
8. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (4)
9. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
10. project_in_project.csv: all associations where child-project is in (9)

Notable gaps:

file_describes_biosample and file_describes_subject do not seem to be dumped at all, inconsistently with biosample_from_subject
biosample_from_subject.csv might reference subjects which are not included in subject.csv since the latter is dumped via the subject_in_collection path
file_in_collection, biosample_in_collection, subject_in_collection, and collection_in_collection are not dumped at all
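The second gap above (associations pointing at subjects that are missing from subject.csv) is easy to check for in a downloaded export. The following is a rough sketch only: the column names are assumed to follow C2M2 naming (id_namespace/local_id on the entity, subject_id_namespace/subject_local_id on the association), and the headers of the exported CSV files may differ, so they are passed in as parameters. The helper name and the inline sample data are hypothetical.

```python
import csv
import io


def dangling_subject_refs(
    subject_rows,
    link_rows,
    key_cols=("id_namespace", "local_id"),
    ref_cols=("subject_id_namespace", "subject_local_id"),
):
    """List subject keys referenced by link_rows but absent from subject_rows.

    Both inputs are iterables of dicts, e.g. from csv.DictReader.
    Default column names are assumptions based on C2M2 naming.
    """
    known = {tuple(row[c] for c in key_cols) for row in subject_rows}
    referenced = {tuple(row[c] for c in ref_cols) for row in link_rows}
    return sorted(referenced - known)


# Hypothetical stand-ins for subject.csv and biosample_from_subject.csv.
subject_csv = "id_namespace,local_id\nns,S1\n"
link_csv = (
    "biosample_id_namespace,biosample_local_id,"
    "subject_id_namespace,subject_local_id\n"
    "ns,B1,ns,S1\n"
    "ns,B2,ns,S2\n"
)

subjects = list(csv.DictReader(io.StringIO(subject_csv)))
links = list(csv.DictReader(io.StringIO(link_csv)))
missing = dangling_subject_refs(subjects, links)
```

With real files, the two `io.StringIO` objects would be replaced by open file handles for the CSVs inside the unpacked bag. A non-empty result means the association table references subjects that the chosen export path did not bring along.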

File

1. file.csv: the exact files matched by the user search
2. biosample.csv: all biosamples linked to (1) by direct file_describes_biosample associations
3. subject.csv: all subjects linked to (1) by direct file_describes_subject associations
4. file_format.csv, data_type.csv, assay_type.csv, anatomy.csv, disease.csv, gene.csv, substance.csv, compound.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
5. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (2)
6. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (3)
7. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
8. project_in_project.csv: all associations where child-project is in (7)

Notable gaps:

by design, collection and collection-level associations are not dumped at all
file_describes_biosample and file_describes_subject do not seem to be dumped at all, inconsistently with biosample_from_subject
biosample_from_subject.csv might reference subjects which are not included in subject.csv since the latter is dumped via the file_describes_subject path

Biosample

1. biosample.csv: the exact biosamples matched by the user search
2. subject.csv: all subjects linked to (1) by direct biosample_from_subject associations
3. assay_type.csv, anatomy.csv, disease.csv, gene.csv, substance.csv, compound.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
4. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (1)
5. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (2)
6. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
7. project_in_project.csv: all associations where child-project is in (6)

Notable gaps:

by design, collection and collection-level associations are not dumped at all
by design, file and file-level associations are not dumped at all

Subject

1. subject.csv: the exact subjects matched by the user search
2. disease.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact search classes referenced by (1)
3. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (1)
4. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
5. project_in_project.csv: all associations where child-project is in (4)

Notable gaps:

by design, collection and collection-level associations are not dumped at all
by design, file and file-level associations are not dumped at all
by design, biosample and biosample-level associations are not dumped at all
substance is not dumped even though subject_substance is!
