nih-cfde / cfde-deriva

Collaboration point for miscellaneous CFDE-deriva scripts

bdbag downloaded from portal for superset_collection including files, samples and subjects of subset_collection #356

Open nsuvarnaiari opened 2 years ago

nsuvarnaiari commented 2 years ago

Hi Deriva team,

Question: I have a "superset_collection" with X number of "subset_collections", and the "xxxx_in_collection.tsv" files are filled in for each subset_collection. If I download the bdbag for "superset_collection", do I get the files, subjects and samples associated with each of the "subset_collections", given that I filled them in "xxxx_in_collection.tsv" and the superset_collection<->subset_collection linking is in "collection_in_collection.tsv"?

Karl thinks files, subjects and samples from "subset_collections" will not be included in the bdbag for "superset_collection". He thinks this could be fixed so that it dumps the transitive closure of the collection plus the collections subordinate to it via collection_in_collection.
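For illustration only, here is a minimal sketch of what such a transitive closure could look like. It is not the portal's export code. It assumes the rows of "collection_in_collection.tsv" and "file_in_collection.tsv" have already been loaded as plain pairs, and it reduces each collection or file to a single hypothetical identifier (in C2M2 an identifier is an id_namespace plus local_id pair, which could be passed as a tuple instead). The function names are made up for this example.

```python
from collections import defaultdict, deque


def subordinate_closure(root, collection_in_collection):
    """Return root plus every collection reachable below it.

    collection_in_collection: iterable of (superset_id, subset_id) pairs.
    """
    children = defaultdict(set)
    for superset, subset in collection_in_collection:
        children[superset].add(subset)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:  # guards against cycles and repeats
                seen.add(child)
                queue.append(child)
    return seen


def files_in_closure(root, collection_in_collection, file_in_collection):
    """Files attached to root or to any of its subordinate collections.

    file_in_collection: iterable of (file_id, collection_id) pairs.
    """
    wanted = subordinate_closure(root, collection_in_collection)
    return {file_id for file_id, coll in file_in_collection if coll in wanted}
```

The same lookup would apply to the subject and biosample association files. Without the closure step, only files attached directly to the superset collection would be selected.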

@RLC-DCPPC @lliming @karlcz bringing this issue to your notice for future discussion.

Thanks, Suvvi

karlcz commented 2 years ago

Hmm, this didn't get hooked into the project planning and will not be addressed in the upcoming release.

Also, thinking about this a little more, it is unfortunately pretty complicated and nuanced. I think we will need further discussion to see if we can find consensus on export mode(s) that are of general use. I do not know right now which user expectations can be met and/or which export modes are easiest to explain.

But, I think it is infeasible to say that we will walk transitive closures of the many paths in C2M2 because it would often distort a filtered export back into a much larger set of items due to all the interconnectivity. If too many paths effectively mean "full export", I think we might as well just offer a canonical full dump BDBag for those who want to spelunk all the data, while keeping a much slimmer, narrower export mode for dynamic filters so that people can ask for brief subsets directly focused on their search criteria...

@RLC-DCPPC @lliming @abradyIGS @mikedarcy

karlcz commented 2 years ago

As a general rule right now, the exports have a focus on the central table from which the user activates the export option.

  1. The central table should have only the C2M2 entities matched by their search criteria, or the single entity if they exported from a single record (detail) page.
  2. Other tables are brought in via a connection/relevance to the entities exported in the central table. We can only follow one "path" for each export table, so we have chosen some reasonable heuristics for the most significant path. (See below)
  3. Sometimes the extra connected tables may dump a superset, where we exploit some path through the portal model which we know brings all the relevant values but might also bring irrelevant ones. For example, we might dump some vocabulary terms even though they are not actually referenced by the core entities in the user query.
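As a rough sketch of rules 1 and 2, the selection can be thought of as a filter on the central table followed by a single hop through one association table. This is an illustration under simplifying assumptions, not the portal's implementation: rows are plain dicts with a hypothetical "id" key, the association is a list of (central_id, related_id) pairs, and the function name is invented here.

```python
def export_by_focus(central_rows, matches, association, related_rows):
    """Select central rows by a predicate, then pull related rows via one path.

    central_rows / related_rows: lists of dicts with an "id" key (assumed shape).
    matches: predicate standing in for the user's search criteria.
    association: iterable of (central_id, related_id) pairs.
    """
    # Rule 1: only the entities matched by the search go in the central table.
    central = [row for row in central_rows if matches(row)]
    central_ids = {row["id"] for row in central}

    # Rule 2: follow exactly one association path out of the central table.
    related_ids = {rel for cen, rel in association if cen in central_ids}
    related = [row for row in related_rows if row["id"] in related_ids]
    return central, related
```

For a File-focused export, for example, central_rows would be the file records, association the file_describes_biosample pairs, and related_rows the biosample records.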

Export paths by focus

This is a summary of the export modes in the portal as of 2022-06. Each subsection is named by the central focus table that the user is viewing when they activate an export. The list of exported paths describes what content is exported.

Collection

  1. collection.csv: the exact collections matched by the search
  2. file.csv: all files associated to (1) by file_in_collection records
  3. biosample.csv: all biosamples associated to (1) by biosample_in_collection records
  4. subject.csv: all subjects associated to (1) by subject_in_collection records
  5. file_format.csv, data_type.csv, assay_type.csv, anatomy.csv, disease.csv, phenotype.csv, gene.csv, substance.csv, compound.csv, protein.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
  6. collection_disease.csv, collection_phenotype.csv, collection_gene.csv, collection_compound.csv, collection_substance.csv, collection_taxonomy.csv, collection_anatomy.csv, collection_protein.csv: all associations referencing (1)
  7. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (3)
  8. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (4)
  9. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
  10. project_in_project.csv: all associations where child-project is in (9)
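The "reflexive, transitive closure of ancestor/super projects" used for project.csv, and the child-based filter used for project_in_project.csv, can be sketched as follows. This is a hedged illustration, not the portal's query: project identifiers are simplified to single hypothetical values, project_in_project rows are assumed to be (parent_id, child_id) pairs, and both function names are invented for the example.

```python
from collections import defaultdict


def ancestor_closure(direct_projects, project_in_project):
    """Directly linked projects plus all of their ancestors.

    project_in_project: iterable of (parent_id, child_id) pairs.
    """
    parents_of = defaultdict(set)
    for parent, child in project_in_project:
        parents_of[child].add(parent)

    closure = set(direct_projects)  # reflexive: the starting projects are included
    stack = list(direct_projects)
    while stack:
        project = stack.pop()
        for parent in parents_of.get(project, ()):
            if parent not in closure:  # transitive: keep climbing
                closure.add(parent)
                stack.append(parent)
    return closure


def exported_project_links(closure, project_in_project):
    """Keep the associations whose child project is in the exported set."""
    return [(p, c) for p, c in project_in_project if c in closure]
```

Note that the walk goes upward only, so sibling projects and their descendants are not pulled in. That keeps the exported project tree small compared with a closure that follows links in both directions.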

Notable gaps:

File

  1. file.csv: the exact files matched by the user search
  2. biosample.csv: all biosamples linked to (1) by direct file_describes_biosample associations
  3. subject.csv: all subjects linked to (1) by direct file_describes_subject associations
  4. file_format.csv, data_type.csv, assay_type.csv, anatomy.csv, disease.csv, gene.csv, substance.csv, compound.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
  5. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (2)
  6. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (3)
  7. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
  8. project_in_project.csv: all associations where child-project is in (7)

Notable gaps:

Biosample

  1. biosample.csv: the exact biosamples matched by the user search
  2. subject.csv: all subjects linked to (1) by direct biosample_from_subject associations
  3. assay_type.csv, anatomy.csv, disease.csv, gene.csv, substance.csv, compound.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact, pubchem_fact, protein_fact, or gene_fact search classes referenced by (1)
  4. biosample_disease.csv, biosample_gene.csv, biosample_substance.csv, biosample_from_subject.csv: all associations referencing (1)
  5. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (2)
  6. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
  7. project_in_project.csv: all associations where child-project is in (6)

Notable gaps:

Subject

  1. subject.csv: the exact subjects matched by the user search
  2. disease.csv, subject_granularity.csv, subject_role.csv, ncbi_taxonomy.csv, sex.csv, race.csv, ethnicity.csv: all terms linked to the core_fact search classes referenced by (1)
  3. subject_role_taxonomy.csv, subject_race.csv, subject_substance.csv, subject_disease.csv, subject_phenotype.csv: all associations referencing (1)
  4. project.csv: all projects linked to the core_fact search classes referenced by (1) as well as reflexive, transitive closure of ancestor/super projects of those directly linked
  5. project_in_project.csv: all associations where child-project is in (4)

Notable gaps: