legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

New collection type: transcription #52

Closed StevenCannon-USDA closed 1 week ago

StevenCannon-USDA commented 4 months ago

I have provisionally added two collections under a new collection type: transcription: https://data.legumeinfo.org/Glycine/max/transcription/

The collections contain transcription start sites (TSS) and transcription start regions (TSR) for Glycine max Wm82, genome assemblies 2 and 4. These will accompany this paper from Jianxin Ma's group (currently in preprint form): https://doi.org/10.1101/2024.03.27.587116

A general question, I think for @adf-ncgr particularly: are there better ways to handle such features? They could conceivably be added to the corresponding main annotation collections; but they are distinct feature types and we want to associate them with a particular publication. In this respect, they are similar to markers (but they are not markers).

And an implementation question, probably for @nathanweeks and @weihuang12 : the primary destination for this data is as tracks on the JBrowse instances for these assemblies (Wm82.gnm2 and Wm82.gnm4). Can one or both of you make that happen? Is more needed on the data side? I'll mention that I have not yet committed these changes to https://github.com/legumeinfo/datastore-metadata; I'll wait for consensus that this is a suitable way of handling this kind of data in the Data Store.

Also tagging @maxglycine and @jd-campbell to track the issue.

adf-ncgr commented 4 months ago

@StevenCannon-USDA I think keeping them as a separate collection as you suggest is best. I could imagine lumping them under the existing "annotations" type, but you're probably right to split them out into a separate one given that they are only loosely associated with gene models. Thanks for checking

StevenCannon-USDA commented 4 months ago

OK - I have gone ahead and committed the metadata to the datastore-metadata repo.

StevenCannon-USDA commented 4 months ago

Reactivating this thread, since it is a bigger can o' worms than it had seemed at first.

The SoyBase group is discussing another data set: recombination hotspots (or haplotype blocks?). The paper in question is https://academic.oup.com/g3journal/article/5/10/1999/6028905 ... but the topic here is broader.

Where to put sequence features that aren't among our standard collection types?

@maxglycine pointed out that the Sequence Ontology is a good place to look for potential types. Here is a partial list of types that wouldn't fit easily in our existing collection types:

  open_chromatin_region (SO:0001747)
  rearrangement_region (SO:0001872)
  pseudogenic_region (SO:0000462)
  mutational_hotspot (SO:0002186)
  CpG_island (SO:0000307)
  binding_site (SO:0000409)
  accessible_DNA_region (SO:0002331)
  sequence_motif (SO:0001683)
  repeat_region (SO:0000657)
    centromeric_repeat (SO:0001797)
  inversion (SO:1000036)
  substitution (SO:1000002)
  deletion (SO:0000159)
  recombination_feature (SO:0000298)
  gene_group (SO:0005855)
  transcript_region (SO:0000833)
  origin_of_replication (SO:0000296)
  intergenic_region (SO:0000605)

This makes me think that we need a generic seq_feature collection, to house genome-located features other than standard annotation features, markers, and GWAS features.

These might look like e.g.

seq_feature/Wm82.gnm1.sfeat.Song_Hyten_2015
  synopsis: Haplotype block regions for Glycine max accession Williams 82 genome assembly 1
seq_feature/Wm82.gnm2.sfeat.Wang_Duan_2024
  synopsis: Transcription initiation-site information for Glycine max accession Williams 82 genome assembly 2
seq_feature/Wm82.gnm4.sfeat.Wang_Duan_2024
  synopsis: Transcription initiation-site information for Glycine max accession Williams 82 genome assembly 4

... or we could simplify the type name to feature.
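For anyone scripting against names like these, here is a minimal sketch of splitting a proposed collection name into its parts. The dot-separated pattern (accession, gnm plus assembly number, type key, source key) is inferred only from the three examples above and is not taken from the specification; `parse_collection_name` and the field names are hypothetical.

```python
import re

# Assumed pattern, inferred only from the example names in this thread:
#   <accession>.gnm<N>.<type key>.<source key>
# The field names below are hypothetical labels, not datastore terminology.
COLLECTION_RE = re.compile(
    r"^(?P<accession>[^.]+)"
    r"\.gnm(?P<assembly>\d+)"
    r"\.(?P<type_key>[^.]+)"
    r"\.(?P<source_key>[^.]+)$"
)


def parse_collection_name(name):
    """Split a collection name into its dot-separated parts."""
    match = COLLECTION_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected collection name: {name!r}")
    parts = match.groupdict()
    parts["assembly"] = int(parts["assembly"])  # "2" -> 2
    return parts


print(parse_collection_name("Wm82.gnm2.sfeat.Wang_Duan_2024"))
```

A name that does not have exactly these four parts raises a ValueError rather than being guessed at.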

Reactions to adding a generic new category?

StevenCannon-USDA commented 1 week ago

This issue has been resolved as follows: a new sequence_feature collection type, e.g. https://data.legumeinfo.org/Glycine/max/sequence_feature/

Within each collection, column 3 should continue to follow the GFF3 spec:

  type - type of feature. Must be a term or accession from the SOFA sequence ontology
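The column 3 rule could be checked with something like the sketch below. It is an illustration only: `ALLOWED` holds three terms copied from the list earlier in this thread rather than the real SOFA term set (which would need to be loaded from the ontology itself), and `invalid_feature_types` is a hypothetical helper, not part of any datastore tooling.

```python
# Illustrative subset only, copied from the SO terms listed in this thread.
# A real check would load the full SOFA term list instead.
ALLOWED = {
    "recombination_feature": "SO:0000298",
    "binding_site": "SO:0000409",
    "repeat_region": "SO:0000657",
}
ALLOWED_ACCESSIONS = set(ALLOWED.values())


def invalid_feature_types(gff_lines):
    """Return {line number: column 3 value} for GFF3 rows whose type is not allowed."""
    bad = {}
    for line_no, line in enumerate(gff_lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more feature rows
        if not line or line.startswith("#"):
            continue  # blank lines, directives, comments
        cols = line.split("\t")
        if len(cols) != 9:
            bad[line_no] = "<expected 9 tab-separated columns>"
            continue
        ftype = cols[2]  # column 3: a term name or an SO accession
        if ftype not in ALLOWED and ftype not in ALLOWED_ACCESSIONS:
            bad[line_no] = ftype
    return bad


sample = [
    "##gff-version 3\n",
    "Chr01\tsrc\trepeat_region\t100\t200\t.\t+\t.\tID=r1\n",
    "Chr01\tsrc\tSO:0000298\t300\t400\t.\t.\t.\tID=h1\n",
    "Chr01\tsrc\thotspot\t500\t600\t.\t.\t.\tID=x1\n",
]
print(invalid_feature_types(sample))  # {4: 'hotspot'}
```

Both the term name and the accession are accepted, matching the "term or accession" wording of the rule; the sample rows and the `src` source value are made up for the example.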