Closed ValWood closed 2 years ago
I thought we made a decision about standardising the directory organization/naming?
I thought the "directory name" was in this file https://docs.google.com/spreadsheets/d/13STLSIvYcqKVaFxz_g3XKOkd-huwbu8qrdQWn9jVN8w/edit#gid=0
but it must be in another one?
@Antonialock ?
We should probably do the directory arrangement next, as this will change the file paths. Related: https://github.com/pombase/website/issues/692
from #757
ftp://ftp.pombase.org/external_datasets/
didn't we somewhere discuss a file naming system that would help people to identify datasets?
this seems to be a mixture?
AI
- rename files consistently
- make a spreadsheet of old name, new name
- name format: Author_studyID_year_datatype (matching chado datatype); if authors have more than one datatype, like Soriano, put data in 2 directories. This will make it easier for users to collect everything of one data type with no prior knowledge and without needing to go elsewhere.
- also need a web page to provide a link to this ticket
- Kim will add filters to this table -> update the other ticket
- open future ticket, Kim will add filtering for access to different datasets
- study description in a separate file?
This ticket probably needs summarizing. Basically we want to standardize both the directory structure and the file names. This will need to be coordinated with the file names in the config file.
We need to describe the file naming and directory naming system.
This is out of date. We can put the directory renaming issue in the 'active' ticket.
The decision was that Kim would create a page with links to the datasets based on the metadata. This directory could actually be archived or removed?
Reopening until the existing directory with ad hoc files is dealt with...
A table of directory names ... current names also in text file browser_dirnames.txt
Old name | Suggested new name |
---|---|
atkinson_2018_transcript | Atkinson_2018_PMID_29914874_transcripts |
BindingSites | McDowall_2015_PMID_25361970_binding_sites (or update dirname and metadata contents to Lock_2018_PMID_30321395_binding_sites ;) ) |
Bitton_2014_intron_branch_point | Bitton_2014_PMID_24709818_intron_branch_point |
Bitton_IBP_24709818 | ? (not referenced in metadata file, but has same PMID as above, so maybe we don't need it) |
Daigaku | Daigaku_2015_PMID_25664722 (not referenced in metadata file at present, but see below) |
DJeffares_Diversity | Jeffares_2015_PMID_25665008 |
ERP001075 | Wilhelm_2012_PMID_18488015 |
ERP001483 | Marguerat_2012_PMID_23101633 |
Gagneur_PMID_26883383 | Eser_2016_PMID_26883383 |
Grech_2019_PMID_31077324 | Grech_2019_PMID_31077324 |
GSE24360 | Woolcock_2011_PMID_21151114 |
GSE41773 | Soriano_2013_PMID_24256300 |
GSE60712_Garg_PMID_25908789 | Garg_2015_PMID_25908789 |
GSE62108 | merge into Daigaku_2015_PMID_25664722 |
GSE84910_PMID_27662899_Gonzalez | Gonzalez_2016_PMID_27662899 |
GSM1519714 | merge into Daigaku_2015_PMID_25664722 |
GSM1519715 | merge into Daigaku_2015_PMID_25664722 |
GSM1519716 | merge into Daigaku_2015_PMID_25664722 |
HuaLiTSS | Li_2015_PMID_25747261 |
JuanMata_polyA | Mata_2013_PMID_23900342 |
Lee_PMID_32101745_Hermes | Lee_2020_PMID_32101745_Hermes |
Mickle_2007 | SPLIT into Segurado_2003_PMID_14566325 and Mickle_2007_PMID_18093330 |
MSchlackow_polyA | Schlackow_2013_PMID_24152550 |
NRhind_RepProfile | Xu_2012_PMID_22531001 |
NRhind_Transcriptome | Rhind_2011_PMID_21511999 |
Thodberg_GSE110976 | Thodberg_2018_PMID_30566651 |
Yadav_and_Dubey_SIDD_PMID23163955 | Yadav_2012_PMID_23163955_SIDD |
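
A minimal sketch of how the renaming in the table above could be checked and dry-run before touching the real directories. The regular expression is only inferred from the suggested names (Author_year_PMID_number, with an optional datatype suffix), the `RENAMES` mapping copies a few rows from the table, and the function names and base directory are hypothetical. Rows marked "merge into" or "SPLIT into" need manual handling and are not covered.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from the table: Author_YYYY_PMID_<digits>[_datatype]
NAME_PATTERN = re.compile(r"^[A-Z][A-Za-z]+_\d{4}_PMID_\d+(_[A-Za-z_]+)?$")

# A few old -> new pairs copied from the table above (not the full list).
RENAMES = {
    "atkinson_2018_transcript": "Atkinson_2018_PMID_29914874_transcripts",
    "DJeffares_Diversity": "Jeffares_2015_PMID_25665008",
    "GSE41773": "Soriano_2013_PMID_24256300",
    "HuaLiTSS": "Li_2015_PMID_25747261",
}


def follows_convention(name: str) -> bool:
    """Return True if a directory name matches the assumed pattern."""
    return NAME_PATTERN.match(name) is not None


def plan_renames(base_dir, renames):
    """Build a list of (source, destination) paths without changing anything.

    Only directories that exist under base_dir, and whose destination
    does not exist yet, are included in the plan.
    """
    base = Path(base_dir)
    plan = []
    for old, new in renames.items():
        if not follows_convention(new):
            raise ValueError(f"new name does not follow the convention: {new}")
        src, dst = base / old, base / new
        if src.is_dir() and not dst.exists():
            plan.append((src, dst))
    return plan


def apply_renames(plan):
    """Carry out a plan produced by plan_renames."""
    for src, dst in plan:
        src.rename(dst)
```

Printing the result of `plan_renames` first, and calling `apply_renames` only once the plan looks right, keeps the risky step separate. The paths in the config file would still need updating to match.
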
Oh lovely. It was too irritating in a very WTF??? way before. Good to eventually weed out all the Ensembl-style of "no system" data handling.
I agree that we don't need Bitton_IBP_24709818 because I assume IBP is "intron branch point" and there can only be one consensus set. @bahler can you confirm?
Yes, Bitton_2014_intron_branch_point should be enough.
Thanks Midori. Should I go ahead and rename everything based on your table?
> Should I go ahead and rename everything based on your table?
That would be fine with me! Sounds like Val is happy with it too.
yes! Thanks!
Was this done?
Not done yet.
ping - it'd be nice to get this done so we have sensible names that I can imitate for https://github.com/pombase/curation/issues/3045
I'll do this soon. I plan to do it one weekend as I suspect I will break everything on my first try. I've put it in the PomBase Google calendar for when I'm back from holiday.
I planned to do this over the weekend but I ended up moving my email to Outlook instead. I didn't have time for anything else after that.
I'll try to do this next weekend.
I did the renaming over the weekend. Everything seems fine in JBrowse.
We need to decide about this one:
Bitton_IBP_24709818 ? (not referenced in metadata file, but has same PMID as above, so maybe we don't need it)
> We need to decide about this one: Bitton_IBP_24709818
I vote we stash the contents somewhere not publicly visible Just In Case, and then blitz it. I'm 99% sure it's the same intron branch point data as Bitton_2014_PMID_24709818_intron_branch_point.
Agreed, I'm sure that was the only dataset for this paper.
> I vote we stash the contents somewhere not publicly visible Just In Case, and then blitz it.
OK, thanks.
Done.
ftp://ftp.pombase.org/DATASET/
we can do this today.
we will decide what we want to include in the file names to recognise them, and organise by "type of track" (biologically)
(or whatever we think best when we go through them)
this was our simple track naming system https://curation.pombase.org/pombase-trac/wiki/TrackRelabelling
A suggestion
Keep the tracks in directories for each biological data type: transcript, chromatin binding, replication profiling, nucleosome positioning, etc.
Name the files consistently and uniquely: First_authorPMID(some distinguishing criteria for multiple data)
Have a tab-delimited table in each directory with the required metadata (data type (CV), method (CV), track details, strand, citation); see https://curation.pombase.org/pombase-trac/wiki/TrackRelabelling
(should also add external db xref to repository for raw data if available)
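
A rough sketch of what the suggestion above might look like in practice: one helper that builds a file name from first author, PMID and an optional distinguishing label, and one that writes the per-directory metadata table as tab-delimited text. The column headers, the underscore separators, the function names and the placeholder row are all assumptions for illustration, not an agreed format.

```python
import csv
import io

# Column headers are assumptions based on the fields listed above.
COLUMNS = [
    "file_name", "data_type", "method", "track_details",
    "strand", "citation", "raw_data_xref",
]


def track_file_name(first_author, pmid, distinguisher=""):
    """Build First_author_PMID_<id>[_distinguisher]; separators are assumed."""
    parts = [first_author, f"PMID_{pmid}"]
    if distinguisher:
        parts.append(distinguisher)
    return "_".join(parts)


def write_metadata(rows, handle):
    """Write the metadata rows as a tab-delimited table with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Placeholder row only; none of these values describe a real dataset.
example_row = {
    "file_name": track_file_name("Example", "00000000", "forward"),
    "data_type": "transcript",
    "method": "placeholder method",
    "track_details": "placeholder details",
    "strand": "forward",
    "citation": "PMID:00000000",
    "raw_data_xref": "",
}

buffer = io.StringIO()
write_metadata([example_row], buffer)
print(buffer.getvalue())
```

Keeping the data type and method as controlled-vocabulary terms in their own columns would let a links page filter on them without parsing file names.
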