pombase / website

PomBase website v2

rename ftp directories & files for browser data #670

Closed ValWood closed 2 years ago

ValWood commented 6 years ago

ftp://ftp.pombase.org/DATASET/

we can do this today.

We will decide what we want to include in the file names to recognise them, and organise by "type of track" (biologically)

(or whatever we think best when we go through them)

this was our simple track naming system https://curation.pombase.org/pombase-trac/wiki/TrackRelabelling

A suggestion

Keep the tracks in directories for each biological data type: transcript, chromatin binding, replication profiling, nucleosome positioning, etc.

Name the files consistently and uniquely: First_authorPMID(some distinguishing criteria for multiple data)

Have a tab-delimited table in each directory with the required metadata (data type (CV), method (CV), track details, strand, citation); see https://curation.pombase.org/pombase-trac/wiki/TrackRelabelling

(should also add external db xref to repository for raw data if available)
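
A minimal sketch of how such a per-directory metadata table might be read, assuming a tab-delimited file with a header row. The column names (`file_name`, `data_type`, `method`, `track_details`, `strand`, `citation`) and the example row are hypothetical; the suggestion above only lists the kinds of metadata wanted, not an actual file layout.

```python
import csv
import io

# Hypothetical column names; only the kinds of metadata are given in the suggestion.
COLUMNS = ["file_name", "data_type", "method", "track_details", "strand", "citation"]


def read_track_metadata(handle):
    """Read a tab-delimited metadata table and return its rows as dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"metadata table lacks columns: {missing}")
    return list(reader)


# Made-up example content, for illustration only.
header = "\t".join(COLUMNS)
row = "\t".join(
    ["Example_PMID_00000000_forward.bw", "transcript", "RNA-seq",
     "example track", "forward", "PMID:00000000"]
)
rows = read_track_metadata(io.StringIO(header + "\n" + row + "\n"))
print(rows[0]["data_type"])
```

Checking the header up front means a table with a missing or misspelled column fails immediately rather than producing tracks with blank metadata.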

ValWood commented 6 years ago

I thought we made a decision about standardising the directory organization/naming?

ValWood commented 6 years ago

I thought the "directory name" was in this file https://docs.google.com/spreadsheets/d/13STLSIvYcqKVaFxz_g3XKOkd-huwbu8qrdQWn9jVN8w/edit#gid=0

but it must be in another one?

@Antonialock ?

ValWood commented 6 years ago

We should probably do the directory arrangement next, as this will change the file paths? Related: https://github.com/pombase/website/issues/692

ValWood commented 4 years ago

from #757

ftp://ftp.pombase.org/external_datasets/

didn't we somewhere discuss a file naming system that would help people to identify datasets?

this seems to be a mixture?

AI

rename files consistently

make a spreadsheet of old name, new name

Author_studyID_year_datatype (matching chado datatype)

(if authors have more than one datatype, like soriano, put data in 2 directories)

This will make it easier for users to collect everything of one data type with no prior knowledge and not needing to go elsewhere.

also need a web page to provide a link to this ticket Kim will add filters to this table -> update the other ticket

open future ticket, Kim will add filtering for access to different datasets

study description in a separate file?
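
The naming and spreadsheet items above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `make_dataset_name` and `write_rename_sheet`, the tab-delimited output, and the example values are all hypothetical, and spaces in a datatype are replaced with underscores as a guess at what a directory-safe name would need.

```python
import csv
import io


def make_dataset_name(author, study_id, year, datatype):
    """Join the parts as Author_studyID_year_datatype, using underscores."""
    parts = [author, study_id, str(year), datatype.replace(" ", "_")]
    if not all(parts):
        raise ValueError("all four name parts are required")
    return "_".join(parts)


def write_rename_sheet(pairs, handle):
    """Write (old_name, new_name) pairs as a tab-delimited sheet with a header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["old_name", "new_name"])
    writer.writerows(pairs)


# Made-up study with two datatypes, so it ends up with two names (two directories).
names = [
    make_dataset_name("Example", "PMID00000000", 2020, datatype)
    for datatype in ("transcripts", "binding sites")
]
sheet = io.StringIO()
write_rename_sheet([("old_dir_a", names[0]), ("old_dir_b", names[1])], sheet)
print(sheet.getvalue())
```

Keeping the old-to-new mapping in a plain sheet means the same file can later drive both the renaming itself and any updates to file paths in the config.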

ValWood commented 4 years ago

This ticket probably needs summarizing. Basically we want to standardize both the directory structure and the file names. This will need to be coordinated with the file names in the config file.

We need to describe the file naming and directory naming system.

ValWood commented 4 years ago

This is out of date. We can put the directory renaming issue in the 'active' ticket.

ValWood commented 4 years ago

The decision was that Kim would create a page with links to the datasets based on the metadata. This directory could actually be archived or removed?

Reopening until the existing directory with ad hoc files is dealt with...

mah11 commented 4 years ago

A table of directory names ... current names also in text file browser_dirnames.txt

| Old name | Suggested new name |
| --- | --- |
| atkinson_2018_transcript | Atkinson_2018_PMID_29914874_transcripts |
| BindingSites | McDowall_2015_PMID_25361970_binding_sites (or update dirname and metadata contents to Lock_2018_PMID_30321395_binding_sites ;) ) |
| Bitton_2014_intron_branch_point | Bitton_2014_PMID_24709818_intron_branch_point |
| Bitton_IBP_24709818 | ? (not referenced in metadata file, but has same PMID as above, so maybe we don't need it) |
| Daigaku | Daigaku_2015_PMID_25664722 (not referenced in metadata file at present, but see below) |
| DJeffares_Diversity | Jeffares_2015_PMID_25665008 |
| ERP001075 | Wilhelm_2012_PMID_18488015 |
| ERP001483 | Marguerat_2012_PMID_23101633 |
| Gagneur_PMID_26883383 | Eser_2016_PMID_26883383 |
| Grech_2019_PMID_31077324 | Grech_2019_PMID_31077324 |
| GSE24360 | Woolcock_2011_PMID_21151114 |
| GSE41773 | Soriano_2013_PMID_24256300 |
| GSE60712_Garg_PMID_25908789 | Garg_2015_PMID_25908789 |
| GSE62108 | merge into Daigaku_2015_PMID_25664722 |
| GSE84910_PMID_27662899_Gonzalez | Gonzalez_2016_PMID_27662899 |
| GSM1519714 | merge into Daigaku_2015_PMID_25664722 |
| GSM1519715 | merge into Daigaku_2015_PMID_25664722 |
| GSM1519716 | merge into Daigaku_2015_PMID_25664722 |
| HuaLiTSS | Li_2015_PMID_25747261 |
| JuanMata_polyA | Mata_2013_PMID_23900342 |
| Lee_PMID_32101745_Hermes | Lee_2020_PMID_32101745_Hermes |
| Mickle_2007 | SPLIT into Segurado_2003_PMID_14566325 and Mickle_2007_PMID_18093330 |
| MSchlackow_polyA | Schlackow_2013_PMID_24152550 |
| NRhind_RepProfile | Xu_2012_PMID_22531001 |
| NRhind_Transcriptome | Rhind_2011_PMID_21511999 |
| Thodberg_GSE110976 | Thodberg_2018_PMID_30566651 |
| Yadav_and_Dubey_SIDD_PMID23163955 | Yadav_2012_PMID_23163955_SIDD |
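
For the simple one-to-one rows in the table above, the renaming could be checked and rehearsed along these lines. This is a sketch, not what was actually run: the pattern `Author_Year_PMID_<digits>[_suffix]` is inferred from the suggested names, the functions `follows_convention`, `plan_renames`, and `apply_renames` are hypothetical, and the merge/split rows would still need handling by hand.

```python
import re
from pathlib import Path

# Pattern inferred from the suggested names: Author_Year_PMID_<digits>[_suffix]
NAME_PATTERN = re.compile(r"^[A-Z][A-Za-z]+_\d{4}_PMID_\d+(_\w+)?$")

# A few one-to-one rows copied from the table.
MAPPING = {
    "HuaLiTSS": "Li_2015_PMID_25747261",
    "JuanMata_polyA": "Mata_2013_PMID_23900342",
    "Grech_2019_PMID_31077324": "Grech_2019_PMID_31077324",
}


def follows_convention(name):
    """True if a directory name matches the inferred naming pattern."""
    return NAME_PATTERN.match(name) is not None


def plan_renames(base_dir, mapping):
    """Return (source, target) paths for existing directories whose name changes."""
    base = Path(base_dir)
    plan = []
    for old, new in mapping.items():
        if old == new:
            continue  # already has the suggested name
        if not follows_convention(new):
            raise ValueError(f"new name does not follow the convention: {new}")
        source = base / old
        if source.is_dir():
            plan.append((source, base / new))
    return plan


def apply_renames(plan, dry_run=True):
    """Print the planned renames, or perform them when dry_run is False."""
    for source, target in plan:
        if target.exists():
            raise FileExistsError(f"target already exists: {target}")
        if dry_run:
            print(f"would rename {source} -> {target}")
        else:
            source.rename(target)
```

Calling `apply_renames(plan_renames(some_dir, MAPPING))` with the default `dry_run=True` only prints what would change, which gives a chance to review the plan before touching anything that JBrowse depends on.
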

ValWood commented 4 years ago

Oh lovely. It was too irritating in a very WTF??? way before. Good to eventually weed out all the Ensembl-style of "no system" data handling.

I agree that we don't need Bitton_IBP_24709818 because I assume IBP is "intron branch point" and there can only be one consensus set. @bahler can you confirm?

bahler commented 4 years ago

Yes, Bitton_2014_intron_branch_point should be enough.

kimrutherford commented 3 years ago

Thanks Midori. Should I go ahead and rename everything based on your table?

mah11 commented 3 years ago

> Should I go ahead and rename everything based on your table?

That would be fine with me! Sounds like Val is happy with it too.

ValWood commented 3 years ago

yes! Thanks!

ValWood commented 3 years ago

Was this done?

kimrutherford commented 3 years ago

Not done yet.

mah11 commented 3 years ago

ping - it'd be nice to get this done so we have sensible names that I can imitate for https://github.com/pombase/curation/issues/3045

kimrutherford commented 2 years ago

I'll do this soon. I plan to do it one weekend as I suspect I will break everything on my first try. I've put it in the PomBase Google calendar for when I'm back from holiday.

kimrutherford commented 2 years ago

I planned to do this over the weekend but I ended up moving my email to Outlook instead. I didn't have time for anything else after that.

I'll try to do this next weekend.

kimrutherford commented 2 years ago

I did the renaming over the weekend. Everything seems fine in JBrowse.

We need to decide about this one:

Bitton_IBP_24709818    ? (not referenced in metadata file, but has same PMID as above, so maybe we don't need it)

mah11 commented 2 years ago

> We need to decide about this one: Bitton_IBP_24709818

I vote we stash the contents somewhere not publicly visible Just In Case, and then blitz it. I'm 99% sure it's the same intron branch point data as Bitton_2014_PMID_24709818_intron_branch_point.

ValWood commented 2 years ago

Agreed, I'm sure that was the only dataset for this paper.

kimrutherford commented 2 years ago

> I vote we stash the contents somewhere not publicly visible Just In Case, and then blitz it.

OK, thanks.

Done.