tidying the ftp directory and building JBrowse config

ValWood commented 6 years ago

List of how to refer to specific data set types

Human-readable name	File name
Variation	Variation
Poly(A) sites	PolyA
Replication profiling	Replication_profiling
Intron branch point	Intron_branch_point
Chromatin binding	Chromatin_binding
Nucleosome positioning	Nucleosome_positioning
Transcripts	Transcripts

CV of assay types

Assay type
DamID / tiling array
HT sequencing
mRNA end sequencing
NGS
RNA-seq
tiling microarray

Old to new file name conversion

File naming convention: datatype_author_studyID_PMID

old name	new name
BindingSites
Bitton_IBP_24709818/	Intron_branch_point_Bitton_GSE50246_PMID:24709818 (GEO)
DJeffares_Diversity/	There should be three datasets: 1. Sequence_variation_Jeffares_PRJEB2733&PRJEB6284 (ENA) (the variation track says "all data sources" so I presume it is two datasets conglomerated into one track) 2. SNPs_variants_Jeffares_974514578-974688138 (NCBI dbSNP) 3. Indels_Jeffares_974702618-974688139 (NCBI dbSNP)
Daigaku/	??_Daigaku_GSE62108_PMID:25664722 (GEO) replicative polymerase usage, not shown in browser?
GSM1519714/	??_Daigaku_GSE62108_PMID:25664722 (GEO) replicative polymerase usage, not shown in browser?
GSM1519715/	??_Daigaku_GSE62108_PMID:25664722 (GEO) replicative polymerase usage, not shown in browser?
GSM1519716/	??_Daigaku_GSE62108_PMID:25664722 (GEO) replicative polymerase usage, not shown in browser?
GSE62108/	??_Daigaku_GSE62108_PMID:25664722 (GEO), replicative polymerase usage, not shown in browser?
ERP001075/	Wilhelm_E-MTAB-5_PMID:18488015 (ArrayExpress)
ERP001483/	Transcripts_Marguerat_E-MTAB-1154_PMID:23101633 (ArrayExpress)
GSE24360/	Chromatin_binding_Woolcock_GSE24360 (GEO)
GSE41773/	Transcripts_Soriano_GSE41773_PMID:24256300 (GEO)
GSE41773/	Nucleosome_positioning_Soriano_GSE41773_PMID:24256300 (GEO)
HuaLiTSS/	?? I am taking a wild swing here, could this be it? Transcription_start_sites_Li_E-MTAB-3188_PMID:25747261 I don't think this is shown in the genome browser
JuanMata_polyA/	PolyA_Mata_E-MTAB-1642_PMID:23900342 (ArrayExpress)
MSchlackow_polyA/	PolyA_Schlackow_OMICS_06155_PMID:24152550 (omictools)
NRhind_RepProfile/	Replication_profiling_Xu_SRP009399_PMID:22531001 (SRA)
NRhind_Transcriptome/	TranscriptomeRhind? Broad FTP https://www.broadinstitute.org/scientific-community/science/projects/fungal-genome-initiative/schizosaccharomyces-genomes-project

ValWood commented 6 years ago

JBrowse config file

https://docs.google.com/spreadsheets/d/13STLSIvYcqKVaFxz_g3XKOkd-huwbu8qrdQWn9jVN8w/edit?usp=sharing

ValWood commented 6 years ago

I have added Bitton, Schlackow and Mata to the Google spreadsheet in the required format. @Antonialock could you add the others?

I listed anything that needs to be clarified at the top as a question related to the dataset

Eventually, once we have reconciled everything, we can delete the spurious columns and this can become the JBrowse config file.

ValWood commented 6 years ago

Leave the Jeffares variation data out for now as I don't know if we can do anything with variation data in JBrowse.

mah11 commented 6 years ago

I have moved the binding site file to pombe-embl/ftp_site/pombe/feature_coordinates/TF_binding_motifs.bed (feature_coordinates is a new subdirectory, with a new README that leaves room for us to add more files if we want), so the BindingSites directory can vanish into history (or /dev/null/, as you prefer).

kimrutherford commented 6 years ago

I have moved the binding site file to

Thanks.

I've changed the web server configuration so that the files in the "pombe" directory are also available via HTTPS. That works better with JBrowse.

So that binding site file is also here: https://www.pombase.org/pombase_datasets/genome_sequence_and_features/feature_coordinates/TF_binding_motifs.bed

The external datasets are available from URLs like: https://www.pombase.org/external_datasets/ERP001483/ERS146817.bam
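A minimal sketch of how those URLs could be assembled from a dataset directory and a file name. It assumes every external dataset sits directly under `external_datasets/<directory>/<file>`, as in the example above; the function name is made up for illustration.

```python
from urllib.parse import quote

# Base URL taken from the example above.
BASE_URL = "https://www.pombase.org/external_datasets/"


def external_dataset_url(dataset_dir, file_name):
    """Build the HTTPS URL for one file in an external dataset directory.

    Assumption: files live directly under <BASE_URL><dataset_dir>/.
    """
    # Directory names in the table above carry a trailing slash, e.g. "ERP001483/".
    directory = dataset_dir.strip("/")
    return BASE_URL + quote(directory) + "/" + quote(file_name)


print(external_dataset_url("ERP001483/", "ERS146817.bam"))
```

The same function would give the `urlTemplate` value for a JBrowse track once the file-to-track mapping is known.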

Antonialock commented 6 years ago

So, err, this wasn't straightforward... see the question marks in the updated original comment.

Antonialock commented 6 years ago

I also added the PMID and the database each dataset is located in, for reference. We can keep what we like.

Antonialock commented 6 years ago

To quote a famous 'politician': BIG MESS

ValWood commented 6 years ago

I think I prefer on balance to have author name and PMID as the file name, and put the database accession in standard format DB:accession as a column in the Google doc

https://docs.google.com/spreadsheets/d/13STLSIvYcqKVaFxz_g3XKOkd-huwbu8qrdQWn9jVN8w/edit?usp=sharing

so that we can display the accession as one of the configuration options. It will be far more consistent.

Also, we can't have colons in file names, so we will need to use PMID_1234.

We need a standard, and it will always look too messy using a mixture of accessions.

We can deal with pre-publication datasets if we need to later, by having a temporary ID or something. I don't envisage this being an issue because we have enough published data to host...
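A rough sketch of the naming rule as it stands after this comment: author name plus PMID, no colons, with the accession kept in the spreadsheet instead. Keeping the data type as a prefix is an assumption carried over from the earlier datatype_author_studyID_PMID convention, and the function name is hypothetical.

```python
import re


def dataset_name(data_type, first_author, pmid):
    """Build a colon-free name such as 'PolyA_Mata_PMID_23900342'.

    data_type should already be in its file-name form (e.g. 'PolyA',
    not 'Poly(A) sites'). Assumption: the data type prefix is kept.
    """
    # Accept 23900342, "23900342", "PMID:23900342" or "PMID_23900342".
    match = re.fullmatch(r"(?:PMID[:_])?(\d+)", str(pmid).strip())
    if match is None:
        raise ValueError(f"not a PubMed ID: {pmid!r}")
    parts = [data_type, first_author, "PMID_" + match.group(1)]
    cleaned = []
    for part in parts:
        part = re.sub(r"\s+", "_", part.strip())
        # Drop anything that is awkward in a file name, colons included.
        cleaned.append(re.sub(r"[^A-Za-z0-9_-]", "", part))
    return "_".join(cleaned)


print(dataset_name("PolyA", "Mata", "PMID:23900342"))
print(dataset_name("Intron_branch_point", "Bitton", 24709818))
```

A pre-publication dataset would need a temporary ID in place of the PMID; this sketch just raises an error in that case.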

ValWood commented 6 years ago

Daigaku
https://www.ncbi.nlm.nih.gov/pubmed/25664722 (I know Ensembl definitely hosted this dataset)

ValWood commented 6 years ago

http://listserver.ebi.ac.uk/pipermail/pombelist/2015/004291.html

ValWood commented 6 years ago

The next step is to get the mapping of individual files to tracks from Ensembl (who wants to ask first? @kimrutherford could you try, CCing both Steve and Paul?).

Then @Antonialock, could you populate the Google spreadsheet, with an additional column for the database accession (always in the format DB:accession)? You can probably pre-populate quite a lot of the spreadsheet without the file names.

kimrutherford commented 6 years ago

The next step is to get the mapping of individual files to tracks from Ensembl

And fixing the chromosome IDs in the files. I'm working on that now.

@kimrutherford could you try, CCing both Steve and Paul

Will do.

kimrutherford commented 6 years ago

After a bit of digging I found the Ensembl Genomes pombe configuration file. I've added it to SVN as: website/ensembl_schizosaccharomyces_pombe.ini

Are the Ensembl Genomes tracks the same as we had in the browser for PomBase V1? If so, that file should be all we need. If not, at least we know the name of the file to ask for. I've checked the archive they sent us and it doesn't contain any config files like that.

The config file was buried in: ftp://ftp.ensemblgenomes.org/pub/current/virtual_machines/EnsemblGenomes_38_Browser.ova The file name in that image was "schizosaccharomyces_pombe.ini"

kimrutherford commented 6 years ago

Here is an example of a track configuration.

The GSE24360_01 is the internal track ID. I suggest we keep using those IDs as it's one less thing to change. They are for internal use only and won't be visible to the users.

Ensembl has source_name, description and label which are shown in different places.

JBrowse has a key for each track which is shown in the menu at the left and on top of the track. I think the Ensembl source_name will be too long for that so we'll need to come up with something shorter for each track. We can show the source_name (and any text we like) in the "About this track" pop-up.

I'm writing a bit of Perl that will parse the whole Ensembl config and create a table with all the useful attributes for each track. We can edit that and create the JBrowse config from it.

It looks like the variant section on the Ensembl pages isn't configured with this config file. My guess is that they query the variant databases rather than configuring. So we'll need to add those manually.

[GSE24360_01]
source_name = Chromatin binding - DamID / tiling array - Swi6 binding sites, unfused Dam used as experimental control, repeat 1 - Woolcock (2011)
description = <br>EuropePMC: <a href="http://europepmc.org/abstract/MED/21151114">Woolcock <em>et al</em> 2011</a><br>GEO: <a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24360">GSE24360</a>
caption     = Woolcock (2011)
source_url  = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/GSE24360/GSE24360_expr_M_CorV2_01.bw
source_type = bigWig
display     = off
colour      = #263B21
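For illustration, here is one way the config-to-table step could look. The actual script is Perl; this is a Python sketch using the standard `configparser` module, and it assumes the track sections are plain INI like the one above. The column list and function names are made up for the example.

```python
import configparser
import csv
import io

# A cut-down copy of the section shown above.
EXAMPLE = """\
[GSE24360_01]
source_name = Chromatin binding - DamID / tiling array - Swi6 binding sites, unfused Dam used as experimental control, repeat 1 - Woolcock (2011)
caption     = Woolcock (2011)
source_url  = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/GSE24360/GSE24360_expr_M_CorV2_01.bw
source_type = bigWig
display     = off
colour      = #263B21
"""

COLUMNS = ["track_id", "source_name", "caption", "source_url",
           "source_type", "display", "colour"]


def read_tracks(ini_text):
    """Return one dict per [section], keyed by the column names above."""
    # interpolation=None keeps any '%' in URLs literal; only '=' separates
    # keys from values so the ':' in URLs is never treated as a delimiter.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(ini_text)
    rows = []
    for track_id in parser.sections():
        row = {"track_id": track_id}
        for column in COLUMNS[1:]:
            row[column] = parser.get(track_id, column, fallback="")
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


tracks = read_tracks(EXAMPLE)
print(to_tsv(tracks))
```

Note that `configparser` treats lines starting with `#` or `;` as comments, so sections that are commented out in the Ensembl file would be skipped unless they are uncommented first.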

kimrutherford commented 6 years ago

Two sections are commented out. Should I include these in the table I make?

[MSchlackow_13_ssc_P]
source_name       = Poly(A) sites - RNA-Seq - strand specific cycling (forward strand) - Schlackow (2013)
description       = <br>EuropePMC: <a href="http://europepmc.org/abstract/MED/24152550">Schlackow <em>et al</em> 2013</a>
caption           = Schlackow (2013)
source_url        = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/MSchlackow_polyA/Strandspecific_cycling_P.bw
source_type       = bigWig
display           = off
colour            = #263B21

[MSchlackow_14_ssc_N]
source_name       = Poly(A) sites - RNA-Seq - strand specific cycling (reverse strand) - Schlackow (2013)
description       = <br>EuropePMC: <a href="http://europepmc.org/abstract/MED/24152550">Schlackow <em>et al</em> 2013</a>
caption           = Schlackow (2013)
source_url        = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/MSchlackow_polyA/Strandspecific_cycling_N.bw
source_type       = bigWig
display           = off
colour            = #263B21

kimrutherford commented 6 years ago

I've just noticed there are about 30 other tracks that are in the config file but are disabled. They're all from "Wilhelm (2008)". They aren't visible in the Ensembl Genomes browser either. Does anyone know the story behind that?

They look like:


[ERS078438]
source_name        = Transcripts - RNA-Seq - run34_s3 (ERS078438) - Wilhelm (2008)
caption            = Wilhelm (2008)
description        = <br>High throughput sequenceing of fission yeast to survey the dynamic repertoire of a eukaryotic transcriptome at the single nucleotide resolution (Study <a href="http://www.ebi.ac.uk/ena/data/view/ERP001075">ERP001075</a>)<br>EuropePMC: <a href="http://europepmc.org/abstract/MED/18488015">Wilhelm 2012</a><br>ArrayExpress: <a href="https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5/">E-MTAB-5</a>
source_url         = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/ERP001075/ERS078438.bw
source_type        = bigWig
display            = off
colour             = #263B21

kimrutherford commented 6 years ago

I'm writing a bit of Perl that will parse the whole Ensembl config and create a table

I've had a go at that. I've pulled out what I can and tried to break the different bits into separate columns. It's in Dropbox: Dropbox/pombase/PomBase_website/ensembl_track_config.tsv

I'm likely to overwrite that file as I tweak the Perl script so if you edit that file the changes will be lost.

Once we're happy with what we've got from the Ensembl config it will be quite straightforward to generate the JBrowse track configuration.

The "description" from the Ensembl config is HTML. I've attempted to pull the useful stuff out into the columns labelled:

short_description
first_author_and_date
pubmed_id
array_express_id
geo_id
ena_study_id
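As an illustration of the ID columns, the sketch below pulls the accessions out of the link targets in a description. It is not the Perl script itself: the regular expressions are guesses based only on the URL shapes in the config sections quoted in this thread, so other link formats would be missed.

```python
import re

# One pattern per output column, matched against the href targets.
# Assumption: links always look like the ones quoted in this thread.
PATTERNS = {
    "pubmed_id": re.compile(r"europepmc\.org/abstract/MED/(\d+)"),
    "geo_id": re.compile(r"acc\.cgi\?acc=(GSE\d+)"),
    "array_express_id": re.compile(r"arrayexpress/experiments/(E-[A-Z]+-\d+)"),
    "ena_study_id": re.compile(r"ena/data/view/([A-Z]+\d+)"),
}


def extract_ids(description_html):
    """Return a dict with one entry per column; '' when no link is found."""
    ids = {}
    for column, pattern in PATTERNS.items():
        match = pattern.search(description_html)
        ids[column] = match.group(1) if match else ""
    return ids


woolcock = (
    '<br>EuropePMC: <a href="http://europepmc.org/abstract/MED/21151114">'
    'Woolcock <em>et al</em> 2011</a><br>GEO: '
    '<a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24360">'
    'GSE24360</a>'
)
print(extract_ids(woolcock))
```

Leaving a column empty rather than guessing makes the gaps easy to spot when the table is reviewed by hand.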

kimrutherford commented 6 years ago

first_author_and_date

That's now two columns:

first_author
pub_date
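A small sketch of that split, assuming the value being split has the same "Author (year)" shape as the `caption` lines in the config (for example "Woolcock (2011)"). Whether the real script reads the caption or another field is not stated here, so treat the input as an assumption.

```python
import re


def split_author_and_date(text):
    """Split 'Woolcock (2011)' into ('Woolcock', '2011').

    Falls back to (text, '') when there is no four-digit year in brackets.
    """
    match = re.fullmatch(r"\s*(.+?)\s*\((\d{4})\)\s*", text)
    if match is None:
        return text.strip(), ""
    return match.group(1), match.group(2)


first_author, pub_date = split_author_and_date("Woolcock (2011)")
print(first_author, pub_date)
```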

ValWood commented 6 years ago

Great, this is the info I was trying to collate here:

https://docs.google.com/spreadsheets/d/13STLSIvYcqKVaFxz_g3XKOkd-huwbu8qrdQWn9jVN8w/edit#gid=0

If you can populate as many as possible, @Antonialock will probably only need to populate the track description manually.

ValWood commented 6 years ago

Two sections are commented out. Should I include these in the table I make?

that accounts for why I could not match up the tracks with the directory. I think it was because some tracks were redundant. However, I don't think these were the correct ones to hide. It's all a bit confusing because the names don't really make sense. Some are described as strand specific and some as all data, but some described as "all data" have a forward or reverse strand caption.

I suggest that, for now, we include them all. Once we have the set up we can ask the authors to check. Or @Antonialock can check with them as she sanity-checks that the descriptions in the spreadsheet are correct.

ValWood commented 6 years ago

I think the Wilhelm dataset was one we were trying to get hosted next (I was trying to get all of the transcriptome data next so we could assess the best UTR predictions), but it got sidelined because of the Drupal migration... Include it and see what happens.

kimrutherford commented 6 years ago

but other Ids in one column comma separated ?

It's easier to get link-outs working from JBrowse if we keep them separate.

Ignore variation data for now.

JBrowse can handle VCF files so it should be able to do something with the variation data.

Once we have the set up we can ask the authors to check.

Could we ask the authors to contribute descriptions for the files/tracks in the cases where we have trouble?

ValWood commented 6 years ago

Sorry, that wasn't clear. We could probably just use one database ID per dataset. Things in ArrayExpress have a GEO ID and vice versa, and an ENA ID (for AE) or an SRA ID (for GEO). I think maybe we only need one accession, and that should always be the AE or GEO one, wherever the dataset was submitted? We should report the GEO/AE accession, which holds all of the metadata, not the ENA/SRA one, which are the nucleotide repositories.
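A sketch of that rule: report a single accession in DB:accession form, preferring GEO or ArrayExpress over the nucleotide repositories. Two things here are assumptions rather than anything agreed above: falling back to ENA/SRA when no GEO/AE accession is known, and the order used when both GEO and AE are present (the table alone cannot tell us where a dataset was originally submitted).

```python
# Metadata-rich submission databases first, nucleotide repositories last.
PREFERRED = ("GEO", "ArrayExpress")
FALLBACK = ("ENA", "SRA")  # assumption: better than reporting nothing


def pick_accession(ids):
    """Return one 'DB:accession' string for a dataset, or '' if none known.

    ids maps a database name to an accession, e.g. {"GEO": "GSE24360"}.
    """
    for database in PREFERRED + FALLBACK:
        accession = ids.get(database, "")
        if accession:
            return f"{database}:{accession}"
    return ""


print(pick_accession({"ENA": "ERP001075", "ArrayExpress": "E-MTAB-5"}))
```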

Antonialock commented 6 years ago

Just to make sure, I'm waiting for a list of tracks that I will then add missing descriptions to?

ValWood commented 6 years ago

wait until Kim has got as much data into the spreadsheet automated.

Then need to add any missing data/fields (it might only be tweaking the track display name if you are lucky)

Antonialock commented 6 years ago

goodie goodie (I'd be surprised :p )

On Fri, Feb 2, 2018 at 10:13 AM, Val Wood notifications@github.com wrote:

wait until Kim has got as much data into the spreadsheet automated.

Then need to add any missing data/fields (it might only be tweaking the track display name if you are lucky)

-- Antonia Lock, PhD PomBase Biocurator, http://www.pombase.org Department of Genetics, Evolution and Environment, The Darwin Building, University College London London WC1E 6BT, UK

kimrutherford commented 6 years ago

wait until Kim has got as much data into the spreadsheet automated.

I think I've extracted all I can from the Ensembl config into here: Dropbox/pombase/PomBase_website/ensembl_track_config.tsv

Perhaps we could add any extra information as columns in that file? Some of the information is duplicated so we can probably remove some columns too.

Might be good to talk about that on Thursday. Once we add a bit more to that file I'll write a little script to create the JBrowse configuration from it.
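To give an idea of the last step, here is a hedged sketch that turns table rows into a JBrowse 1 style `trackList.json` fragment. The store and track class names are the standard JBrowse 1 ones for bigWig and BAM. The `short_name` column is hypothetical (the shorter track names have not been written yet), and putting the long Ensembl `source_name` under `metadata` so that it shows in the track's about dialog is an assumption about how we would want it displayed.

```python
import json

# JBrowse 1 store/track classes, keyed by lower-cased file type.
TRACK_CLASSES = {
    "bigwig": ("JBrowse/Store/SeqFeature/BigWig",
               "JBrowse/View/Track/Wiggle/XYPlot"),
    "bam": ("JBrowse/Store/SeqFeature/BAM",
            "JBrowse/View/Track/Alignments2"),
}


def jbrowse_track(row):
    """Build one track entry from a row of the track table."""
    store_class, track_type = TRACK_CLASSES[row["source_type"].lower()]
    return {
        "label": row["track_id"],      # internal ID, kept from Ensembl
        "key": row["short_name"],      # hypothetical short display name
        "storeClass": store_class,
        "type": track_type,
        "urlTemplate": row["source_url"],
        "metadata": {"description": row["source_name"]},
    }


def track_list_json(rows):
    """Serialise all rows as the 'tracks' list of a trackList.json file."""
    return json.dumps({"tracks": [jbrowse_track(r) for r in rows]}, indent=2)


example_row = {
    "track_id": "GSE24360_01",
    "short_name": "Swi6 DamID repeat 1 - Woolcock (2011)",  # made up
    "source_name": "Chromatin binding - DamID / tiling array - Swi6 binding "
                   "sites, unfused Dam used as experimental control, "
                   "repeat 1 - Woolcock (2011)",
    "source_type": "bigWig",
    "source_url": "http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/"
                  "GSE24360/GSE24360_expr_M_CorV2_01.bw",
}
print(track_list_json([example_row]))
```

A row with an unknown `source_type` raises a `KeyError`, which seems preferable to silently writing a track that JBrowse cannot load.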

ValWood commented 6 years ago

OK we can go through this tomorrow. It seems that you have made good progress.

ValWood commented 6 years ago

I think we are all in here for this: https://github.com/pombase/website/issues/691

pombase / website

tidying the ftp directory and building JBrowse config #672