Closed ValWood closed 6 years ago
I have added Bitton, Schlackow and Mata to the Google spreadsheet in the required format. @Antonialock could you add the others?
I listed anything that needs to be clarified at the top as a question related to the dataset.
Eventually, once we have reconciled everything, we can delete the spurious columns and this can become the JBrowse config file.
Leave the Jeffares variation data out for now as I don't know if we can do anything with variation data in JBrowse.
I have moved the binding site file to pombe-embl/ftp_site/pombe/feature_coordinates/TF_binding_motifs.bed (feature_coordinates is a new subdirectory, with a new README that leaves room for us to add more files if we want), so the BindingSites directory can vanish into history (or /dev/null/, as you prefer).
I have moved the binding site file to
Thanks.
I've changed the web server configuration so that the files in the "pombe" directory are also available via HTTPS. That works better with JBrowse.
So that binding site file is also here: https://www.pombase.org/pombase_datasets/genome_sequence_and_features/feature_coordinates/TF_binding_motifs.bed
The external datasets are available from URLs like: https://www.pombase.org/external_datasets/ERP001483/ERS146817.bam
So, err, this wasn't straightforward... see the question marks in the updated original comment.
I also added the PMID and which database each dataset is located in, for reference. We can keep what we like.
to quote a famous 'politician' BIG MESS
I think I prefer on balance to have author name and PMID as the file name, and put the database accession in standard format DB:accession as a column in the Google doc
https://docs.google.com/spreadsheets/d/13STLSIvYcqKVaFxz_g3XKOkd-huwbu8qrdQWn9jVN8w/edit?usp=sharing
so that we can display the accession as one of the configuration options. It will be far more consistent.
Also, we can't have ~semi~colons in file names, so we will need to use PMID_1234.
We need a standard, and it will always look too messy using a mixture of accessions.
We can deal with pre-publication datasets if we need to later, by having a temporary ID or something. I don't envisage this being an issue because we have enough published data to host...
Daigaku
https://www.ncbi.nlm.nih.gov/pubmed/25664722
(I know Ensembl definitely hosted this dataset)
The next step is to get the mapping of individual files to tracks from Ensembl (who wants to ask first? @kimrutherford could you try CCing both Steve and Paul?)
Next step is to get the mapping of individual files to tracks from Ensembl
And fixing the chromosome IDs in the files. I'm working on that now.
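For the text-based files (such as the BED file of binding sites), fixing the chromosome IDs could look roughly like the sketch below. The mapping is a placeholder: the real source and target IDs aren't given in this thread, so `CHROMOSOME_ID_MAP` and `rename_chromosomes` are hypothetical names with made-up values.

```python
# Hypothetical mapping from the chromosome IDs used in a source file to the
# IDs the browser expects. These values are placeholders, not the real IDs.
CHROMOSOME_ID_MAP = {
    "chr1": "I",
    "chr2": "II",
    "chr3": "III",
}


def rename_chromosomes(lines, id_map=CHROMOSOME_ID_MAP):
    """Yield BED lines with the first (chromosome) column rewritten via id_map.

    Header lines (track/browser/#), blank lines and unknown IDs pass through
    unchanged.
    """
    for line in lines:
        if line.startswith(("track", "browser", "#")) or not line.strip():
            yield line
            continue
        chrom, sep, rest = line.partition("\t")
        yield id_map.get(chrom, chrom) + sep + rest
```

This only helps for tab-separated text formats. Binary files such as bigWig or BAM would need to be converted or re-headered with the appropriate tools instead.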
@kimrutherford could you try CCing both Steve and Paul
Will do.
After a bit of digging I found the Ensembl Genomes pombe configuration file. I've added it to SVN as: website/ensembl_schizosaccharomyces_pombe.ini
Are the Ensembl Genomes tracks the same as we had in the browser for PomBase V1? If so, that file should be all we need. If not, at least we know the name of the file to ask for. I've checked the archive they sent us and it doesn't contain any config files like that.
The config file was buried in: ftp://ftp.ensemblgenomes.org/pub/current/virtual_machines/EnsemblGenomes_38_Browser.ova The file name in that image was "schizosaccharomyces_pombe.ini"
Here is an example of a track configuration.
The "GSE24360_01" is the internal track ID. I suggest we keep using those IDs as it's one less thing to change. They are for internal use only and won't be visible to the users.
Ensembl has "source_name", "description" and "label", which are shown in different places.
JBrowse has a "key" for each track, which is shown in the menu at the left and on top of the track. I think the Ensembl "source_name" will be too long for that, so we'll need to come up with something shorter for each track. We can show the "source_name" (and any text we like) in the "About this track" pop-up.
I'm writing a bit of Perl that will parse the whole Ensembl config and create a table with all the useful attributes for each track. We can edit that and create the JBrowse config from it.
It looks like the variant section on the Ensembl pages isn't configured with this config file. My guess is that they query the variant databases rather than configuring. So we'll need to add those manually.
[GSE24360_01]
source_name = Chromatin binding - DamID / tiling array - Swi6 binding sites, unfused Dam used as experimental control, repeat 1 - Woolcock (2011)
description = <br>EuropePMC: <a href="http://europepmc.org/abstract/MED/21151114">Woolcock <em>et al</em> 2011</a><br>GEO: <a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24360">GSE24360</a>
caption = Woolcock (2011)
source_url = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/GSE24360/GSE24360_expr_M_CorV2_01.bw
source_type = bigWig
display = off
colour = #263B21
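For illustration, here is a minimal Python sketch of turning sections like the one above into a table. The script mentioned in this thread is Perl, so this is not that script. It assumes the file is plain INI that Python's `configparser` can read and that each track is one section; `COLUMNS`, `ini_to_rows` and `rows_to_tsv` are made-up names.

```python
import configparser
import csv
import io

# One section copied from the example above.
EXAMPLE_INI = """\
[GSE24360_01]
source_name = Chromatin binding - DamID / tiling array - Swi6 binding sites, unfused Dam used as experimental control, repeat 1 - Woolcock (2011)
caption = Woolcock (2011)
source_url = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/GSE24360/GSE24360_expr_M_CorV2_01.bw
source_type = bigWig
display = off
colour = #263B21
"""

COLUMNS = ["track_id", "source_name", "caption", "source_url",
           "source_type", "display", "colour"]


def ini_to_rows(ini_text):
    """Return one dict per INI section, keyed by the names in COLUMNS."""
    # interpolation=None so any '%' in a URL or description is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    rows = []
    for track_id in parser.sections():
        section = parser[track_id]
        row = {"track_id": track_id}
        for column in COLUMNS[1:]:
            row[column] = section.get(column, fallback="")
        rows.append(row)
    return rows


def rows_to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Sections that are commented out in the file would be skipped by this parser, so they would need to be uncommented (or handled separately) to appear in the table.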
Two sections are commented out. Should I include these in the table I make?
[MSchlackow_13_ssc_P]
source_name = Poly(A) sites - RNA-Seq - strand specific cycling (forward strand) - Schlackow (2013)
description = <br>EuropePMC: <a href="http://europepmc.org/abstract/MED/24152550">Schlackow <em>et al</em> 2013</a>
caption = Schlackow (2013)
source_url = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/MSchlackow_polyA/Strandspecific_cycling_P.bw
source_type = bigWig
display = off
colour = #263B21
[MSchlackow_14_ssc_N]
source_name = Poly(A) sites - RNA-Seq - strand specific cycling (reverse strand) - Schlackow (2013)
description = <br>EuropePMC: <a href="http://europepmc.org/abstract/MED/24152550">Schlackow <em>et al</em> 2013</a>
caption = Schlackow (2013)
source_url = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/MSchlackow_polyA/Strandspecific_cycling_N.bw
source_type = bigWig
display = off
colour = #263B21
I've just noticed there are about 30 other tracks that are in the config file but are disabled. They're all from "Wilhelm (2008)". They aren't visible in the Ensembl Genomes browser either. Does anyone know the story behind that?
They look like:
[ERS078438]
source_name = Transcripts - RNA-Seq - run34_s3 (ERS078438) - Wilhelm (2008)
caption = Wilhelm (2008)
description = <br>High throughput sequenceing of fission yeast to survey the dynamic repertoire of a eukaryotic transcriptome at the single nucleotide resolution (Study <a href="http://www.ebi.ac.uk/ena/data/view/ERP001075">ERP001075</a>)<br>EuropePMC: <a href="http://europepmc.org/abstract/MED/18488015">Wilhelm 2012</a><br>ArrayExpress: <a href="https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5/">E-MTAB-5</a>
source_url = http://ftp.ebi.ac.uk/pub/databases/pombase/DATASET/ERP001075/ERS078438.bw
source_type = bigWig
display = off
colour = #263B21
I'm writing a bit of Perl that will parse the whole Ensembl config and create a table
I've had a go at that. I've pulled out what I can and tried to break the different bits into separate columns. It's in Dropbox: Dropbox/pombase/PomBase_website/ensembl_track_config.tsv
I'm likely to overwrite that file as I tweak the Perl script so if you edit that file the changes will be lost.
Once we're happy with what we've got from the Ensembl config it will be quite straightforward to generate the JBrowse track configuration.
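A rough sketch of that generation step, assuming the JBrowse 1 `trackList.json` layout with `label`, `key`, `storeClass`, `type`, `urlTemplate` and `metadata` fields. The `short_name` column is hypothetical (a shorter display name that someone would have to write by hand), and only bigWig rows are handled here.

```python
import json


def jbrowse_track(row):
    """Build one JBrowse track entry from a row of the track table."""
    return {
        "label": row["track_id"],    # internal ID, e.g. GSE24360_01
        "key": row["short_name"],    # short display name (hypothetical column)
        "storeClass": "JBrowse/Store/SeqFeature/BigWig",
        "type": "JBrowse/View/Track/Wiggle/XYPlot",
        "urlTemplate": row["source_url"],
        # Longer text intended for the "About this track" pop-up.
        "metadata": {"description": row["source_name"]},
    }


def tracklist_json(rows):
    """Return trackList.json text for every bigWig row; other types are skipped."""
    tracks = [jbrowse_track(row) for row in rows
              if row["source_type"] == "bigWig"]
    return json.dumps({"tracks": tracks}, indent=2)
```

BAM and BED tracks would need their own store and track classes, so they are left out of the sketch rather than guessed at.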
The "description" from the Ensembl config is HTML. I've attempted to pull the useful stuff out into the columns labelled:
first_author_and_date
That's now two columns:
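As a separate illustration of pulling fields out of the HTML "description", the sketch below extracts each "Name: link" pair with a regular expression and derives the PMID from the EuropePMC URL. It is not the Perl script itself. It assumes descriptions follow the pattern in the examples quoted in this thread, and `parse_description` and `pubmed_id` are made-up names.

```python
import re

# Matches e.g.  GEO: <a href="...">GSE24360</a>
LINK_RE = re.compile(r'([\w ]+):\s*<a href="([^"]+)">(.*?)</a>')
TAG_RE = re.compile(r"<[^>]+>")


def parse_description(description):
    """Return {database_name: (url, link_text)} for each 'Name: <a ...>' link."""
    links = {}
    for name, url, text in LINK_RE.findall(description):
        # Strip nested tags such as <em> from the link text.
        links[name.strip()] = (url, TAG_RE.sub("", text))
    return links


def pubmed_id(links):
    """Return the numeric ID at the end of the EuropePMC 'MED' URL, if present."""
    url = links.get("EuropePMC", ("", ""))[0]
    match = re.search(r"/MED/(\d+)$", url)
    return match.group(1) if match else None
```

Links that are not preceded by a "Name:" label, such as the "Study" link in the Wilhelm descriptions, are not picked up by this pattern and would need handling separately.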
Great, this is the info I was trying to collate here.
https://docs.google.com/spreadsheets/d/13STLSIvYcqKVaFxz_g3XKOkd-huwbu8qrdQWn9jVN8w/edit#gid=0
If you can populate as many as possible, @Antonialock will probably only need to populate the track description manually.
Two sections are commented out. Should I include these in the table I make?
That accounts for why I could not match up the tracks with the directory. I think it was because some tracks were redundant. However, I don't think these were the correct ones to hide. It's all a bit confusing because the names don't really make sense. Some are described as strand specific and some as all data, but some described as "all data" have a forward or reverse strand caption.
I suggest, for now, we include them all. Once we have the setup we can ask the authors to check. Or @Antonialock can check with them as she sanity-checks that the descriptions in the spreadsheet are correct.
I think the Wilhelm dataset was one we were trying to get hosted next (I was trying to get all of the transcriptome data next so we could assess the best UTR predictions), but it got sidelined because of the Drupal migration... Include it and see what happens.
but other IDs in one column, comma separated?
It's easier to get link-outs working from JBrowse if we keep them separate.
Ignore variation data for now.
JBrowse can handle VCF files so it should be able to do something with the variation data.
Once we have the set up we can ask the authors to check.
Could we ask the authors to contribute descriptions for the files/tracks in the cases where we have trouble?
Sorry, that wasn't clear. We could probably just use one database ID per dataset. Things in ArrayExpress have a GEO ID and vice versa, and an ENA ID (for AE) or an SRA ID (for GEO). I think maybe we only need one accession, and that should always be the AE or GEO one, wherever the dataset was submitted? We should report the GEO/AE accession, which holds all of the metadata, not the ENA/SRA ones, which are the nucleotide repositories.
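If we go with one accession per dataset, the choice could be automated along these lines. This sketch prefers GEO or ArrayExpress, falls back to ENA or SRA, and reports the result in the DB:accession format suggested earlier. The regular expressions and database prefixes are assumptions based on the accessions that appear in this thread (GSE24360, E-MTAB-5, ERP001075), not a complete list.

```python
import re

# Assumed accession patterns; anything beyond the examples in this thread
# is a guess. Databases holding the metadata come first.
PREFERRED = [
    ("GEO", re.compile(r"^GSE\d+$")),
    ("ArrayExpress", re.compile(r"^E-[A-Z]{4}-\d+$")),
]
FALLBACK = [
    ("ENA", re.compile(r"^ER[PSRX]\d+$")),
    ("SRA", re.compile(r"^SR[PSRX]\d+$")),
]


def primary_accession(accessions):
    """Pick one accession for a dataset and return it as 'DB:accession'.

    GEO/ArrayExpress win over ENA/SRA; returns None if nothing is recognised.
    """
    for database, pattern in PREFERRED + FALLBACK:
        for accession in accessions:
            if pattern.match(accession):
                return f"{database}:{accession}"
    return None
```

For example, `primary_accession(["ERP001075", "E-MTAB-5"])` gives `"ArrayExpress:E-MTAB-5"`, so the ENA study ID is only used when no GEO or ArrayExpress accession is known.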
Just to make sure, I'm waiting for a list of tracks that I will then add missing descriptions to?
Wait until Kim has got as much data as possible into the spreadsheet automatically.
Then you will need to add any missing data/fields (it might only be tweaking the track display name if you are lucky).
goodie goodie (I'd be surprised :p )
wait until Kim has got as much data into the spreadsheet automated.
I think I've extracted all I can from the Ensembl config into here: Dropbox/pombase/PomBase_website/ensembl_track_config.tsv
Perhaps we could add any extra information as columns in that file? Some of the information is duplicated so we can probably remove some columns too.
Might be good to talk about that on Thursday. Once we add a bit more to that file I'll write a little script to create the JBrowse configuration from it.
OK we can go through this tomorrow. It seems that you have made good progress.
I think we are all in here for this: https://github.com/pombase/website/issues/691
List of how to refer to specific data set types
CV of assay types
Old to new file name conversion
file naming convention datatype_author_studyID_PMID
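A small sketch of how the datatype_author_studyID_PMID convention could be applied consistently. The helper names are made up, and replacing unsafe characters (spaces, colons and so on) with hyphens is an assumption rather than something agreed in this thread.

```python
import re


def _safe(part):
    """Replace runs of characters other than letters, digits, '.' or '-' with '-'."""
    return re.sub(r"[^A-Za-z0-9.-]+", "-", str(part)).strip("-")


def track_file_name(datatype, author, study_id, pmid, extension):
    """Build a file name of the form datatype_author_studyID_PMID_<number>.<ext>."""
    fields = [_safe(datatype), _safe(author), _safe(study_id),
              "PMID_" + _safe(pmid)]
    return "_".join(fields) + "." + extension
```

For example, `track_file_name("chromatin binding", "Woolcock", "GSE24360", 21151114, "bw")` returns `chromatin-binding_Woolcock_GSE24360_PMID_21151114.bw`. Note that writing the PMID as PMID_1234 adds an extra underscore, so anything that splits file names on "_" needs to allow for that.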