lgatto / ProteomicsAnnotationHubData

Annotation hub data for proteomics data

Testing version 0.99 #4

Closed lgatto closed 9 years ago

lgatto commented 9 years ago

@sonali-bioc, could you test the latest version of the package?

All the action happens in PXD000001.R, with helper functions in utils.R. To add a new dataset, the basic idea is to create a list of AnnotationHub metadata, which is then updated and checked by helper functions and, eventually, used to create the makePXD000001___ functions that are passed as arguments to makeAnnotationHubResource.
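
As a rough illustration of that idea (not the package's actual code), the sketch below builds one metadata list and runs it through a small checker. The field names mirror AnnotationHubMetadata slots and the values come from the PXD000001 FASTA record, but `checkPXDMetadata`, the title and the set of required fields are hypothetical.

```r
# Hypothetical metadata list for one PXD000001 file. Field names mirror
# AnnotationHubMetadata slots; the Title is only illustrative.
pxdFasta <- list(
    Title = "erwinia_carotovora.fasta",
    Species = "Erwinia carotovora",
    TaxonomyId = 554L,
    DataProvider = "PRIDE",
    SourceType = "FASTA",
    RDataClass = "AAStringSet",
    SourceUrl = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/erwinia_carotovora.fasta"
)

# Hypothetical helper: stop if a required field is absent, NA, or not scalar.
checkPXDMetadata <- function(x,
                             required = c("Title", "Species", "TaxonomyId",
                                          "DataProvider", "SourceType",
                                          "RDataClass", "SourceUrl")) {
    absent <- setdiff(required, names(x))
    if (length(absent))
        stop("missing metadata fields: ", paste(absent, collapse = ", "))
    bad <- vapply(x[required],
                  function(v) length(v) != 1L || is.na(v),
                  logical(1))
    if (any(bad))
        stop("empty metadata fields: ", paste(required[bad], collapse = ", "))
    invisible(TRUE)
}

checkPXDMetadata(pxdFasta)
```

A checked list like this would then be the input from which the real helper functions build the metadata object.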

I will be working on the vignette tomorrow.

sonali-bioc commented 9 years ago

@lgatto, yes I will! Probably late today or tomorrow morning.

sonali-bioc commented 9 years ago

Dear @lgatto, I tested the 0.99.0 version and we have quite a few issues. Here is the output:

> options(ANNOTATION_HUB_URL="http://gamay:9393")
> library(AnnotationHub)
> ah = AnnotationHub()
snapshotDate(): 2015-07-29
> length(ah)
[1] 34810
> tail(ah)
AnnotationHub with 6 records
# snapshotDate(): 2015-07-29
# $dataprovider: PRIDE, dbSNP, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Erwinia carotovora, Homo sapiens, Lactobacillus jensenii_JV-V16
# $rdataclass: VcfFile, AAStringSet, MSnSet, OrgDb, mzRident, mzRpwiz
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH49005"]]'

            title
  AH49005 | org.Lactobacillus_jensenii_JV-V16.eg.sqlite
  AH49006 | common_all_20150603_papu.vcf.gz
  AH49007 | Four human TMT spliked-in proteins in an Erwinia carotovora back...
  AH49008 | Four human TMT spliked-in proteins in an Erwinia carotovora back...
  AH49009 | Four human TMT spliked-in proteins in an Erwinia carotovora back...
  AH49010 | Four human TMT spliked-in proteins in an Erwinia carotovora back...

Four files were added at the bottom; you can see them with tail(ah).

Suggestion: the title is the same for all four of them, so it is very tough to figure out which is which. I suggest moving whatever is in the title field to the description field, in addition to what is already there. The usual convention for the title field is the filename, for example:

> gtfFiles <- query(ah , c("ensembl", "GTF", "homo sapiens"))
> gtfFiles
AnnotationHub with 13 records
# snapshotDate(): 2015-07-29
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH7558"]]'

            title
  AH7558  | Homo_sapiens.GRCh37.70.gtf
  AH7619  | Homo_sapiens.GRCh37.69.gtf
  AH7666  | Homo_sapiens.GRCh37.71.gtf
  AH7726  | Homo_sapiens.GRCh37.72.gtf
  AH7790  | Homo_sapiens.GRCh37.73.gtf
  ...       ...
  AH28674 | Homo_sapiens.GRCh38.76.gtf
  AH28743 | Homo_sapiens.GRCh38.79.gtf
  AH28812 | Homo_sapiens.GRCh38.77.gtf
  AH47066 | Homo_sapiens.GRCh38.80.gtf
  AH47963 | Homo_sapiens.GRCh38.81.gtf

> gtfFiles$description
 [1] "Gene Annotation for Homo sapiens" "Gene Annotation for Homo sapiens"
 [3] "Gene Annotation for Homo sapiens" "Gene Annotation for Homo sapiens"
 [5] "Gene Annotation for Homo sapiens" "Gene Annotation for Homo sapiens"
 [7] "Gene Annotation for Homo sapiens" "Gene Annotation for Homo sapiens"
 [9] "Gene Annotation for Homo sapiens" "Gene Annotation for Homo sapiens"
[11] "Gene Annotation for Homo sapiens" "Gene Annotation for Homo sapiens"
[13] "Gene Annotation for Homo sapiens"

> gtfFiles$title
 [1] "Homo_sapiens.GRCh37.70.gtf" "Homo_sapiens.GRCh37.69.gtf"
 [3] "Homo_sapiens.GRCh37.71.gtf" "Homo_sapiens.GRCh37.72.gtf"
 [5] "Homo_sapiens.GRCh37.73.gtf" "Homo_sapiens.GRCh37.74.gtf"
 [7] "Homo_sapiens.GRCh37.75.gtf" "Homo_sapiens.GRCh38.78.gtf"
 [9] "Homo_sapiens.GRCh38.76.gtf" "Homo_sapiens.GRCh38.79.gtf"
[11] "Homo_sapiens.GRCh38.77.gtf" "Homo_sapiens.GRCh38.80.gtf"
[13] "Homo_sapiens.GRCh38.81.gtf"

I didn't get more creative with my description field for the GTF files, but I could have. Basically, the title shows up in the show() method, and just by looking at it the user has an idea of which file they are getting.
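
One way to follow the filename convention is to derive each title from the last path component of the source URL. This is a minimal base R sketch using two of the PXD000001 source URLs; the variable names are only illustrative.

```r
# Two source URLs from the PXD000001 records.
sourceurls <- c(
    "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt",
    "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/erwinia_carotovora.fasta"
)

# basename() keeps only the file name, giving a distinct title per record.
titles <- basename(sourceurls)
titles
```

The longer experiment description could then live in the description field only.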

Your thoughts?

FYI, this is the metadata produced for all 4 files using inst/script/testing.R and inserted into the AnnotationHub database:

> prot1
[[1]]
class: AnnotationHubMetadata
AnnotationHubRoot: /var/FastRWeb/web
BiocVersion: 3.2
Coordinate_1_based: TRUE
DataProvider: PRIDE
DerivedMd5: NA
Description: Expected reporter ion ratios: Erwinia peptides:
  1:1:1:1:1:1; Enolase spike (sp|P00924|ENO1_YEAST): 10:5:2.5:1:2.5:10;
  BSA spike (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1; PhosB spike
  (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike
  (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.
DispatchClass: mzRpwiz
Genome: NA
Location_Prefix: ftp://ftp.pride.ebi.ac.uk/
Maintainer: Laurent Gatto <lg390@cam.ac.uk>
Notes: NA
PreparerClass: PXD000001MzMLToMzRPwizPreparer
RDataClass: mzRpwiz
RDataDateAdded: 2015-07-29
RDataPath:
  pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
Recipe: NA
SourceLastModifiedDate: NA
SourceMd5: NA
SourceSize: NA
SourceType: mzML
SourceUrl:
  ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
SourceVersion: 2015-01-16 07:51:22
Species: Erwinia carotovora
Tags: Proteomics TMT6 LTQ Orbitrap Velos PMID:23692960
TaxonomyId: 554
Title: Four human TMT spliked-in proteins in an Erwinia carotovora
  background

> prot2
[[1]]
class: AnnotationHubMetadata
AnnotationHubRoot: /var/FastRWeb/web
BiocVersion: 3.2
Coordinate_1_based: TRUE
DataProvider: PRIDE
DerivedMd5: NA
Description: Expected reporter ion ratios: Erwinia peptides:
  1:1:1:1:1:1; Enolase spike (sp|P00924|ENO1_YEAST): 10:5:2.5:1:2.5:10;
  BSA spike (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1; PhosB spike
  (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike
  (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.
DispatchClass: MSnSet
Genome: NA
Location_Prefix: ftp://ftp.pride.ebi.ac.uk/
Maintainer: Laurent Gatto <lg390@cam.ac.uk>
Notes: NA
PreparerClass: PXD000001MzTabToMSnSetPreparer
RDataClass: MSnSet
RDataDateAdded: 2015-07-29
RDataPath: pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt
Recipe: ProteomicsAnnotationHubData:::PXD00001MzTabToMSnSet
SourceLastModifiedDate: NA
SourceMd5: NA
SourceSize: NA
SourceType: mzTab
SourceUrl:
  ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt
SourceVersion: 2012-03-15 16:00:19
Species: Erwinia carotovora
Tags: Proteomics TMT6 LTQ Orbitrap Velos PMID:23692960
TaxonomyId: 554
Title: Four human TMT spliked-in proteins in an Erwinia carotovora
  background

> prot3
[[1]]
class: AnnotationHubMetadata
AnnotationHubRoot: /var/FastRWeb/web
BiocVersion: 3.2
Coordinate_1_based: TRUE
DataProvider: PRIDE
DerivedMd5: NA
Description: Expected reporter ion ratios: Erwinia peptides:
  1:1:1:1:1:1; Enolase spike (sp|P00924|ENO1_YEAST): 10:5:2.5:1:2.5:10;
  BSA spike (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1; PhosB spike
  (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike
  (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.
DispatchClass: mzRident
Genome: NA
Location_Prefix: http://s3.amazonaws.com/annotationhub/
Maintainer: Laurent Gatto <lg390@cam.ac.uk>
Notes: NA
PreparerClass: PXD000001MzidToMzRidentPreparer
RDataClass: mzRident
RDataDateAdded: 2015-07-29
RDataPath:
  pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
Recipe: NA
SourceLastModifiedDate: NA
SourceMd5: NA
SourceSize: NA
SourceType: mzid
SourceUrl:
  http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/
SourceVersion: 2015-07-29 17:07:53
Species: Erwinia carotovora
Tags: Proteomics TMT6 LTQ Orbitrap Velos PMID:23692960
TaxonomyId: 554
Title: Four human TMT spliked-in proteins in an Erwinia carotovora
  background

> prot4
[[1]]
class: AnnotationHubMetadata
AnnotationHubRoot: /var/FastRWeb/web
BiocVersion: 3.2
Coordinate_1_based: TRUE
DataProvider: PRIDE
DerivedMd5: NA
Description: Expected reporter ion ratios: Erwinia peptides:
  1:1:1:1:1:1; Enolase spike (sp|P00924|ENO1_YEAST): 10:5:2.5:1:2.5:10;
  BSA spike (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1; PhosB spike
  (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike
  (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.
DispatchClass: AAStringSet
Genome: NA
Location_Prefix: ftp://ftp.pride.ebi.ac.uk/
Maintainer: Laurent Gatto <lg390@cam.ac.uk>
Notes: NA
PreparerClass: PXD000001MzMLToAAStringSetPreparer
RDataClass: AAStringSet
RDataDateAdded: 2015-07-29
RDataPath:
  pride/data/archive/2012/03/PXD000001/erwinia_carotovora.fasta
Recipe: NA
SourceLastModifiedDate: NA
SourceMd5: NA
SourceSize: NA
SourceType: FASTA
SourceUrl:
  ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/erwinia_carotovora.fasta
SourceVersion: 2012-03-15 16:00:17
Species: Erwinia carotovora
Tags: Proteomics TMT6 LTQ Orbitrap Velos PMID:23692960
TaxonomyId: 554
Title: Four human TMT spliked-in proteins in an Erwinia carotovora
  background
sonali-bioc commented 9 years ago

@lgatto, continuing with how the files look, here is the first file and its metadata:

> testfile1 <- ah['AH49007']
> testfile1
AnnotationHub with 1 record
# snapshotDate(): 2015-07-29
# names(): AH49007
# $dataprovider: PRIDE
# $species: Erwinia carotovora
# $rdataclass: mzRpwiz
# $title: Four human TMT spliked-in proteins in an Erwinia carotovora backgr...
# $description: Expected reporter ion ratios: Erwinia peptides: 1:1:1:1:1:1;...
# $taxonomyid: 554
# $genome: NA
# $sourcetype: mzML
# $sourceurl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001...
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: Proteomics, TMT6, LTQ Orbitrap Velos, PMID:23692960
# retrieve record with 'object[["AH49007"]]'
> testfile1data <- ah[['AH49007']]
require(“mzR”)
retrieving 1 resource
  |======================================================================| 100%

> testfile1data
Mass Spectrometry file handle.
Filename:  55314
Number of scans:  7534
sonali-bioc commented 9 years ago

Oops - closed that by mistake. Sorry!

sonali-bioc commented 9 years ago

@lgatto, for the second file (the MSnSet file) I get the following error:


> ah['AH49008']
AnnotationHub with 1 record
# snapshotDate(): 2015-07-29
# names(): AH49008
# $dataprovider: PRIDE
# $species: Erwinia carotovora
# $rdataclass: MSnSet
# $title: Four human TMT spliked-in proteins in an Erwinia carotovora backgr...
# $description: Expected reporter ion ratios: Erwinia peptides: 1:1:1:1:1:1;...
# $taxonomyid: 554
# $genome: NA
# $sourcetype: mzTab
# $sourceurl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001...
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: Proteomics, TMT6, LTQ Orbitrap Velos, PMID:23692960
# retrieve record with 'object[["AH49008"]]'

> testfile2data <- ah[['AH49008']]
Error: failed to create 'AnnotationHubResource' instance
  name: AH49008
  title: Four human TMT spliked-in proteins in an Erwinia carotovora background
  reason: “MSnSetResource” is not a defined class

The reason for the error is that there is no setMethod() inside inst/scripts/get1Methods.R for MSnSetResource. Thus, I added the following one, based on my limited knowledge:


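setClass("MSnSetResource", contains="AnnotationHubResource")
setMethod(".get1", "MSnSetResource",
    function(x, ...)
{
    ## path of the locally cached resource file
    yy <- cache(.hub(x))
    .require("MSnbase")
    ## load() returns the name of the restored object, so wrap it in
    ## get() to return the object itself
    get(load(yy))
})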

For this type of file, we want the file to be downloaded from the PRIDE server and pre-processed; the pre-processed product should then be uploaded to Amazon S3, so that the user gets the object directly from Amazon S3.

For this to happen:
i) the recipe argument should not be NA (which it isn't);
ii) the location_prefix should be the amazonBaseUrl;
iii) the rdatapath should be whatever directory structure you want after amazonBaseUrl on Amazon S3;
iv) the sourceurl should be the actual URL on PRIDE's server.

But if you look at the metadata created above for prot2, we have everything correct except the Location_Prefix:

Recipe: ProteomicsAnnotationHubData:::PXD00001MzTabToMSnSet
Location_Prefix: ftp://ftp.pride.ebi.ac.uk/
RDataPath: pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt
SourceUrl:
  ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt

Once we fix that, this recipe should work!
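
Three of the four conditions can be written down as a small consistency check (the RDataPath layout is a free choice, so it is not checked). This is a sketch only: `amazonBaseUrl` and `prideBaseUrl` are given the values seen in the metadata above, and `checkPreprocessed` is a hypothetical helper, not part of AnnotationHubData.

```r
amazonBaseUrl <- "http://s3.amazonaws.com/annotationhub/"
prideBaseUrl <- "ftp://ftp.pride.ebi.ac.uk/"

# Hypothetical check for a resource that is pre-processed and served from S3.
checkPreprocessed <- function(md) {
    c(recipe = !is.na(md$Recipe),
      location_prefix = identical(md$Location_Prefix, amazonBaseUrl),
      sourceurl = startsWith(md$SourceUrl, prideBaseUrl))
}

# prot2 as currently generated.
prot2 <- list(
    Recipe = "ProteomicsAnnotationHubData:::PXD00001MzTabToMSnSet",
    Location_Prefix = "ftp://ftp.pride.ebi.ac.uk/",
    RDataPath = "pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt",
    SourceUrl = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/F063721.dat-mztab.txt"
)

# Named logical vector; FALSE marks a field that needs fixing.
checkPreprocessed(prot2)
```

With the current prot2 metadata, only the location_prefix entry comes back FALSE.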

sonali-bioc commented 9 years ago

@lgatto, for the third file I get the following error:

> testfile3 <- ah['AH49009']
> testfile3
AnnotationHub with 1 record
# snapshotDate(): 2015-07-29
# names(): AH49009
# $dataprovider: PRIDE
# $species: Erwinia carotovora
# $rdataclass: mzRident
# $title: Four human TMT spliked-in proteins in an Erwinia carotovora backgr...
# $description: Expected reporter ion ratios: Erwinia peptides: 1:1:1:1:1:1;...
# $taxonomyid: 554
# $genome: NA
# $sourcetype: mzid
# $sourceurl: http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/...
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: Proteomics, TMT6, LTQ Orbitrap Velos, PMID:23692960
# retrieve record with 'object[["AH49009"]]'
> testfile3$sourceurl
[1] "http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/"

> testfile3data <- ah[['AH49009']]
retrieving 1 resource
Downloading: 240 B
Error: failed to load 'AnnotationHub' resource
  name: AH49009
  title: Four human TMT spliked-in proteins in an Erwinia carotovora background
  reason: 1 resources failed to download
In addition: There were 11 warnings (use warnings() to see them)

The reason for the error is that the sourceurl, rdatapath and location_prefix for this file are messed up.

Here, ideally we want:
i) the recipe argument should be NA;
ii) the location_prefix should be the prideBaseUrl;
iii) the rdatapath should be gsub(prideBaseUrl, "", sourceurl);
iv) the sourceurl should be the actual URL on PRIDE's server.

But if you look at the metadata created above for prot3, we have quite a few issues.

Location_Prefix: http://s3.amazonaws.com/annotationhub/
RDataPath:
  pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
SourceUrl:
  http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/
Recipe: NA

The sourceurl is incorrect (it doesn't contain the file name) and the location_prefix points to Amazon S3 instead of PRIDE.

Once we fix that, this recipe should work!
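
For a file served directly from PRIDE, the rdatapath can be derived from the source URL alone, following the gsub() suggestion above. This is a minimal sketch; since the full mzid URL is not shown here, the FASTA record is used as the example.

```r
prideBaseUrl <- "ftp://ftp.pride.ebi.ac.uk/"
sourceurl <- paste0(prideBaseUrl,
    "pride/data/archive/2012/03/PXD000001/erwinia_carotovora.fasta")

# fixed = TRUE treats the base URL as a literal string, so the dots in the
# host name are not interpreted as regular-expression wildcards.
rdatapath <- sub(prideBaseUrl, "", sourceurl, fixed = TRUE)

# Location_Prefix and RDataPath should paste back together into SourceUrl.
identical(paste0(prideBaseUrl, rdatapath), sourceurl)
```

The final comparison is a cheap sanity check that could be run on every record before insertion.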

sonali-bioc commented 9 years ago

@lgatto, for the fourth file: SUCCESS!

> ah['AH49010']
AnnotationHub with 1 record
# snapshotDate(): 2015-07-29
# names(): AH49010
# $dataprovider: PRIDE
# $species: Erwinia carotovora
# $rdataclass: AAStringSet
# $title: Four human TMT spliked-in proteins in an Erwinia carotovora backgr...
# $description: Expected reporter ion ratios: Erwinia peptides: 1:1:1:1:1:1;...
# $taxonomyid: 554
# $genome: NA
# $sourcetype: FASTA
# $sourceurl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001...
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: Proteomics, TMT6, LTQ Orbitrap Velos, PMID:23692960
# retrieve record with 'object[["AH49010"]]'
>
>
> ah[['AH49010']]
require(“Biostrings”)
retrieving 1 resource
  |======================================================================| 100%
  A AAStringSet instance of length 4499
       width seq                                            names
   [1]   147 MADITLISGSTLGSAEYVAEHL...QHQIPEDPAEEWLGSWVNLLK ECA0001 putative ...
   [2]   153 VAEIYQIDNLDRGILSALMENA...EIQSTETLISLQNPIMRTIAP ECA0002 AsnC-fami...
   [3]   330 MKKQYIEKQQQISFVKSFFSSQ...IGQVQCGVWPQPLRESVSGLL ECA0003 putative ...
   [4]   492 MITLESLEMLLSIDENELLDDL...WRFDTGLKSRLMRRWQHGKAY ECA0004 conserved...
   [5]   499 MRQTAALAERISRLSHALEHGL...AKIEASLQQVAEQIQQSEQQD ECA0005 conserved...
   ...   ... ...
[4495]   634 MSDKIIHLTDDSFDTDVLKADG...RRKVDPLRVFASDMARRLELL trx-rv3790 trx-rv...
[4496]    93 MTKMNNKARRTARELKHLGASI...RELRDEFPMGYLGDYKDDDDK TimBlower TimBlower
[4497]   309 MFSNLSKRWAQRTLSKSFYSTA...KFKWAGIKTRKFVFNPPKPRK sp|P07143|CY1_YEA...
[4498]   231 FPTDDDDKIVGGYTCAANSIPY...PGVYTKVCNYVNWIQQTIAAN sp|P00761|TRYP_PI...
[4499]   269 GVSGSCNIDVVCPEGNGHRDVI...DAAGTGAQFIDGLDSTGTPPV sp|Q7M135|LYSC_LY...
sonali-bioc commented 9 years ago

Summarizing things that need to be done for the next version:

1) Make the title of each file contain the file name or something short and meaningful, rather than the same text for all 4 types of files.

2) Is the setMethod() for "MSnSetResource" at "/inst/scripts/get1Methods.R" okay?

3) Correct the sourceurl/rdatapath/location_prefix for

4) Consider pre-processing FASTA files into "AAStringSet" objects beforehand and storing them on Amazon S3? It takes quite some time to download the FASTA file! It will be a much faster experience for the user to get the AAStringSet object directly from Amazon S3. (If we decide to do so, the sourceurl/rdatapath/location_prefix for this case will have to be modified.)

@lgatto - your thoughts?

lgatto commented 9 years ago

Title (point 1 above)

I will definitely update the titles.

MSnSet and mzTab

Re MSnSet types: the get1Method is indeed correct. The difference with mzTab is that an mzTab file first needs to be read in with the readMzTabData function, which generates an MSnSet object. Currently, there is no MSnSet rda file for that project, but I anticipate that there might be in the future, which is why I added the setClass("MSnSetResource", contains="AnnotationHubResource") class definition. The get1 method for that class already existed, and was used for the mzTabResource class. The method you added is a duplicate, as far as I can see. Basically, the MSnSetResource and mzTabResource classes use the same get1 method to load serialised MSnSet objects.

The second file is an mzTab file (tab-delimited) that should be downloaded from the PRIDE server, read into R as an MSnSet and then saved on Amazon S3 for further consumption by the users.
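
The save-then-reload cycle described above can be sketched in base R. A data.frame stands in for the MSnSet so that the example runs without MSnbase; in the real recipe the object would come from readMzTabData, and the file would be uploaded to S3 rather than kept in a temporary file. `loadRda` is a hypothetical name.

```r
# Stand-in for the MSnSet produced by the recipe (assumption: any R object
# is serialised the same way by save()).
msnset <- data.frame(id = 1:3, value = c(0.5, 1, 2))

# Recipe side: serialise the object to an .rda file.
rda <- tempfile(fileext = ".rda")
save(msnset, file = rda)

# Client side: load() returns the name of the restored object, so get() is
# needed to return the object itself. A private environment avoids
# overwriting anything in the caller's workspace.
loadRda <- function(path) {
    env <- new.env()
    get(load(path, envir = env), envir = env)
}

restored <- loadRda(rda)
identical(restored, msnset)
```

Because the loader only relies on save() and load(), the same function can serve any resource class whose payload is a serialised R object.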

Regarding the mzid (third) file (point 2, prot3 above)

That file does not exist on PRIDE and should be manually added to the AH Amazon S3. You can get it here.

I think the reason that the file is missing.

The FASTA (fourth) file (point 4)

I will make the changes to store it as an AAStringSet on the amazon S3.

I'll start with the easy part, the titles. Then, I will check the SourceUrl, RDataPath and Location_Prefix fields in light of your comments above.

sonali-bioc commented 9 years ago

I think it might be easier if we create an issue per file type - less confusing ?

lgatto commented 9 years ago

Good point! Will move these to different issues.

lgatto commented 9 years ago

Closing this issue and proceeding in issues #5, #6, #7, #8 and #9.