lgatto / ProteomicsAnnotationHubData

Annotation hub data for proteomics data

MzID file #8

Closed. lgatto closed this issue 9 years ago.

lgatto commented 9 years ago
class: AnnotationHubMetadata
AnnotationHubRoot: /var/FastRWeb/web
BiocVersion: 3.2
Coordinate_1_based: TRUE
DataProvider: PRIDE
DerivedMd5: NA
Description: Four human TMT spiked-in proteins in an Erwinia
  carotovora background. Expected reporter ion ratios: Erwinia peptides:
  1:1:1:1:1:1; Enolase spike (sp|P00924|ENO1_YEAST): 10:5:2.5:1:2.5:10;
  BSA spike (sp|P02769|ALBU_BOVIN): 1:2.5:5:10:5:1; PhosB spike
  (sp|P00489|PYGM_RABIT): 2:2:2:2:1:1; Cytochrome C spike
  (sp|P62894|CYC_BOVIN): 1:1:1:1:1:2.
DispatchClass: mzRident
Genome: NA
Location_Prefix: http://s3.amazonaws.com/annotationhub/
Maintainer: Laurent Gatto <lg390@cam.ac.uk>
Notes: NA
PreparerClass: PXD000001MzidToMzRidentPreparer
RDataClass: mzRident
RDataDateAdded: 2015-07-30
RDataPath:
  pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
Recipe: NA
SourceLastModifiedDate: NA
SourceMd5: NA
SourceSize: NA
SourceType: mzid
SourceUrl:
  http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/
SourceVersion: 2015-07-29 23:58:20
Species: Erwinia carotovora
Tags: Proteomics TMT6 LTQ Orbitrap Velos PMID:23692960
TaxonomyId: 554
Title: PXD000001: MS-GF+ identification data

I think that the SourceUrl is missing the filename because the file is not yet on the server, or is simply not accessible.
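
For context, the record above corresponds to what the AnnotationHubMetadata() constructor in AnnotationHubData produces. A minimal sketch of such a call, assuming the standard constructor arguments (field names taken from the dump above; the PXD000001MzidToMzRidentPreparer fills several of these in automatically, so this is illustrative only):

## Minimal sketch, assuming the standard AnnotationHubMetadata() arguments;
## values are copied from the metadata record shown above.
library(AnnotationHubData)
meta <- AnnotationHubMetadata(
    Title = "PXD000001: MS-GF+ identification data",
    Description = "Four human TMT spiked-in proteins in an Erwinia carotovora background.",
    DataProvider = "PRIDE",
    Species = "Erwinia carotovora",
    TaxonomyId = 554L,
    Genome = NA_character_,
    SourceType = "mzid",
    SourceUrl = "http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/",
    SourceVersion = "2015-07-29 23:58:20",
    RDataClass = "mzRident",
    DispatchClass = "mzRident",
    RDataPath = "pride/data/archive/2012/03/PXD000001/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid",
    Location_Prefix = "http://s3.amazonaws.com/annotationhub/",
    Maintainer = "Laurent Gatto <lg390@cam.ac.uk>",
    Coordinate_1_based = TRUE,
    Tags = c("Proteomics", "TMT6", "LTQ Orbitrap Velos", "PMID:23692960"),
    Recipe = NA_character_)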

sonali-bioc commented 9 years ago

This recipe looks good too!

> tail(ah)
AnnotationHub with 6 records
# snapshotDate(): 2015-07-30
# $dataprovider: PRIDE, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Erwinia carotovora, Lactobacillus jensenii_JV-V16, Methanocaldoc...
# $rdataclass: OrgDb, AAStringSet, MSnSet, mzRident, mzRpwiz
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype
# retrieve records with, e.g., 'object[["AH49004"]]'

            title
  AH49004 | org.Methanocaldococcus_infernus_ME.eg.sqlite
  AH49005 | org.Lactobacillus_jensenii_JV-V16.eg.sqlite
  AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
  AH49007 | PXD000001: Peptide-level quantitation data
  AH49008 | PXD000001: raw mass spectrometry data
  AH49009 | PXD000001: MS-GF+ identification data
> mzid <- ah[['AH49009']]
retrieving 1 resource
  |======================================================================| 100%
There were 50 or more warnings (use warnings() to see the first 50)
> class(mzid)
[1] "mzRident"
attr(,"package")
[1] "mzR"
> mzid
Identification file handle.
Filename:  55315
Number of psms:  5759
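
For completeness, the same record can also be pulled out without knowing its AH id, for example by querying on the PXD accession. A short sketch (assuming the AnnotationHub and mzR packages are installed; record ids may differ between snapshots):

## Sketch: locate the identification data by querying on the accession,
## then inspect the peptide-spectrum matches with mzR.
library(AnnotationHub)
ah <- AnnotationHub()
px <- query(ah, c("PXD000001", "mzid"))  # matches title/tags/sourcetype
mzid <- px[[1]]                          # downloads and opens the mzRident handle
library(mzR)
head(psms(mzid))                         # peptide-spectrum matches
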
sonali-bioc commented 9 years ago

The only issue with this one is that the sourceurl doesn't contain the filename.

It doesn't break the recipe, but the metadata looks incomplete without the filename.

> ah['AH49009']$sourceurl
[1] "http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/"
sonali-bioc commented 9 years ago

FYI - just for our future notes - the file was downloaded from here and added to the Amazon S3 machine (which is why Recipe is NA).

lgatto commented 9 years ago

As far as I can see, the SourceUrl is constructed by AnnotationHubData:::.ftpFileInfo based on the filename (currently missing) and the server name and path. I assumed that it was not created properly because the file did not exist, but now it does. However, when debugging AnnotationHubData:::.ftpFileInfo, I see that there is more to it:

Browse[2]> filename
[1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
[...]
Browse[2]> 
debug: allurls <- lapply(url, function(ul) {
    txt <- getURL(ul, dirlistonly = TRUE, curl = curl)
    df2 <- strsplit(txt, "\n")[[1]]
    df2 <- df2[grep(paste0(filename, "$"), df2)]
    drop <- grepl("00-", df2)
    df2 <- df2[!drop]
    temp <- unlist(strsplit(df2, " "))
    df2 <- temp[length(temp)]
    paste0(ul, df2)
})
Browse[2]> ul <- url
Browse[2]> txt <- getURL(ul, dirlistonly = TRUE, curl = curl)
Browse[2]> txt
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>E210AC0E32B4B206</RequestId><HostId>dPW1boesfMdXFU8toTsYBg8mLg1e7NC5V7ZCvOE+d+3jslBKDkZSbYhluhNTVJOrqVMBDc6LVk8=</HostId></Error>"

In other cases, txt contains the list of file names available in the directory; here, the directories on the Amazon S3 instance can't be listed.
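
The asymmetry is easy to check outside the debugger with plain RCurl calls (a sketch, using the prefix and filename from the metadata above; behaviour depends on the bucket's permissions):

## Sketch: the S3 'directory' refuses listing, but the full file URL is reachable.
library(RCurl)
prefix <- "http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/"
fname  <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
getURL(prefix, dirlistonly = TRUE)   # returns the AccessDenied XML shown above
url.exists(paste0(prefix, fname))    # TRUE when the object itself is readable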

Did you already encounter that situation? Would I need to hardcode SourceUrl in such cases?

sonali-bioc commented 9 years ago

Yes, the Amazon machine cannot be accessed using .ftpFileInfo(), so you should hardcode this filename and not run the .ftpFileInfo() function on any file hosted on the Amazon S3 machine.
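
In other words, something along these lines in the preparer should do the job (a sketch only; the actual fix may look different):

## Sketch: build SourceUrl by hand from the known S3 prefix and filename,
## instead of calling AnnotationHubData:::.ftpFileInfo() on the bucket.
baseUrl  <- "http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/"
filename <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
sourceUrl <- paste0(baseUrl, filename)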

lgatto commented 9 years ago

> Yes, the Amazon machine cannot be accessed using .ftpFileInfo(), so you should hardcode this filename and not run the .ftpFileInfo() function on any file hosted on the Amazon S3 machine.

This should be fixed in commit b7ee542d43a1d985b7eb1da80611bc2dd4e60b18.