Closed lgatto closed 9 years ago
This recipe looks good too!
> tail(ah)
AnnotationHub with 6 records
# snapshotDate(): 2015-07-30
# $dataprovider: PRIDE, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Erwinia carotovora, Lactobacillus jensenii_JV-V16, Methanocaldoc...
# $rdataclass: OrgDb, AAStringSet, MSnSet, mzRident, mzRpwiz
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
# sourcetype
# retrieve records with, e.g., 'object[["AH49004"]]'
title
AH49004 | org.Methanocaldococcus_infernus_ME.eg.sqlite
AH49005 | org.Lactobacillus_jensenii_JV-V16.eg.sqlite
AH49006 | PXD000001: Erwinia carotovora and spiked-in protein fasta file
AH49007 | PXD000001: Peptide-level quantitation data
AH49008 | PXD000001: raw mass spectrometry data
AH49009 | PXD000001: MS-GF+ identiciation data
> mzid <- ah[['AH49009']]
retrieving 1 resource
|======================================================================| 100%
There were 50 or more warnings (use warnings() to see the first 50)
> class(mzid)
[1] "mzRident"
attr(,"package")
[1] "mzR"
> mzid
Identification file handle.
Filename: 55315
Number of psms: 5759
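As a quick sanity check on the returned handle, the identification data can be pulled out with the mzR accessors (a sketch, assuming `psms()` and `softwareInfo()` behave as in current mzR releases; `mzid` is the `mzRident` handle retrieved above):

```r
library(mzR)

id <- psms(mzid)     ## data.frame with one row per peptide-spectrum match
nrow(id)             ## should match the reported number of psms
head(id$sequence)    ## peptide sequences
softwareInfo(mzid)   ## search engine used for the identification
```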
The only issue with this one is that the sourceurl doesn't contain the filename.
It doesn't break the recipe, but the metadata looks incomplete without the filename.
> ah['AH49009']$sourceurl
[1] "http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/"
FYI - just for our future notes - the file was downloaded from here and added to the Amazon S3 machine (which is why recipe=NA).
As far as I can see, the SourceUrl is constructed by AnnotationHubData:::.ftpFileInfo based on the filename (currently missing) and the server name + path. I assumed that it was not created properly because the file did not exist, but now it does. However, when debugging AnnotationHubData:::.ftpFileInfo, I see that there seems to be more to it:
Browse[2]> filename
[1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
[...]
Browse[2]>
debug: allurls <- lapply(url, function(ul) {
txt <- getURL(ul, dirlistonly = TRUE, curl = curl)
df2 <- strsplit(txt, "\n")[[1]]
df2 <- df2[grep(paste0(filename, "$"), df2)]
drop <- grepl("00-", df2)
df2 <- df2[!drop]
temp <- unlist(strsplit(df2, " "))
df2 <- temp[length(temp)]
paste0(ul, df2)
})
Browse[2]> ul <- url
Browse[2]> txt <- getURL(ul, dirlistonly = TRUE, curl = curl)
Browse[2]> txt
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>E210AC0E32B4B206</RequestId><HostId>dPW1boesfMdXFU8toTsYBg8mLg1e7NC5V7ZCvOE+d+3jslBKDkZSbYhluhNTVJOrqVMBDc6LVk8=</HostId></Error>"
In other cases, txt contains the list of file names available in the directory, but directories on the Amazon S3 instance can't be listed.
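For comparison, a minimal sketch of the working case, using the NCBI FTP server that appears in the hub metadata above (assuming it still permits anonymous directory listing):

```r
library(RCurl)

## Against a plain FTP directory, dirlistonly returns newline-separated
## file names, which .ftpFileInfo then greps for the target filename:
txt <- getURL("ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/", dirlistonly = TRUE)
head(strsplit(txt, "\n")[[1]])

## Against the S3 bucket, the same call returns the AccessDenied XML
## shown above, so there is never a file list to match against.
```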
Did you already encounter that situation? Would I need to hardcode SourceUrl
in such cases?
Yes, the amazon machine cannot be accessed using .ftpFileInfo() - so you should hardcode this fileName and not run the .ftpFileInfo() function on any file on the amazon s3 machine.
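Concretely, hardcoding means assembling the full URL (directory plus file name) yourself instead of deriving it via `.ftpFileInfo()`. A sketch, combining the S3 directory from the metadata with the file name seen in the debug session above (variable names here are illustrative, not the recipe's actual field names):

```r
## Directory on the S3 machine and the known file name:
baseUrl  <- "http://s3.amazonaws.com/annotationhub/pride/data/archive/2012/03/PXD000001/"
fileName <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"

## Hardcoded SourceUrl, used in place of a .ftpFileInfo() call:
sourceUrl <- paste0(baseUrl, fileName)
```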
This should be fixed in commit b7ee542d43a1d985b7eb1da80611bc2dd4e60b18.
I think that the SourceUrl is missing the filename because it is not yet on the server, or just not accessible.