rformassspectrometry / MsDataHub

Mass Spectrometry Data on ExperimentHub
https://rformassspectrometry.github.io/MsDataHub/

New AnnotationHubDispatchClassList? #6

Open lgatto opened 1 year ago

lgatto commented 1 year ago

When creating an ExperimentHub package, it is possible to define dispatch classes, so that some file types can be loaded automatically and returned as predefined objects. See AnnotationHub::DispatchClassList():

DispatchClass Reader
FaFile Rsamtools::FaFile(); requires rtracklayer
BamFile Rsamtools::BamFile(); requires rtracklayer
Rds readRDS()
RDS readRDS()
Rda get(load())
data.frame get(load())
GRanges get(load()); requires GenomicRanges
VCF get(load()); requires VariantAnnotation
ChainFile rtracklayer::import.chain(); requires rtracklayer and GenomeInfoDb; before using import.chain internally uses gzfile and writeBin to extract data from file; files saved as chain.gz
TwoBitFile rtracklayer::TwoBitFile(); requires rtracklayer
GFFFile rtracklayer::import(); require rtracklayer and GenomeInfoDb; after import converts to GRanges object
GFF3File rtracklayer::import(); require rtracklayer
BigWig rtracklayer::BigWigFile(); require rtracklayer
dbSNPVCFFile VariantAnnotation::VcfFile(); require VariantAnnotation; files saved as vcf.gz and vcf.gz.tbi
SQLiteFile AnnotationDbi::loadDb(); files saved as sqlite
GRASP dbFileConnect()
Zip unzip(); returns file path to files
ChEA unzip(); returns data.frame from reading chea-background.csv
BioPax get(load()); require rBiopaxParser
Pazar read.delim(); require GenomicRanges; reads specific columns from file and converts to GRanges object
CSVtoGranges read.csv(); require GenomicRanges; converts data.frame to GRanges object
ExpressionSet get(load()); require Biobase
GDS gdsfmt::openfn.gds(); require gdsfmt
H5File require rhdf5; resource downloaded but not loaded; returns file path
FilePath resource downloaded but not loaded; returns file path
BEDFile rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
UCSCBroadPeak rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
UCSCNarrowPeak rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
UCSCBEDRnaElements rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
UCSCGappedPeak rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
EpiMetadata read.delim()
EpiExpressionText read.table(); converts to SummarizedExperiment object
EpichmmModels rtracklayer::import(); calls additional helper AnnotationHub:::.mapAbbr2FullName and then converts to GRanges object; file assumed to be bed file format
EpigenomeRoadmapFile rtracklayer::import(); converts to GRanges object; file assumed to be bed file format
EpigenomeRoadmapNarrowAllPeaks rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
EpigenomeRoadmapNarrowFDR rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
EnsDb ensembldb::EnsDb(); require ensembldb
mzRpwiz mzR::openMSfile(); require mzR
mzRident mzR::openIDfile(); require mzR
MSnSet get(load()); require MSnbase
AAStringSet Biostrings::readAAStringSet(); require Biostrings

For MsDataHub, we want "FilePath", as we want to get the file path and then load the data ourselves. We could also directly get the desired object, for example a Spectra object created by Spectra() if the file is an mzML.

Should we ask to add Spectra (and possibly others such as PSM for mzid files) to the default dispatch classes?

Ping @jorainer

jorainer commented 1 year ago

Or should we go directly for MsExperiment instead? IMO a Spectra object without sample information might not be too useful.

lgatto commented 1 year ago

I suppose you refer to this issue.

But we can't necessarily anticipate what the developer is sharing their data for. And your suggestion requires two inputs (an mzML and the sample annotation), and I'm not sure this fits the bill here, as the hub infrastructure is meant to share (individual) files. To fit your suggestion, we should share two files, one that could be loaded as a Spectra object directly (as per my message above) and a second one loaded as a data.frame, and both can be used to construct an MsExperiment.
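A rough sketch of how the two pieces could be combined on the user side. This is only an illustration under assumptions: the Spectra object is built in memory as a stand-in for whatever a dispatch class would return, and the file names (a.mzML, b.mzML) and the sample table columns (sample, raw_file) are made up.

```r
library(Spectra)
library(MsExperiment)
library(S4Vectors)

## Stand-in for spectra loaded from two shared mzML files.
## File names are hypothetical; no hub resource is accessed here.
spd <- DataFrame(msLevel = c(1L, 2L, 1L, 2L),
                 rtime = c(1.1, 1.2, 1.1, 1.2),
                 dataOrigin = c("a.mzML", "a.mzML", "b.mzML", "b.mzML"))
spd$mz <- list(c(100, 103.2), c(45.6, 120.4),
               c(100, 103.2), c(45.6, 120.4))
spd$intensity <- list(c(200, 400), c(12, 50),
                      c(210, 390), c(10, 55))
sp <- Spectra(spd)

## Stand-in for the second shared file, loaded as a table of sample annotations.
sample_annot <- DataFrame(sample = c("S1", "S2"),
                          raw_file = c("a.mzML", "b.mzML"))

## Assemble the MsExperiment and link each sample to its spectra
## by matching the raw file name against the spectra's dataOrigin.
mse <- MsExperiment()
sampleData(mse) <- sample_annot
spectra(mse) <- sp
mse <- linkSampleData(mse, with = "sampleData.raw_file = spectra.dataOrigin")
```

The linking step relies on the sample table carrying a column whose values match the dataOrigin of the spectra, which is something the data provider would have to guarantee.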

jorainer commented 1 year ago

Hm, agree - and needing two separate files would not be ideal. So, we might go for Spectra and have one Spectra object for each mzML file then?

lgatto commented 1 year ago

Yes, I think that's the basic idea - I share a file and it gets loaded automatically as the most appropriate object. If, as a developer, I want a Spectra object containing data from multiple files, it would be my job to create that file beforehand.

jorainer commented 1 year ago

yes. makes sense.

jorainer commented 1 year ago

I want a Spectra object containing data from multiple files, it would be my job to create that file beforehand.

Or simply join the Spectra objects from the individual files using c().
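A minimal sketch of that joining step, assuming each file has already been loaded as its own Spectra object. The helper make_spectra() and the file names are hypothetical and only build small in-memory objects so the example runs without any hub access.

```r
library(Spectra)
library(S4Vectors)

## Hypothetical helper: builds a tiny two-spectrum Spectra object that
## stands in for the data of one mzML file.
make_spectra <- function(origin) {
    spd <- DataFrame(msLevel = c(1L, 2L),
                     rtime = c(1.1, 1.2),
                     dataOrigin = rep(origin, 2))
    spd$mz <- list(c(100, 103.2), c(45.6, 120.4))
    spd$intensity <- list(c(200, 400), c(12, 50))
    Spectra(spd)
}

sp1 <- make_spectra("a.mzML")
sp2 <- make_spectra("b.mzML")

## Join the per-file objects into a single Spectra object.
sp_all <- c(sp1, sp2)

## dataOrigin still records which file each spectrum came from.
table(dataOrigin(sp_all))
```

Because dataOrigin is preserved, the combined object can still be split or subset per file later on.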