New AnnotationHubDispatchClassList?

lgatto commented 1 year ago

When creating an ExperimentHub package, it is possible to define dispatch classes, so that some file types can be loaded automatically and returned as predefined objects. See AnnotationHub::DispatchClassList():

DispatchClass	Reader
FaFile	Rsamtools::FaFile(); requires rtracklayer
BamFile	Rsamtools::BamFile(); requires rtracklayer
Rds	readRDS()
RDS	readRDS()
Rda	get(load())
data.frame	get(load())
GRanges	get(load()); requires GenomicRanges
VCF	get(load()); requires VariantAnnotation
ChainFile	rtracklayer::import.chain(); requires rtracklayer and GenomeInfoDb; before using import.chain internally uses gzfile and writeBin to extract data from file; files saved as chain.gz
TwoBitFile	rtracklayer::TwoBitFile(); requires rtracklayer
GFFFile	rtracklayer::import(); require rtracklayer and GenomeInfoDb; after import converts to GRanges object
GFF3File	rtracklayer::import(); require rtracklayer
BigWig	rtracklayer::BigWigFile(); require rtracklayer
dbSNPVCFFile	VariantAnnotation::VcfFile(); require VariantAnnotation; files saved as vcf.gz and vcf.gz.tbi
SQLiteFile	AnnotationDbi::loadDb(); files saved as sqlite
GRASP	dbFileConnect()
Zip	unzip(); returns file path to files
ChEA	unzip(); returns data.frame from reading chea-background.csv
BioPax	get(load()); require rBiopaxParser
Pazar	read.delim(); require GenomicRanges; reads specific columns from file and converts to GRanges object
CSVtoGranges	read.csv(); require GenomicRanges; converts data.frame to GRanges object
ExpressionSet	get(load()); require Biobase
GDS	gdsfmt::openfn.gds(); require gdsfmt
H5File	require rhdf5; resource downloaded but not loaded; returns file path
FilePath	resource downloaded but not loaded; returns file path
BEDFile	rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
UCSCBroadPeak	rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
UCSCNarrowPeak	rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
UCSCBEDRnaElements	rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
UCSCGappedPeak	rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
EpiMetadata	read.delim()
EpiExpressionText	read.table(); converts to SummarizedExperiment object
EpichmmModels	rtracklayer::import(); calls additional helper AnnotationHub:::.mapAbbr2FullName and then converts to GRanges object; file assumed to be bed file format
EpigenomeRoadmapFile	rtracklayer::import(); converts to GRanges object; file assumed to be bed file format
EpigenomeRoadmapNarrowAllPeaks	rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
EpigenomeRoadmapNarrowFDR	rtracklayer::import(rtracklayer::BEDFile()); require rtracklayer; converts to GRanges object
EnsDb	ensembldb::EnsDb(); require ensembldb
mzRpwiz	mzR::openMSfile(); require mzR
mzRident	mzR::openIDfile(); require mzR
MSnSet	get(load()); require MSnbase
AAStringSet	Biostrings::readAAStringSet(); require Biostrings
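To make the idea behind the table concrete, here is a minimal sketch of a dispatch-class lookup in base R. It is not AnnotationHub's implementation: `readers` and `load_resource()` are hypothetical names, and only three of the classes listed above (Rds, Rda, FilePath) are mimicked.

```r
# Hypothetical dispatch table: maps a dispatch class name to a reader function.
# This only illustrates the concept; it is NOT how AnnotationHub is implemented.
readers <- list(
    Rds = function(path) readRDS(path),
    Rda = function(path) {
        # Equivalent of get(load()): load into a private environment and
        # return the object (assumes the file holds a single object).
        env <- new.env()
        get(load(path, envir = env), envir = env)
    },
    FilePath = function(path) path  # resource is not loaded, path is returned
)

# Look up the reader for a dispatch class and apply it to a local file.
load_resource <- function(path, dispatch_class) {
    reader <- readers[[dispatch_class]]
    if (is.null(reader))
        stop("no reader registered for dispatch class: ", dispatch_class)
    reader(path)
}

# Usage: the same file returned as an object or as a plain path.
rds_file <- tempfile(fileext = ".rds")
saveRDS(data.frame(mz = c(100.1, 200.2)), rds_file)
obj <- load_resource(rds_file, "Rds")       # a data.frame
pth <- load_resource(rds_file, "FilePath")  # the file path itself
```

A new dispatch class such as the Spectra one discussed in this thread would amount to one more entry of this kind, with the reader living in the hub infrastructure rather than in user code.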

For MsDataHub, we want "FilePath", as we want to get the file path and then load the data ourselves. We could also directly get the desired object, for example a Spectra object created by Spectra() if the file is an mzML.
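As a rough sketch of the "FilePath, then load it ourselves" route: once the hub has returned a local path to an mzML file, a Spectra object can be created from it. The helper name `load_as_spectra()` is hypothetical, and the example file from the msdata package only stands in for a path that a hub resource would return; Spectra, mzR and msdata are assumed to be installed.

```r
# Hypothetical helper: turn a local mzML file path into a Spectra object.
# The MsBackendMzR backend reads the peak data from the file on demand.
load_as_spectra <- function(path) {
    stopifnot(file.exists(path))
    Spectra::Spectra(path, source = Spectra::MsBackendMzR())
}

# Only run the demo if the required packages are available.
if (requireNamespace("Spectra", quietly = TRUE) &&
    requireNamespace("mzR", quietly = TRUE) &&
    requireNamespace("msdata", quietly = TRUE)) {
    # Assumption: this local file stands in for a "FilePath" hub resource.
    fl <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)[1]
    sp <- load_as_spectra(fl)
    print(sp)
}
```

A dedicated Spectra dispatch class would essentially move this call into the hub infrastructure, so that the user gets `sp` directly.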

Should we ask to add Spectra (and possibly others such as PSM for mzid files) to the default dispatch classes?

Ping @jorainer

jorainer commented 1 year ago

Or should we go directly for MsExperiment instead? IMO a Spectra object without sample information might not be too useful.

lgatto commented 1 year ago

I suppose you refer to this issue.

But we can't necessarily anticipate what the developer is sharing their data for. And your suggestion requires two inputs (an mzML file and the sample annotation), and I'm not sure this fits the bill here, as the hub infrastructure is meant to share (individual) files. To fit your suggestion, we should share two files, one that could be loaded as a Spectra object directly (as per my message above) and a second one loaded as a data.frame, and both can be used to construct an MsExperiment.

jorainer commented 1 year ago

Hm, agree - and needing two separate files would not be ideal. So, we might go for Spectra and have one Spectra object for each mzML file then?

lgatto commented 1 year ago

Yes, I think that's the basic idea - I share a file and it gets loaded automatically as the most suitable object. If, as a developer, I want a Spectra object containing data from multiple files, it would be my job to create that file beforehand.

jorainer commented 1 year ago

yes. makes sense.

jorainer commented 1 year ago

I want a Spectra object containing data from multiple files, it would be my job to create that file beforehand.

Or simply join the Spectra objects from the individual files using c().
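A minimal sketch of that approach, assuming one Spectra object is created per mzML file and the results are then concatenated. `join_spectra()` is a hypothetical helper, and the msdata example files stand in for files retrieved from the hub; Spectra, mzR and msdata are assumed to be installed.

```r
# Hypothetical helper: build one Spectra object per file, then join them.
join_spectra <- function(paths) {
    per_file <- lapply(paths, function(p) {
        Spectra::Spectra(p, source = Spectra::MsBackendMzR())
    })
    # c() concatenates Spectra objects into a single Spectra object.
    do.call(c, per_file)
}

# Only run the demo if the required packages are available.
if (requireNamespace("Spectra", quietly = TRUE) &&
    requireNamespace("mzR", quietly = TRUE) &&
    requireNamespace("msdata", quietly = TRUE)) {
    # Assumption: these local files stand in for individual hub resources.
    fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
    sp_all <- join_spectra(fls)
    print(length(sp_all))  # total number of spectra across all files
}
```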

rformassspectrometry / MsDataHub

New AnnotationHubDispatchClassList? #6