feature request: documentation and simple example for writing recipe classes for other types of datasets

lila commented 2 years ago

I'm interested in using pangeo-forge for other types of datasets, specifically genomic datasets. The current set of recipe classes are focused on climate-oriented data formats (netcdf, hdf, zarr). It would be useful to

1) have more documentation and a simple example of how to write custom recipe classes to support other datasets

2) include a simple null-transformation recipe class that simply copies the filesets into the storage without any transformation or rechunking.

As an example, consider the 1000 genomes datasets available https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/ and if you browse around there, you will find files that look like:

HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam

here, HG00096 represents an individual, chrom20 is chromosome 20, etc. So this concept of FilePatterns that you have with climate data is relevant for this dataset as well, but is a string value not a numerical value like day or date.

You also have this notion of "rechunking" the data to be cloud-optimized, and that is also relevant for genome data. For example, with read files you get out of the sequencers, it is very common to repartition them based on the type of analysis you want to do and in particular the degree of parallelism you run. So a full human genome read file may be 100GiB, but if you are running this with 100 way parallelism, you might rechunk it into 1GiB sizes. And the rechunking, similar to what you do with netcdf/hdf/zarr, needs to be data structure aware -- you don't want to split individual data structures as that would invalidate the formats.

So there is value to the FilePatterns and the rechunking. But perhaps a simple first step would be to have a recipe that does no transformation at all. With a specific FilePattern, the bakery would then just simply copy the files from the source to the target and add the metadata, provenance etc.

It would be ideal if such a Recipe class existed, and that is my feature request.

Alternatively, if there was more documentation on how to create Recipe classes, that would enable more advanced transformations for other domains.

Any pointers to docs or code are appreciated...

k

martindurant commented 2 years ago

How do you currently access these BAM files from python? google suggests pysam.

An extremely quick glance at the format suggests that the files are indeed internally chunked in "aligned sequences" of size ~100MB, so ideal for our purposes. Do you suppose that the data might be representable in the netCDF/zarr data model, or do you need something completely different? I also see that sometimes a BAM file has a corresponding "index" that might be something to similar to what kerchunk does, and would be very useful.

lila commented 2 years ago

you are correct, the BAM files represent the alignment of reads to a reference. Those are numerical. However, the data you get out of sequencers are FASTA or FASTQ files. FASTA files look like this (for protein sequences):

>gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK

for DNA sequences a FASTQ looks like:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

the first line is the sequence (ATGC) and the second line is a numerical measure of the quality of the base call.

One of the challenges in changing the format of the files would be that the existing tools read FASTA/FASTQ/BAM files. So my thought is just to publish as is. hence the request for a Null transform recipe class.

k

martindurant commented 2 years ago

Yes, it might make sense to have a simple pass-though recipe, which is then just a bulk copy. I leave that to other pangeo-forge admins. However, you also mentioned parallelising over many inputs, and that's where chunks and kerchunk could come in.

rabernat commented 2 years ago

Thanks for the issue @lila! It should be trivial to implement the sort of recipe you have in mind. I think we can implement this within a few weeks.

I also recently learned about skyplane, which seems to be an amazing new tool for moving files around the cloud.

@martindurant is right--if it is possible to represent this data as a virtual Zarr dataset, it will be a huge win. Have you seen sgkit? Is there any overlap in terms of data structure and format?

martindurant commented 2 years ago

(I posted a "hello" to skyplane at https://github.com/skyplane-project/skyplane/issues/623 )

lila commented 2 years ago

I'm happy to write it and contribute as a pr, if there are any docs regarding the interface between recipes and the bakery, please point them out. i'll poke around and see what i can do.

k

pangeo-forge / pangeo-forge-recipes

feature request: documentation and simple example for writing recipe classes for other types of datasets #422