sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0

Early h5py methods #7

Closed sjteresi closed 4 years ago

sjteresi commented 4 years ago

I wrote some methods for writing and reading HDF5 files. I thought it would be good to accept a pandas DataFrame as the input (and use pandas' own methods for working with HDF5), since we are already constructing and wrapping DataFrames. If you don't like that, I can look into using the NumPy or h5py methods for writing and reading HDF5 files instead.

I wrote some docstrings for the read and write functions, but they are a little verbose. I can clarify if needed, but essentially, when you write to an HDF5 file you need to give each chunk of data a key (an identifier) within that file. This is the "key" keyword argument of the read and write functions.
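To make the role of the key concrete, here is a minimal sketch using pandas' own HDF5 calls (`DataFrame.to_hdf` and `pandas.read_hdf`, which need the PyTables package installed). The helper names `store_chunks` and `load_chunk` are made up for illustration and are not the functions in this repo.

```python
import pandas as pd


def store_chunks(path, chunks):
    """Write each DataFrame in `chunks` (a dict of key -> frame) to one HDF5 file."""
    for i, (key, frame) in enumerate(chunks.items()):
        # The first write starts a fresh file ("w"); later writes append ("a"),
        # so every frame ends up in the same file under its own key.
        frame.to_hdf(path, key=key, mode="w" if i == 0 else "a")


def load_chunk(path, key):
    """Read back only the chunk that was stored under `key`."""
    return pd.read_hdf(path, key=key)
```

With this, something like `store_chunks("density.h5", {"Chr1": df1, "Chr2": df2})` followed by `load_chunk("density.h5", "Chr2")` should return only the second frame, without loading the first.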

I propose using the chromosome name as a string (and potentially the window size too) as the key, since I think that is the best identifier for a dataset. To start working towards that, I added an attribute called "chrom_of_the_subset" in the `__init__` methods of TransposonData and GeneData. It returns the single chromosome of that subset as a string, and later we can use that string to label or access that specific chunk of data in the HDF5 file.
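A rough sketch of how that key could be built. Only the name "chrom_of_the_subset" comes from the code; here it is written as a free function rather than an attribute, and the `hdf5_key` helper and the `_w500` suffix format are assumptions for the sake of the example.

```python
from typing import Iterable, Optional


def chrom_of_the_subset(chromosomes: Iterable[str]) -> str:
    """Return the one chromosome label shared by every row of a subset."""
    unique = sorted(set(chromosomes))
    if len(unique) != 1:
        # A subset is supposed to cover exactly one chromosome.
        raise ValueError(f"expected exactly one chromosome, found {unique}")
    return unique[0]


def hdf5_key(chromosome: str, window: Optional[int] = None) -> str:
    """Build the HDF5 key from the chromosome and, optionally, the window size."""
    if window is None:
        return chromosome
    return f"{chromosome}_w{window}"  # hypothetical naming scheme


key = hdf5_key(chrom_of_the_subset(["Chr1", "Chr1", "Chr1"]), window=500)
print(key)  # Chr1_w500
```

Raising on a mixed subset means a mislabeled chunk fails loudly instead of being written under the wrong key.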

Finally, I don't really like that the "read" function is a staticmethod while the "write" function is an instance method, because that makes their usage syntax a teensy bit different. However, I think this is the clearest arrangement. You may want to change this; let me know what you think.
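For reference, the shape I am describing looks roughly like this. It is a stripped-down stand-in, not the real GeneData class: the `data_frame` attribute, the argument names, and the default key are placeholders, and the pandas HDF5 calls again assume PyTables is installed.

```python
import pandas as pd


class GeneData:
    """Simplified stand-in for the wrapper class; only the I/O shape matters."""

    def __init__(self, frame: pd.DataFrame):
        self.data_frame = frame

    def write(self, filename: str, key: str = "default") -> None:
        # Instance method: it needs the frame held by this particular object.
        self.data_frame.to_hdf(filename, key=key, mode="a")

    @staticmethod
    def read(filename: str, key: str = "default") -> "GeneData":
        # Static method: no instance exists yet, so it builds and returns one.
        return GeneData(pd.read_hdf(filename, key=key))
```

So writing is `genes.write("genes.h5", key="Chr1")` on an existing object, while reading is `GeneData.read("genes.h5", key="Chr1")` on the class itself. A classmethod that returns `cls(...)` would be one alternative if we ever subclass these wrappers, but the call syntax would stay the same.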