sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0

Pytables Pickle warning for gene data #118

Closed sjteresi closed 1 year ago

sjteresi commented 1 year ago

Copied from PR #117

I ran the Arabidopsis genome set (the same one you did above) and got the following warning: /mnt/ufs18/rs-004/edgerpat_lab/Scotty/TE_Density/transposon/gene_data.py:112: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['Chromosome', 'Feature', 'Start', 'Stop', 'Strand', 'Length', 'Genome_ID'],dtype='object')]

The warning comes from import_filtered_genes.py and how it reads some of the above columns. Columns that are essentially strings are read as the pandas object dtype, and apparently PyTables/h5py doesn't like that when writing the h5 files during gene_data.write(). I got the above warning for each chromosome of data. So I tried making the import_filtered_genes code more explicit by reading the string columns with pd.StringDtype(), and that caused the code to crash with: TypeError: objects of type StringArray are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or bytes. I am not quite sure how to fix this. My intuition is that the solution has to do with how we declare the data types in the pandas.DataFrame before we write to HDF5.

TLDR: PyTables gives a performance warning when writing the cleaned gene data to HDF5 because it doesn't like the pandas object dtype. I tried setting alternative string dtypes in the pandas DataFrame and that didn't work either. This may not be a big issue because it is only a performance warning and we write this data just once per chromosome.
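
One detail in the warning may be worth a look: its items list includes Start, Stop, and Length next to the string columns, which suggests the numeric columns also ended up in the object block and made its inferred type mixed. The sketch below is a minimal illustration of that idea and has not been run against TE_Density. It casts the numeric columns to int64 and converts the text columns back to plain object strings, so no pd.StringDtype() column is left to trigger the StringArray error. The helper names coerce_gene_dtypes and object_block_kind, the column groupings, and the example rows are all assumptions made for illustration; only the column names come from the warning text.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Assumed groupings; the column names themselves come from the warning text.
STRING_COLUMNS = ["Chromosome", "Feature", "Strand", "Genome_ID"]
INTEGER_COLUMNS = ["Start", "Stop", "Length"]


def coerce_gene_dtypes(genes: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer columns as int64 and text columns as plain str objects."""
    out = genes.copy()
    for col in INTEGER_COLUMNS:
        # Numeric columns leave the object block entirely.
        out[col] = pd.to_numeric(out[col]).astype("int64")
    for col in STRING_COLUMNS:
        # Plain Python strings in an object column, not a StringArray.
        out[col] = out[col].astype(str).astype(object)
    return out


def object_block_kind(genes: pd.DataFrame) -> str:
    """Infer the element type of all object-dtype columns taken together."""
    obj = genes.select_dtypes(include=[object])
    if obj.shape[1] == 0:
        return "empty"
    return infer_dtype(obj.to_numpy().ravel(), skipna=False)


# Made-up rows for illustration only; every column starts out as object dtype.
raw = pd.DataFrame(
    {
        "Chromosome": ["Chr1", "Chr1"],
        "Feature": ["gene", "gene"],
        "Start": [1000, 5000],
        "Stop": [1999, 5499],
        "Strand": ["+", "-"],
        "Length": [1000, 500],
        "Genome_ID": ["Example", "Example"],
    },
    dtype=object,
)
clean = coerce_gene_dtypes(raw)

print(object_block_kind(raw))    # kind of the object block before coercion
print(object_block_kind(clean))  # kind of the object block after coercion
print(clean.dtypes)
```

If object_block_kind reports "string" after coercion, the remaining object columns hold only text. As far as I can tell, pandas raises this PerformanceWarning when that inferred type is anything other than "string", so this should quiet it, but that is worth confirming on a real chromosome before relying on it.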

sjteresi commented 1 year ago

@teresi Seems like this was addressed in the linked PR. I am closing this issue.