Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
I ran the Arabidopsis genome set (the same one you did above) and got the following warning:
/mnt/ufs18/rs-004/edgerpat_lab/Scotty/TE_Density/transposon/gene_data.py:112: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['Chromosome', 'Feature', 'Start', 'Stop', 'Strand', 'Length', 'Genome_ID'],dtype='object')]
Basically the warning comes from import_filtered_genes.py and how it reads in some of the above columns. For the things that are essentially strings, it reads them as the object dtype in pandas, and apparently PyTables/h5py doesn't like that when writing the H5 files during gene_data.write(). I got the above warning for each chromosome of data. So I tried making the import_filtered_genes code even more explicit and had it read the string columns with pd.StringDtype(), and that caused the code to crash with: TypeError: objects of type StringArray are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or bytes. I am not quite sure how to fix this; my intuition tells me that the solution must have something to do with how we declare the data types in the pandas.DataFrame before we try to write to HDF5.
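Something like the following is what I have in mind by "declaring the data types" — just a rough sketch, not the actual import_filtered_genes.py code; the filename, separator, and numeric dtypes are placeholders/guesses:

```python
import pandas as pd

# Sketch only: read the cleaned gene annotation with explicit dtypes so the
# coordinate columns come in as numbers and only the truly-string columns are
# left as plain Python str (i.e. pandas object dtype). Plain str columns are
# something PyTables can at least pickle; the StringArray extension type from
# pd.StringDtype() is what triggers the TypeError above.
dtype_map = {
    "Chromosome": str,
    "Feature": str,
    "Strand": str,
    "Genome_ID": str,
    "Start": "float64",   # guessing float; int64 would also work if there are no NaNs
    "Stop": "float64",
    "Length": "float64",
}

genes = pd.read_csv("filtered_genes.tsv", sep="\t", dtype=dtype_map)  # placeholder path

# If the columns were already read with pd.StringDtype(), cast them back to
# object before handing the frame to gene_data.write():
string_cols = ["Chromosome", "Feature", "Strand", "Genome_ID"]
genes[string_cols] = genes[string_cols].astype(object)
```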
TLDR: PyTables gives a performance warning when writing the cleaned gene data to HDF5 because it doesn't like the pandas object dtype. I tried setting alternative string dtypes in the pandas DataFrame and that didn't work either. This may not be a big issue, because it is merely a performance warning and we only do the write once per chromosome.
Copied from PR #117
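One other thought: if the pickling of object columns in the default 'fixed' HDF5 format is the actual source of the warning, writing in PyTables' 'table' format stores string columns as fixed-width strings instead of pickling them. A rough sketch, assuming gene_data.write() goes through pandas' to_hdf (the path and key are placeholders, not what the code actually uses):

```python
import pandas as pd

genes = pd.read_csv("filtered_genes.tsv", sep="\t")  # placeholder, as in the sketch above

# 'table' format serializes string/object columns as fixed-width strings
# rather than pickling them, so the PerformanceWarning goes away. The
# trade-off is that string widths are fixed at write time; min_itemsize can
# reserve extra room if rows are appended later.
genes.to_hdf("genes.h5", key="gene_data", mode="w", format="table")
```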