sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0

upgrade h5py #121

Closed teresi closed 1 year ago

teresi commented 1 year ago

DESIGN

Python 3.11 just came out so I wanted to see if we could upgrade.

One issue was that we had an old h5py (2.10 vs 3.x), so I figured we should fix that before upgrading.

FUTURE

We should probably remove scipy, since it was only required for an example and it gave me issues.

I had to install these on 22.04 but will need to try again after the h5py upgrade:

DensityData should be refactored into MergeData (ideally both would be called DensityData), but we should probably hold off on that until we refactor it. We can talk more about that later.

TESTING

(genes) [22:01 GOKU][f/upgrade_h5py]:/data/genes/TE_Density
[ins]▸$ make test
...
================================================================= 195 passed, 2 skipped, 49 warnings in 2.62 seconds =================================================================
(genes) [21:59 GOKU][f/upgrade_h5py]:/data/genes/TE_Density
[cmd]▸$
./process_genome.py ..//TE_Density_Filtered_Gene_and_TE_Annotations/Cleaned_TAIR10_GFF3_genes_main_chromosomes.tsv ../TE_Density_Filtered_Gene_and_TE_Annotations/Cleaned_TAIR10_chr_main_chromosomes.fas.mod.EDTA.TEanno.tsv adiposetoperus -n 4 --output_dir ../TE_Density_Filtered_Gene_and_TE_Annotations/results
...
subsets: 100%|██████████████████████████████████| 30/30 [04:09<00:00,  8.31s/it]
2022-10-26 22:03:58 GOKU __main__[30620] INFO process density... complete

sjteresi commented 1 year ago

Hi Michael,

Main

As we discussed over text, this commit breaks reading of old TE Density output data, so users would have to regenerate their output data if they want to interrogate it with the new DensityData class. Perhaps we should consider changing the version number. I'll leave that determination and change up to you.

Other than that, I had to add in some code to the DensityData and MergeData files to accommodate the string change. I mimicked your str decode method in the DensityData init section and applied it to the chromosome ID, and the order and superfamily TE strings. That was needed because those things were also being converted into byte strings and needed the decoder.
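
A minimal sketch of the decoding step described above, assuming the values come back from h5py 3.x as `bytes` where the old h5py returned `str`. The helper name `decode_if_bytes` and the sample values are hypothetical and only for illustration; this is not the actual code in DensityData or MergeData.

```python
def decode_if_bytes(value, encoding="utf-8"):
    """Return `value` as str, decoding only when it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical values as they might be read back from an HDF5 file:
# some already str, some still bytes.
raw_chromosome_id = b"Chr1"
raw_superfamilies = [b"Copia", b"Gypsy", "Helitron"]

chromosome_id = decode_if_bytes(raw_chromosome_id)
superfamilies = [decode_if_bytes(name) for name in raw_superfamilies]
```

Checking the type before decoding keeps the helper safe to call on values that are already `str`.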

Finally, I had to modify how we pass the chromosome ID string as an input arg to write_vlen_str_h5py in merge_data.py. The chromosome ID needed to be an iterable of strings (a list) because when it was just a plain string it was getting broken up during the byte-string operation. E.g., 'Chr1' was becoming b'C', b'h', b'r', b'1', and then DensityData would yell at me because it was treating those as multiple unique chromosomes.
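
The splitting can be reproduced without h5py: a Python `str` is itself an iterable of one-character strings, so any code that encodes each element of its input sees 'Chr1' as four items. The function `encode_each` below is a hypothetical stand-in for that per-element encoding, not the real write_vlen_str_h5py.

```python
def encode_each(strings, encoding="utf-8"):
    """Encode every element of an iterable of str to bytes."""
    return [s.encode(encoding) for s in strings]


# A bare str is iterated character by character.
bare = encode_each("Chr1")

# Wrapping the ID in a one-element list keeps it whole.
wrapped = encode_each(["Chr1"])
```

Here `bare` holds four single-character byte strings, while `wrapped` holds the single ID `b"Chr1"`.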

Future

I had to manually investigate data with DensityData to figure all of this out. So even though the tests "worked", they obscured that this update would break things. I'll look into writing more tests but may need some assistance there...

I will also begin removing scipy from the requirements.

teresi commented 1 year ago

ok, I'll take a look

since I didn't see that in our tests, we'll need that added