teresi closed this issue 1 year ago
Affirmative.
I have re-installed the dependencies and begun running a small strawberry genome on MSU's HPCC; this should be a sufficient system test. I will follow up when it is done.
Hello Michael,
I was able to run the pipeline successfully after re-installing the dependencies. However, there were some discrepancies we should address:
```
$ make test
/mnt/ufs18/rs-004/edgerpat_lab/Scotty/TE_Density/transposon/gene_data.py:112: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['Chromosome', 'Feature', 'Start', 'Stop', 'Strand', 'Length', 'Genome_ID'], dtype='object')]
```
The issue seems to come from `import_filtered_genes.py` and how it reads in some of the above columns. For the things that are essentially strings, it reads them as the `object` dtype in pandas, and apparently PyTables/h5py doesn't like that when writing the h5 files during `gene_data.write()`. I got the above warning for each chromosome of data. So I tried making the import-filtered-genes code even more explicit by having it read the string columns with `pd.StringDtype()`, but that caused the code to crash with:

```
TypeError: objects of type ``StringArray`` are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or bytes
```

I am not quite sure how to fix this; my intuition tells me the solution has something to do with how we declare the data types in the `pandas.DataFrame` before we try to write to HDF5.

We also talked about getting memory usage statistics for the jobs. I tried using the Python package procpath and had a lot of issues; we may need to talk about that over the phone.
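For reference, here is a minimal sketch of the dtype round-trip I think is happening (column names taken from the warning above; the cast-back step at the end is my assumption about a possible fix, not something I've tested against the pipeline):

```python
import pandas as pd

# Columns written by gene_data.write(), per the PerformanceWarning above.
cols = ["Chromosome", "Feature", "Start", "Stop", "Strand", "Length", "Genome_ID"]
df = pd.DataFrame(
    [["Fvb1", "gene", 100, 500, "+", 400, "strawberry"]], columns=cols
)

# String columns default to the generic 'object' dtype, which PyTables
# can only store by pickling (hence the PerformanceWarning).
assert df["Chromosome"].dtype == object

# Being explicit with pd.StringDtype() gives a StringArray, which
# PyTables rejects outright (the TypeError above).
explicit = df.astype({"Chromosome": pd.StringDtype()})
assert explicit["Chromosome"].dtype == "string"

# Possible fix (assumption): cast back to plain str/object before the
# write, and let HDF5 store fixed-width strings via
# to_hdf(..., format="table", min_itemsize=...) instead of pickling.
safe = explicit.astype({"Chromosome": str})
assert safe["Chromosome"].dtype == object
```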
However, I have had a lot of success using the seff command described here: https://hpc.nmsu.edu/discovery/slurm/job-management/#:~:text=shell%20Copied!-,The%20%22seff%22%20command(Slurm%20Job%20Efficiency%20Report),this%20might%20report%20incorrect%20information.
Here is the info for the Arabidopsis set:
```
JobID         JobName     ReqCPUS  ReqMem   AveRSS   MaxRSS   Elapsed   State      ExitCode
------------  ----------  -------  -------  -------  -------  --------  ---------  --------
63138469      TEST_Arab+  5        40G                        00:23:16  COMPLETED  0:0
63138469.ba+  batch       5                 32.24G   32.24G   00:23:16  COMPLETED  0:0
63138469.ex+  extern      5                 0        0        00:23:16  COMPLETED  0:0
```
Unfortunately this does not give real-time feedback and only works once the job dies or completes (the job needs to be finished before I can run the command). What do you think? Is this sufficient?
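If we do end up wanting real-time feedback, one option (my suggestion, not something the pipeline currently does) would be to log the process's own peak RSS from inside Python with the stdlib resource module:

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MiB.

    ru_maxrss is reported in kilobytes on Linux but bytes on macOS,
    so normalize per-platform.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024

# Example: allocate ~80 MiB and watch the peak grow.
before = peak_rss_mb()
blob = bytearray(80 * 1024 * 1024)  # zero-filled, so the pages are touched
after = peak_rss_mb()
print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

This could be printed periodically from the worker processes; it won't match seff exactly (seff aggregates over the whole job step), but it gives a live lower bound while the job is still running.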
as far as pandas / strings go, we aren't sending pandas objects between processes, so I don't expect we'll hit performance issues w/ pickle (the objects are only pickled when sent to another process). and even so, that would only happen a few times if it was getting sent over, so we can address that in another issue
for the HPC / squeue command, that should be sufficient for now
is that the result for this branch? would it be easy to run the test on the previous commit too? I don't expect it will change much of the memory usage but it would be interesting to see
testing
- make sure to reinstall the dependencies first
- if you have a short job you can submit, that would be helpful to double check it's all good