sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0

upgrade dependencies #117

Closed teresi closed 1 year ago

teresi commented 1 year ago

testing

make sure to reinstall the dependencies first

$ make test
...
195 passed, 2 skipped, 50 warnings
$ ./process_genome.py ..//TE_Density_Filtered_Gene_and_TE_Annotations/Cleaned_TAIR10_GFF3_genes_main_chromosomes.tsv ../TE_Density_Filtered_Gene_and_TE_Annotations/Cleaned_TAIR10_chr_main_chromosomes.fas.mod.EDTA.TEanno.tsv adiposetoperus -n 2 --output_dir ../TE_Density_Filtered_Gene_and_TE_Annotations/results
...
subsets: 100%|██████████████████████████████████| 30/30 [09:57<00:00, 19.90s/it]
2022-09-17 11:32:25 GOKU __main__[18223] INFO process density... complete

if you have a short job you can submit, that would be helpful to double-check that it's all good

sjteresi commented 1 year ago

Affirmative.

I have re-installed the dependencies and begun running a small strawberry genome on MSU's HPCC; this should be a sufficient system test. I will follow up when it is done.

sjteresi commented 1 year ago

Hello Michael,

I re-installed the dependencies and was able to run the pipeline successfully. However, there were some discrepancies we should address:

sjteresi commented 1 year ago

We talked about getting memory usage statistics for the jobs. I tried using the Python package procpath and had a lot of issues; we may need to talk about that over the phone.

However I have had a lot of success using these methods: https://hpc.nmsu.edu/discovery/slurm/job-management/#:~:text=shell%20Copied!-,The%20%22seff%22%20command(Slurm%20Job%20Efficiency%20Report),this%20might%20report%20incorrect%20information.

Here is the info for the Arabidopsis set:

JobID           JobName  ReqCPUS     ReqMem     AveRSS     MaxRSS    Elapsed                State ExitCode 
------------ ---------- -------- ---------- ---------- ---------- ---------- -------------------- -------- 
63138469     TEST_Arab+        5        40G                         00:23:16            COMPLETED      0:0 
63138469.ba+      batch        5                32.24G     32.24G   00:23:16            COMPLETED      0:0 
63138469.ex+     extern        5                     0          0   00:23:16            COMPLETED      0:0

Unfortunately, this does not give real-time feedback: it only works once the job has died or completed, because I need the job to be finished before I can run the command. What do you think? Is this sufficient?
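If it helps to compare runs, the table above can be turned into numbers with a few lines of Python. This is only a sketch: it assumes the fixed-width layout shown above (a header line, then a ruler line of dashes that marks the column widths), and the helper names `parse_fixed_width` and `to_g` are made up for illustration. The sample text is transcribed from the table above, including the job names as truncated in that output.

```python
import re

# Transcribed from the job accounting table above; names are truncated as shown there.
ACCOUNTING_OUTPUT = """\
JobID           JobName  ReqCPUS     ReqMem     AveRSS     MaxRSS    Elapsed                State ExitCode
------------ ---------- -------- ---------- ---------- ---------- ---------- -------------------- --------
63138469     TEST_Arab+        5        40G                         00:23:16            COMPLETED      0:0
63138469.ba+      batch        5                32.24G     32.24G   00:23:16            COMPLETED      0:0
63138469.ex+     extern        5                     0          0   00:23:16            COMPLETED      0:0
"""

# Assumption: suffixes are powers of 1024 relative to "G".
UNIT_TO_G = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def parse_fixed_width(text):
    """Return one dict per data row, slicing columns by the ruler line of dashes."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, ruler, rows = lines[0], lines[1], lines[2:]
    # Each run of dashes in the ruler gives the (start, end) of one column.
    spans = [m.span() for m in re.finditer(r"-+", ruler)]
    names = [header[a:b].strip() for a, b in spans]
    return [{n: row[a:b].strip() for n, (a, b) in zip(names, spans)} for row in rows]


def to_g(value):
    """Convert a value such as '32.24G' to a float in G; blank fields become None."""
    if not value:
        return None
    if value[-1] in UNIT_TO_G:
        return float(value[:-1]) * UNIT_TO_G[value[-1]]
    # Assumption: a bare number (only "0" appears above) is a byte count.
    return float(value) / 1024**3


records = parse_fixed_width(ACCOUNTING_OUTPUT)
requested = to_g(records[0]["ReqMem"])
peak = max(to_g(r["MaxRSS"]) or 0.0 for r in records)
print(f"peak {peak:.2f}G of {requested:.0f}G requested ({peak / requested:.0%})")
```

Slicing by the ruler line instead of splitting on whitespace keeps blank fields (such as `AveRSS` on the first row) from shifting the later columns.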

teresi commented 1 year ago

as far as pandas / strings go, we aren't sending pandas objects between processes, so I don't expect we'll hit performance issues w/ pickle (objects are pickled when they are sent to another process). even if they were being sent over, that would only happen a few times, so we can address that in another issue
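for a rough feel of what pickling costs, here is a minimal sketch using only the standard library. the list of strings is a hypothetical stand-in (it is not data or code from TE_Density), and the timing will vary by machine, so treat the numbers as a ballpark only

```python
import pickle
import time

# Hypothetical stand-in for a column of gene names; not taken from TE_Density.
names = [f"gene_{i:06d}" for i in range(100_000)]

# Serialize the same way an object would be when handed to another process.
start = time.perf_counter()
payload = pickle.dumps(names, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.perf_counter() - start

# Round-trip to confirm nothing is lost in serialization.
restored = pickle.loads(payload)
print(f"{len(payload) / 1e6:.2f} MB pickled in {elapsed * 1e3:.1f} ms")
```

if an object like this only crossed a process boundary a few times per run, a one-off cost of this kind would be paid a few times in total, which fits the point above about deferring it to another issue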

for the HPC / squeue command that should be sufficient for now

is that the result for this branch? would it be easy to run the test on the previous commit too? I don't expect the memory usage to change much, but it would be interesting to see

sjteresi commented 1 year ago