teresi closed this issue 1 year ago
Affirmative.
I have re-installed the dependencies and begun running a small strawberry genome on MSU's HPCC; this should be a sufficient system test. I will follow up when it is done.
Hello Michael,
I was able to run the pipeline successfully after re-installing the dependencies. However, there were some discrepancies we should address:
```
$ make test
/mnt/ufs18/rs-004/edgerpat_lab/Scotty/TE_Density/transposon/gene_data.py:112: PerformanceWarning: your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['Chromosome', 'Feature', 'Start', 'Stop', 'Strand', 'Length', 'Genome_ID'], dtype='object')]
```
The issue seems to come from `import_filtered_genes.py` and how it reads in some of the above columns. For the things that are essentially strings, it reads them as the `object` dtype in pandas, and apparently PyTables/h5py doesn't like that when writing the h5 files during `gene_data.write()`. I got the above warning for each chromosome of data. So I tried making the import-filtered-genes code even more explicit by having it read the string columns with `pd.StringDtype()`, but that caused the code to crash with:

```
TypeError: objects of type ``StringArray`` are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or bytes
```

I am not quite sure how to fix this; my intuition tells me the solution has something to do with how we declare the data types in the `pandas.DataFrame` before we try to write to HDF5.

We also talked about getting memory usage statistics for the jobs. I tried using the Python package procpath and had a lot of issues; we may need to talk about that over the phone.
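For reference, here is a minimal sketch of the dtype round-trip I think is happening (column names taken from the warning above; the cast-back step at the end is my assumption about a possible fix, not something I've tested against the pipeline):

```python
import pandas as pd

# Columns written by gene_data.write(), per the PerformanceWarning above.
cols = ["Chromosome", "Feature", "Start", "Stop", "Strand", "Length", "Genome_ID"]
df = pd.DataFrame(
    [["Fvb1", "gene", 100, 500, "+", 400, "strawberry"]], columns=cols
)

# String columns default to the generic 'object' dtype, which PyTables
# can only store by pickling (hence the PerformanceWarning).
assert df["Chromosome"].dtype == object

# Being explicit with pd.StringDtype() gives a StringArray, which
# PyTables rejects outright (the TypeError above).
explicit = df.astype({"Chromosome": pd.StringDtype()})
assert explicit["Chromosome"].dtype == "string"

# Possible fix (assumption): cast back to plain str/object before the
# write, and let HDF5 store fixed-width strings via
# to_hdf(..., format="table", min_itemsize=...) instead of pickling.
safe = explicit.astype({"Chromosome": str})
assert safe["Chromosome"].dtype == object
```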
However, I have had a lot of success using the seff command described here: https://hpc.nmsu.edu/discovery/slurm/job-management/#:~:text=shell%20Copied!-,The%20%22seff%22%20command(Slurm%20Job%20Efficiency%20Report),this%20might%20report%20incorrect%20information.
Here is the info for the Arabidopsis set:
```
JobID         JobName     ReqCPUS  ReqMem   AveRSS   MaxRSS   Elapsed   State      ExitCode
------------  ----------  -------  -------  -------  -------  --------  ---------  --------
63138469      TEST_Arab+  5        40G                        00:23:16  COMPLETED  0:0
63138469.ba+  batch       5                 32.24G   32.24G   00:23:16  COMPLETED  0:0
63138469.ex+  extern      5                 0        0        00:23:16  COMPLETED  0:0
```
Unfortunately this does not give real-time feedback and only works once the job dies or completes (the job needs to be finished before I can run the command). What do you think? Is this sufficient?
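If we do end up wanting real-time feedback, one option (my suggestion, not something the pipeline currently does) would be to log the process's own peak RSS from inside Python with the stdlib resource module:

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MiB.

    ru_maxrss is reported in kilobytes on Linux but bytes on macOS,
    so normalize per-platform.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024

# Example: allocate ~80 MiB and watch the peak grow.
before = peak_rss_mb()
blob = bytearray(80 * 1024 * 1024)  # zero-filled, so the pages are touched
after = peak_rss_mb()
print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

This could be printed periodically from the worker processes; it won't match seff exactly (seff aggregates over the whole job step), but it gives a live lower bound while the job is still running.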
as far as pandas / strings go, we aren't sending pandas objects between processes, so I don't expect we'll hit performance issues w/ pickle (the objects are only pickled when sent to another process). and even so, that would only happen a few times if it was getting sent over, so we can address that in another issue
for the HPC / squeue command, that should be sufficient for now
is that the result for this branch? would it be easy to run the test on the previous commit too? I don't expect it will change much of the memory usage but it would be interesting to see
testing
- make sure to reinstall the dependencies first
- if you have a short job you can submit, that would be helpful to double check it's all good